<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>大邓和他的PYTHON</title>
    <link>https://textdata.cn/</link>
    <description>Recent content on 大邓和他的PYTHON</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>zh-cn</language>
    <lastBuildDate>Fri, 12 Sep 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://textdata.cn/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LIST | 社科(经管)数据挖掘文献资料汇总</title>
      <link>https://textdata.cn/blog/the_text_analysis_list_about_ms/</link>
      <pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/the_text_analysis_list_about_ms/</guid>
      <description>如何从网络世界中高效地采集数据？能否从文本中挖掘出人类的偏见等认知信息？如何从杂乱的文本数据中抽取文本信息(变量)？本文汇总的列表将让你对文本、对Python文本分析有一个全面的了解</description>
      <content:encoded><![CDATA[<p>个人感觉博客 <strong><a href="https://textdata.cn/">textdata.cn</a></strong> 精华就在这里了。 不定期更新， 内容聚焦于Python文本分析在经管、社科等领域的应用。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 营销
- 会计学
- 经济学
- 心理学
- 社会学
- ...
</code></pre></div><p>读几篇文章能加深对各领域文本分析方法应用的理解。</p>
<br>
<h2 id="管理学">管理学</h2>
<ul>
<li><a href="https://textdata.cn/blog/read_this_you_will_know_what_is_text_mining/">读完本文你就了解什么是文本分析</a></li>
<li><a href="https://textdata.cn/blog/2023-11-05-xjtu-text-mining-in-ms/">视频2023 | 文本分析在经济管理研究中的应用</a></li>
<li><a href="https://textdata.cn/blog/2022-09-08-dufe-text-mining-in-ms/">视频2022 | 文本分析在经济管理研究中的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
<li><a href="https://textdata.cn/blog/2023-10-11-how-can-machine-learning-empower-management-research/">管理世界 | 机器学习如何赋能管理学研究？——国内外前沿综述和未来展望</a></li>
<li><a href="https://textdata.cn/blog/2023-04-08-measurement_of_psychological_factors_and_their_economic_impact/">管理世界 | 政府与市场心理因素的经济影响及其测度</a></li>
<li><a href="https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/">MS2022 | 使用语言差异性测量 <strong>团队认知差异性</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-10-10-measure-the-speed-of-policy-diffusion-from-top-to-down/">管理科学学报 | 使用LDA算法计算政策扩散速度与扩散程度</a></li>
<li><a href="https://textdata.cn/blog/2022-09-07-management-science-disrupt-science-and-technology">Management Science | 使用网络算法识别创新的颠覆性与否</a></li>
<li><a href="https://textdata.cn/blog/2024-08-02-automating-grounded-theory-development-in-qualitative-research-with-large-language-models/">arXiv2024 | 使用大语言模型自动进行定性研究中的扎根理论开发</a></li>
<li><a href="https://textdata.cn/blog/research_with_tm_in_chinese_top_ms_journal/">近年《管理世界》《管理科学学报》使用文本分析论文</a></li>
</ul>
<p><br><br></p>
<h2 id="营销">营销</h2>
<ul>
<li><a href="https://textdata.cn/blog/text_mining_in_marketing_research/">文本分析在市场营销研究中的应用</a></li>
<li><a href="https://textdata.cn/blog/jcr_concreteness_computation/">JCR2021 | 计算文本的 <strong>语言具体性</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-10-16-measurement-of-consumer-certainty-in-language/">JMR2023 | 测量消费者的 <strong>语言确定性</strong></a></li>
<li><a href="https://textdata.cn/blog/2022-12-03-scraping-web-data-for-marketing-insights/"><strong>JM2022 | 梳理营销领域使用网络爬虫技术的研究</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-12-semantic-brand-score/">JBR2018  | <strong>语义品牌评分(Semantic Brand Score)</strong></a></li>
<li><a href="https://textdata.cn/blog/automate_text_analysis_in_market/">营销研究中文本分析应用概述(含案例及代码)</a></li>
</ul>
<p><br><br></p>
<h2 id="会计金融">会计&amp;金融</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-04-19-ai-improve-firm-productivity/">管理世界2024 | 使用管理层讨论与分析测量「<strong>企业人工智能指标</strong>」</a></li>
<li><a href="https://textdata.cn/blog/text_mining_in_2021_management_world/">管理世界| 使用文本分析&amp;机器学习测量 「<strong>短视主义</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2022-11-03-mda-measure-digitalization/">管理世界 | 使用 经营讨论与分析测量 「<strong>企业数字化</strong>」</a></li>
<li><a href="https://textdata.cn/blog/manager_tone_analysis_with_lm/">管理世界 | 使用LM中文金融词典对年报进行语调分析</a></li>
<li><a href="https://textdata.cn/blog/2024-12-31-using-regex-to-compute-the-financial_constraints/">管理世界 | 使用md&amp;a数据中计算 「<strong>企业融资约束指标</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2024-06-17-firms-rhetorical-nationalism/">MOR | 使用md&amp;a测量「<strong>企业民族主义指标</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | 使用Python测量MD&amp;A「<strong>信息含量</strong> 」指标</a></li>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python测量关键审计事项「<strong>信息含量</strong>」指标</a></li>
<li><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「<strong>经济政策不确定性EPU指标</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">代码 | 使用 「MD&amp;A文本」测量「<em><strong>企业不确定性感知FEPU指标</strong></em>」</a></li>
<li><a href="https://textdata.cn/blog/2023-05-23-soft-cosine-similarity/">管理科学学报 |  使用「<strong>软余弦相似度</strong>」测量业绩说明会「<strong>答非所问程度</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/">中国管理科学 | 使用业绩说明会文本数据测量 「<strong>上市公司前瞻性信息</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2023-01-10-similarity_of_cental_bank_monetary_policy/">金融研究 | 央行货币政策文本相似度计算与可视化</a></li>
<li><a href="https://textdata.cn/blog/2024-06-20-using-python-to-caculate-herfindahl-hirschman-index/">代码 | 如何用Python计算「<strong>专利知识宽度</strong>」(赫芬达尔—赫希曼指数)</a></li>
<li><a href="https://textdata.cn/blog/2023-10-07-esg-measurement/">使用文本分析度量企业ESG属性</a></li>
<li><a href="https://textdata.cn/blog/2019-12-08-lazy-prices/">文本相似 | Lazy Prices公司年报内容变动预示重大风险</a></li>
<li><a href="https://textdata.cn/blog/2024-12-31-measure-corporate-culture-using-word2vec/">使用 Word2Vec 和 TF-IDF 计算五类企业文化</a></li>
<li><a href="https://textdata.cn/blog/2023-01-12-review_about_accounting_text_mining/">转载 | 国外会计文本信息实证研究述评与展望</a></li>
<li><a href="https://textdata.cn/blog/2024-12-31-the-experience-of-ceo-to-vector-with-graphe-embeddings/">如何用图嵌入(网络思维和嵌入思维)表征企业，表征高管的职业经历</a></li>
<li><a href="https://textdata.cn/blog/2023-01-16-papers-using-text-mining-tech-in-journal-of-economic-research/">近年《经济研究》中「文本分析」相关论文</a></li>
<li><a href="https://textdata.cn/blog/2022-11-16-literature-review-textmining-in-finance-yao2020/">转载 | 金融学文本大数据挖掘方法与研究进展</a></li>
<li><a href="https://textdata.cn/blog/2023-08-26-text-analysis-in-accounting/">CAR2023 | 文本分析在会计中的应用</a></li>
<li><a href="https://textdata.cn/blog/accountingtext/">视频分享 | 会计领域中的Python文本分析</a></li>
</ul>
<br>
<br>
<h2 id="经济学">经济学</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-04-09-narrative-economic-method/">叙事经济学：揭示经济中的叙事</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-12-30-review-about-socioeconomic-status-analysis/">转载 | 大数据驱动的「社会经济地位」分析研究综述</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-09-19-quantitative-history-economic/">文献汇总 | 量化历史学与经济学研究</a></p>
</li>
</ul>
<br>
<br>
<h2 id="心理学">心理学</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>PNAS | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-31-pnas-measure-replicability-of-psychology-with-ml/">PNAS | 14000+篇心理学顶刊论文可复现性调研</a></li>
<li><a href="https://textdata.cn/blog/2022-11-14-pnas_naming_unrelated_words_predicts_creativity/">PNAS | 使用语义距离测量一个人的 <strong>创新力</strong>(<strong>发散思维</strong>)得分</a></li>
<li><a href="https://textdata.cn/blog/2023-03-10-psychological-research-with-word-embeddings/">基于词嵌入技术的心理学研究: 方法及应用</a></li>
<li><a href="https://textdata.cn/blog/2023-10-18-the-relationship-between-semantic-distance-with-creativity/">心理科学进展 | <strong>语义距离</strong> 与 <strong>创造性思维</strong> 关系的元分析</a></li>
<li><a href="https://textdata.cn/blog/2023-02-13-computing-cultural-psychology-with-big-data/">转载 | 大数据时代的「计算文化心理学」</a></li>
</ul>
<p><br><br></p>
<h2 id="社会学">社会学</h2>
<ul>
<li><a href="https://textdata.cn/blog/2022-12-03-social-computing-methodology-about-big-data-and-artificial-intelligence/">转载 | 社会计算驱动的社会科学研究方法</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载 | 大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></li>
<li><a href="https://textdata.cn/blog/from_sysbol_to_embeddings_in_computational_social_science/">转载 | 从符号到嵌入：计算社会科学的两种文本表示</a></li>
<li><a href="https://textdata.cn/blog/2021-12-19-pnas_historical_language/">PNAS | 历史语言记录揭示了近几十年来认知扭曲的激增</a></li>
<li><a href="https://textdata.cn/blog/2022-01-02-pnas_love_separate/">PNAS | 情侣分手3个月前就有预兆！聊天记录还能反映分手后遗症</a></li>
<li><a href="https://textdata.cn/blog/2023-03-13-linguistic-positivity-in-historical-texts-reflects-dynamic-environmental-and-psychological-factors/">PNAS | 历史文本中的语言积极性反映了动态的环境和心理因素(含Python代码)</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总</a></li>
<li><a href="https://textdata.cn/blog/2021-12-28-pnas_culture_bridges/">PNAS | 文本网络分析&amp;文化桥梁 Python 代码实现</a></li>
<li><a href="https://textdata.cn/blog/2022-04-09-literature-about-embeddings/">文献汇总 | 词嵌入 与 社会科学中的偏见(态度)</a></li>
<li><a href="https://textdata.cn/blog/2022-04-01-embeddings-and-attitude/">词嵌入测量不同群体对某概念的态度(偏见)</a></li>
<li><a href="https://textdata.cn/blog/2023-03-03-extracts-cognitive-information-and-visualization-with-embedings/">可视化  |  词嵌入模型用于计算社科领域刻板印象等信息（含代码）</a></li>
<li><a href="https://textdata.cn/blog/2021-12-27-pnas_text_fluency/">PNAS | 词汇熟悉度对线上参与和资金筹集的预测性效用</a></li>
</ul>
<p><br><br></p>
<h2 id="其他">其他</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-04-23-word-embedding-reflect-human-attitude/">文化几何学：通过词嵌入分析反映文本背后的社会文化(变迁)</a></li>
<li><a href="https://textdata.cn/blog/2023-04-07-sapir-whorf-hypothesis/">语言相对性论 | 语言是否决定/影响人的思维和认知</a></li>
<li><a href="https://textdata.cn/blog/2023-11-16-how-to-understand-the-meaning-of-gpt/">Word Embeddings、Transformer与GPT：一文揭示三者关系</a></li>
<li><a href="https://textdata.cn/blog/2023-11-13-violatating-privacy-via-inference-with-large-language-model/">大模型的隐私推断能力 | 不可不防的大模型“人肉搜索”能力</a></li>
<li><a href="https://textdata.cn/blog/text_readability/">文本可读性研究及应用清单</a></li>
<li><a href="https://mp.weixin.qq.com/s/mefUYQnTn8vdWV78c9lRBw">多维度、细粒度情感词库的核心思想与建设过程概述</a></li>
<li><a href="https://textdata.cn/blog/2024-08-04-label-text-data-with-large-language-model/">LLM数据标注：是否胜过人类？</a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/">实验 | 使用本地大模型从论文PDF中提取结构化信息</a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>LIST | 可供社科(经管)领域使用的科研数据集清单</title>
      <link>https://textdata.cn/blog/datasets_available_for_management_science/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/datasets_available_for_management_science/</guid>
      <description>可供社科(经管)使用的数据集</description>
      <content:encoded><![CDATA[<p>按照科研层次，将数据集(资源)类型划分为如下四方面</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 社会
- 企业
- 用户
- 其他
</code></pre></div><p>本列表所展示的数据集，均整理自网络公开内容。为方便经管社科领域学者开展大数据范式的科学研究，本列表还将展示如何用 Python 处理这类大体量数据集。</p>
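<p>以分块读取为例，处理超出内存的大体量 csv 文件时，可借助 pandas 的 <code>chunksize</code> 参数逐块载入。下面是一个最小示意（示例文件与字段均为虚构，实际使用时替换为数据集路径）：</p>

```python
import pandas as pd

# 构造一个小示例文件; 实际场景中替换为大体量数据集的路径(此处文件名仅为示意)
pd.DataFrame({'text': ['文本1', '文本2', '文本3', '文本4']}).to_csv('demo.csv', index=False)

# 分块读取: 每次只载入 chunksize 行, 内存占用可控
row_count = 0
for chunk in pd.read_csv('demo.csv', chunksize=2):
    row_count += len(chunk)

print(row_count)  # 4
```

<p>统计词频、匹配词典等操作都可放在循环体内，对每个 chunk 增量完成，最后再汇总结果。</p>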
<p>如有任何问题， 可加微信 372335839，备注「姓名-学校-专业」。</p>
<p><br><br></p>
<h2 id="社会">社会</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">数据集 | 含 人民日报/经济日报/光明日报 等 120 家报纸(2025.3)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/">数据集 | 人民网地方领导留言板原始文本(2011-2023.12)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-03-global-cell-towers-dataset/">数据集 | 4877w 条全球手机蜂窝基站数据(2006~2024.5)</a></li>
<li><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/">数据集 | 5112w+专利申请数据集(1985-2025)</a></li>
<li><a href="https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/">数据集 | USA Today 新闻数据集(2012~2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集 | ChinaDaily 新闻数据集(2008 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家 Entrepreneur 杂志数据集(1996 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-nytimes-news-dataset-from-2000-to-2025/">数据集 | 纽约时报 NYTimes 新闻数据集(2000~2025.3.1)</a></li>
<li><a href="https://textdata.cn/blog/2024-06-03-podcasts-dataset/">数据集 | 30w 播客(Podcast)的 560w 条评论数据(2005-2023)</a></li>
<li><a href="https://textdata.cn/blog/2024-06-05-wenzheng-hunan-dataset/">数据集 | 30w 条「问政湖南」留言&amp;回复数据(2010-2024)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-03-china-mainland-corporate-registration-information/">数据集 | 2.49 亿条中国工商注册企业信息(23.9 更新)</a></li>
<li><a href="https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/">数据集 | 中国裁判文书网(2010-2021)</a></li>
<li><a href="https://textdata.cn/blog/2023-09-03-government-procurement-contract-data/">数据集 | 372w 政府采购合同公告明细数据（2024.03)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">数据集 | 国、省、市三级政府工作报告文本(1954-2023)</a></li>
</ul>
<br>
<ul>
<li><a href="https://textdata.cn/blog/2025-03-21-the-arxiv-metadata-dataset-of-millions-of-scholarly-papers/">数据集 | arXiv 网站 269w 学术论文元数据 (2007 ~ 2025)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-netherlands-daily-news-dataset-from-2015-to-2025/">数据集 | NOS.nl 荷兰新闻数据集(2015~2025.2.28)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-13-cbs-news-dataset/">数据集 | CBS News 新闻数据集(1998 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">数据集 | 使用 1000w 条豆瓣影评训练 Word2Vec</a></li>
<li><a href="https://textdata.cn/blog/2024-04-17-douban-book-3394w-ratings-comments-dataset/">数据集 | 3394w 条豆瓣书评数据集</a></li>
<li><a href="https://textdata.cn/blog/2023-12-29-china-area-dataset/">数据集 | 2024 年中国全国 5 级行政区划（省、市、县、镇、村）</a></li>
<li><a href="https://textdata.cn/blog/2023-12-29-china-area-division-change/">数据集 | 行政区划代码历史沿革数据集</a></li>
<li><a href="https://textdata.cn/blog/2023-04-12-china-poi-datasets/">数据集 | 3.9G 全国 POI 地点兴趣点数据集</a></li>
<li><a href="https://textdata.cn/blog/2025-03-14-google-map-review-dataset/">数据集 | 6.6 亿条美国谷歌地图 POI 评论数据(~2021.9)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-08-open-sanctions-dataset/">数据源 | 使用该网站可查询被制裁的个人、企业组织等制裁清单</a></li>
<li><a href="https://textdata.cn/blog/2025-03-14-uk-glassdoor-review-dataset/">数据集 | Glassdoor 网站 990w 条英国公司(职位)评论数据(2008~2023.7)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-17-the_mother_of_all_movie_review_datasets/">数据集 | 5513w 条外文电影评论数据(1902~2024)</a></li>
</ul>
<p><br><br></p>
<h2 id="企业">企业</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-03-03-china-share-market-interaction-platform-dataset/">数据集 | 536w 条「上证 e 互动、深证互动易」问答记录(2011-2024.12.31)</a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/">数据集| 美股年报数据(2000-2025)</a></li>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/">数据集 | 港股年报文本数据集(2007 ~ 2025.04)</a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/">数据集 | 三板上市公司年报(2002-2025.06)</a></li>
<li><a href="https://textdata.cn/blog/2025-02-25-china-fund-annual-report-dataset/">数据集 | 1998-2023 年中国基金年度报告</a></li>
<li><a href="https://textdata.cn/blog/2025-03-06-china-recruitment-dataset-of-listed-companies/">数据集 | 上市公司招聘数据(2014~2023)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-06-chinese-fresh-graduates-recruitment-dataset/">数据集 | 应届生招聘数据集(2014~2024.12)</a></li>
<li><a href="https://textdata.cn/blog/2024-06-26-hongkong-environmental-social-governance-dataset/">数据集 | 2012 年-2025 年港股 ESG 报告数据集</a></li>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001 年-2024 年 A 股上市公司年报&amp;管理层讨论与分析</a></li>
<li><a href="https://textdata.cn/blog/2023-08-11-china-a-market-corporate-social-responsibility-dataste/">数据集 | 2006 年-2023 年 A 股企业社会责任报告/环境报告书/可持续发展报告</a></li>
<li><a href="https://textdata.cn/blog/2024-04-18-china-a-listed-company-figure-characteristic-dataset/">数据集 | 上市公司董监高人员的个人特征/教育背景/任职情况</a></li>
<li><a href="https://textdata.cn/blog/2023-04-17-china-a-market-inquiry-letter-datasets/">数据集 | 2014 年-2023 年「问询函」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/">数据集 | 84w 条业绩说明会问答数据(2005-2023)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-07-patent-application-dataset-of-listed-company-in-china-a-market/">数据集 | 上市公司 208 万条专利数据集 (1991-2022)</a></li>
</ul>
<br>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用 MD&amp;A2001-2024 语料训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2025-03-11-layline-insider-trading-dataset/">数据集 | Layline 美股内幕交易数据集</a></li>
<li><a href="https://textdata.cn/blog/2024-07-19-csrwise-dataset/">数据集 | 聚焦美股企业社会责任 CSR Wire 网站新闻数据集(1999-2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-01-03-listed-company-arbitration-dataset/">数据集 | 36330 条上市公司仲裁数据(2000-2021.9)</a></li>
<li><a href="https://textdata.cn/blog/2023-04-26-entrusted-loan-dataset/">数据集 | 07-21 年上市公司「委托贷款公告」</a></li>
<li><a href="https://textdata.cn/blog/2022-11-25-senior-manager-resume-dataset/">数据集 | 90w 条中国上市公司高管数据</a></li>
<li><a href="https://textdata.cn/blog/2022-12-10-1850w-poi-dataset/">数据集| 1850 万条世界地图 POI 兴趣点数据集</a></li>
</ul>
<p><br><br></p>
<h2 id="用户">用户</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/">数据集 | 1500w+ 消费者投诉数据集(2018 ~ 2024.8)</a></li>
</ul>
<br>
<ul>
<li><a href="https://textdata.cn/blog/2025-03-06-consumer-finance-complaints-dataset/">数据集 | 消费者金融投诉数据集(2011 ~ 2025.3)</a></li>
<li><a href="https://textdata.cn/blog/2024-04-10-kiva-crowdfunding/">数据集 | 众筹平台 kiva 借贷信息</a></li>
<li><a href="https://textdata.cn/blog/2023-11-22-1000w-github-developer-dataset/">数据集 | 1000 万 Github 用户数据</a></li>
<li><a href="https://textdata.cn/blog/2023-11-22-open-dataset-gharchive-org/">数据集 | 使用 GH Archive 获取 Github 社区用户数据</a></li>
<li><a href="https://textdata.cn/blog/2023-12-24-instagram-influencer-dataset/">数据集 | 3.3 万 Instagram Influencer 的 1018 万条推文数据</a></li>
<li><a href="https://textdata.cn/blog/yelpdataset_10g/">数据集 | YelpDataset 酒店管理类数据集</a></li>
<li><a href="https://textdata.cn/blog/2022-12-08-indiegogo-dataset/">数据集 | 200 万条 Indiegogo 众筹项目信息</a></li>
<li><a href="https://textdata.cn/blog/2022-12-04-kickstarters_dataset/">数据集 | 23w 条 Kickstarter 项目信息</a></li>
<li><a href="https://textdata.cn/blog/2023-05-10-100m-bilibili-user-info-dataset/">数据集 | B 站/哔哩哔哩 1 亿用户数据(脱敏)</a></li>
<li><a href="https://textdata.cn/blog/2023-03-06-zhihurec-dataset/">数据集 | 80w 知乎用户问答数据(脱敏)</a></li>
<li><a href="https://textdata.cn/blog/2023-03-06-bedtime-news-datasets/">数据集 | 马前卒工作室 睡前消息文稿汇总</a></li>
</ul>
<p><br><br></p>
<h2 id="其他">其他</h2>
<ul>
<li><a href="https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings"><strong>推荐 | cntext 训练出的免费公开词向量</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用 MD&amp;A2001-2024 语料训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2025-04-17-training-a-gloVe-model-using-china-judgements-corpus/">词向量 |   使用裁判文书语料训练 GloVe 词向量</a></li>
<li><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">词向量 | 使用 1985 年-2025 年 5000w 专利申请摘要训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/embeddings_resource_usage_method/">词向量 | 中文词向量资源汇总 &amp; 使用方法</a></li>
<li><a href="https://textdata.cn/blog/pretained_nlp_models/">词向量 | 汽车、金融等 9 大领域预训练词向量模型下载资源</a></li>
<li><a href="https://textdata.cn/blog/2023-03-08-edgar-w2v-and-corpus/">词向量 | 25 年数据的预训练词向量模型(EDGAR）</a></li>
<li><a href="https://textdata.cn/blog/2022-10-16-aligned-word-vectors/">词向量 | 多语言对齐词向量预训练模型</a></li>
<li><a href="https://textdata.cn/blog/2023-04-05-chinese-concreteness-dictionary-from-behavior-research-method/">词典 | 中文心理词典，含具体性、可成象性等指标</a></li>
<li><a href="https://textdata.cn/blog/2024-02-27-ancw-affective-norms-for-4030-chinese-words/">词典 | ANCW 4030 词的中文情感词典(效价、唤醒度、主导度、具体性)</a></li>
<li><a href="https://textdata.cn/blog/2023-03-20-nature-six-semantic-dimension-database/">词典 | Nature 通用中英文六维语义情感词典</a></li>
<li><a href="https://textdata.cn/blog/chinese_semantic_kb/">词典 | 中文语义常用词典(ChineseSemanticKB)</a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>资源 | 中文 GloVe&amp;Word2Vec 词向量模型列表</title>
      <link>https://textdata.cn/blog/2025-04-18-chinese-pretrained-word-embeddings/</link>
      <pubDate>Fri, 18 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-04-18-chinese-pretrained-word-embeddings/</guid>
      <description>中文语料预训练模型列表， 使用 cntext2.x 训练出的预训练语言模型， 主要分 GloVe 和 Word2Vec 两种。</description>
      <content:encoded><![CDATA[<p>中文语料预训练模型列表， 使用 cntext2.x 训练出的预训练语言模型， 主要分 GloVe 和 Word2Vec 两种。</p>
<br>
<h2 id="一中文预训练模型">一、中文预训练模型</h2>
<p>使用 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.x</a> 训练得到的中文预训练模型资源，汇总如下</p>
<p>对中文语料进行了近义测试和类比测试。其中斯皮尔曼秩系数(Spearman&rsquo;s Rank Coefficient)取值范围为[-1,1]，取值越大表示模型越符合人类的认知。</p>
<p>类比测试包含首都国家（CapitalOfCountries）、省会省份（CityInProvince）、家人关系（FamilyRelationship）、社会科学（管理、经济、心理等，SocialScience）四类的准确率测试。</p>
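<p>斯皮尔曼秩系数的计算逻辑可用一个极简示例说明：先将两列得分分别转为秩，再套用公式 ρ = 1 − 6Σd²/(n(n²−1))。以下评分数据为虚构，仅作演示（假设无并列秩）：</p>

```python
# 人工对 5 组词对的近义评分(虚构示例数据)
human = [9.1, 8.5, 6.0, 3.2, 1.5]
# 模型给出的对应余弦相似度(虚构示例数据)
model = [0.82, 0.75, 0.40, 0.35, 0.10]

def rank(xs):
    # 返回每个元素的秩(从 1 开始, 此处假设无并列)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

n = len(human)
d2 = sum((a - b) ** 2 for a, b in zip(rank(human), rank(model)))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rho)  # 两列排序完全一致, 故 rho = 1.0
```

<p>实际评估中两列分别为人工近义评分与模型相似度；存在并列秩的一般情形可直接使用 scipy.stats.spearmanr。</p>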
<br>
<table>
<thead>
<tr>
<th>数据集</th>
<th>词向量</th>
<th>网盘</th>
<th>斯皮尔曼秩系数</th>
<th>首都国家(%)</th>
<th>省会省份(%)</th>
<th>家人关系(%)</th>
<th>社会科学(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">中国政府工作报告</a></td>
<td><strong><em>人民政府(国省市)工作报告-GloVe.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1IdK8RU9L8mp6I2nhcoSmyA?pwd=ht2s">https://pan.baidu.com/s/1IdK8RU9L8mp6I2nhcoSmyA?pwd=ht2s</a></td>
<td>0.38</td>
<td>30.73</td>
<td>98.86</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">中国政府工作报告</a></td>
<td><strong><em>人民政府(国省市)工作报告-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1GoTjMbUcYS4jN6w4GqlqBA?pwd=qb5b">https://pan.baidu.com/s/1GoTjMbUcYS4jN6w4GqlqBA?pwd=qb5b</a></td>
<td>0.35</td>
<td>30.06</td>
<td>96.00</td>
<td>0.00</td>
<td>16.67</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/">中国裁判文书网</a></td>
<td><strong><em>裁判文书-GloVe.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1a0Fisvnkl8UaQZrHP7olCQ?pwd=8w49">https://pan.baidu.com/s/1a0Fisvnkl8UaQZrHP7olCQ?pwd=8w49</a></td>
<td>0.37</td>
<td>7.69</td>
<td>98.86</td>
<td>75.53</td>
<td>25.00</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/">留言板</a></td>
<td><strong><em>留言板-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1n7vwCOBnrye1CYrt_IBqZA?pwd=9m42">https://pan.baidu.com/s/1n7vwCOBnrye1CYrt_IBqZA?pwd=9m42</a></td>
<td>0.45</td>
<td>19.33</td>
<td>100</td>
<td>61.40</td>
<td>20.00</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">A 股年报</a></td>
<td><strong><em>mda01-23-GloVe.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1vXvbomHjOaFBeEz7GV0R6A?pwd=y6hd">https://pan.baidu.com/s/1vXvbomHjOaFBeEz7GV0R6A?pwd=y6hd</a></td>
<td>0.34</td>
<td>78.13</td>
<td>100</td>
<td>0</td>
<td>37.50</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">A 股年报</a></td>
<td><strong><em>mda01-23-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/11V1RyqH_cKE9eju0Mm-1TQ?pwd=kcwx">https://pan.baidu.com/s/11V1RyqH_cKE9eju0Mm-1TQ?pwd=kcwx</a></td>
<td>0.41</td>
<td>27.27</td>
<td>97.14</td>
<td>10</td>
<td>44.44</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/">港股年报</a></td>
<td><strong><em>英文港股年报-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1ISGAoZnA_1Ben6M2DCliOQ?pwd=nagx">https://pan.baidu.com/s/1ISGAoZnA_1Ben6M2DCliOQ?pwd=nagx</a></td>
<td>&mdash;</td>
<td>&mdash;</td>
<td>&mdash;</td>
<td>&mdash;</td>
<td>&mdash;</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/">港股年报</a></td>
<td><strong><em>中文港股年报-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1smMcrPtIP8g635YABCodig?pwd=sjdj">https://pan.baidu.com/s/1smMcrPtIP8g635YABCodig?pwd=sjdj</a></td>
<td>0.35</td>
<td>25.20</td>
<td>79.43</td>
<td>18.59</td>
<td>25</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">人民日报</a></td>
<td><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">年份 Word2Vec</a></td>
<td><a href="https://pan.baidu.com/s/1Ru_wxu9egsmhM7lATjSlgQ?pwd=bcea">https://pan.baidu.com/s/1Ru_wxu9egsmhM7lATjSlgQ?pwd=bcea</a></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">人民日报</a></td>
<td><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">对齐模型 Aligned_Word2Vec</a></td>
<td><a href="https://pan.baidu.com/s/1IVgP0MyQpez0hpoJyEyFdA?pwd=7qsu">https://pan.baidu.com/s/1IVgP0MyQpez0hpoJyEyFdA?pwd=7qsu</a></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/">专利申请</a></td>
<td><strong><em>专利摘要-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1FHI_J7wU9eQGRckD12QB5g?pwd=6rr2">https://pan.baidu.com/s/1FHI_J7wU9eQGRckD12QB5g?pwd=6rr2</a></td>
<td>0.46</td>
<td>3.78</td>
<td>25.14</td>
<td>33.33</td>
<td>37.50</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">专利申请</a></td>
<td><strong><em>province_w2vs 分省份训练词向量</em></strong></td>
<td><a href="https://pan.baidu.com/s/1eBFTIZcv2DWssLiaRnCqZQ?pwd=ikpu">https://pan.baidu.com/s/1eBFTIZcv2DWssLiaRnCqZQ?pwd=ikpu</a></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">专利申请</a></td>
<td><strong><em>year_w2vs 分年份训练词向量</em></strong></td>
<td><a href="https://pan.baidu.com/s/1lrVkML92cVJdHQa1HQyAwA?pwd=4gqa">https://pan.baidu.com/s/1lrVkML92cVJdHQa1HQyAwA?pwd=4gqa</a></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>大众点评评论语料</td>
<td><strong><em>大众点评-评论-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/15He728XGzoXDFYrUWDTaqQ?pwd=eg6x">https://pan.baidu.com/s/15He728XGzoXDFYrUWDTaqQ?pwd=eg6x</a></td>
<td>0.34</td>
<td>50.31</td>
<td>89.71</td>
<td>70.00</td>
<td>0.00</td>
</tr>
<tr>
<td>中文歌词</td>
<td><strong><em>中文歌词-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1h1g1mOACmpCwn5pz8jR3vQ?pwd=ub2z">https://pan.baidu.com/s/1h1g1mOACmpCwn5pz8jR3vQ?pwd=ub2z</a></td>
<td>0.06</td>
<td>0.00</td>
<td>0.00</td>
<td>0.9</td>
<td>0.00</td>
</tr>
<tr>
<td>英文歌词</td>
<td><strong><em>英文歌词-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1ycy-BTSa8zqW_xbIoshy6Q?pwd=hu1v">https://pan.baidu.com/s/1ycy-BTSa8zqW_xbIoshy6Q?pwd=hu1v</a></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/">黑猫消费者投诉</a></td>
<td><strong><em>消费者黑猫投诉-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1FOI2BIVRojOswdKfqaNbsw?pwd=catc">https://pan.baidu.com/s/1FOI2BIVRojOswdKfqaNbsw?pwd=catc</a></td>
<td>0.32</td>
<td>16.18</td>
<td>68</td>
<td>28.57</td>
<td>0.00</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset">豆瓣影评</a></td>
<td><strong><em>douban-movie-1000w-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1uq6Ti7HbEWyT4CgktKrMng?pwd=63jg">https://pan.baidu.com/s/1uq6Ti7HbEWyT4CgktKrMng?pwd=63jg</a></td>
<td>0.43</td>
<td>39.02</td>
<td>28.57</td>
<td>92.65</td>
<td>25.00</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec">B 站</a></td>
<td><strong><em>B 站签名-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1OtBU9BzitcNxkmPzhzH6FQ?pwd=m3iv">https://pan.baidu.com/s/1OtBU9BzitcNxkmPzhzH6FQ?pwd=m3iv</a></td>
<td>0.34</td>
<td>25.56</td>
<td>33.71</td>
<td>44.17</td>
<td>0.00</td>
</tr>
<tr>
<td><a href="https://github.com/Viscount/IUI-Paper">B 站弹幕</a></td>
<td><strong><em>B 站弹幕-Word2Vec.200.15.bin</em></strong></td>
<td><a href="https://pan.baidu.com/s/1LNDLed5uP3KnUMmrKf_uhg?pwd=x4t8">https://pan.baidu.com/s/1LNDLed5uP3KnUMmrKf_uhg?pwd=x4t8</a></td>
<td>0.42</td>
<td>11.67</td>
<td>65.81</td>
<td>44.17</td>
<td>25.00</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="二cntext2x">二、cntext2.x</h2>
<p>cntext2.x 是中英文文本分析库，内置多种词典和常用函数。常见的文本分析往往需要数十行代码，而 cntext2.x 力求将代码量控制在 2~5 行。</p>
<h3 id="21-训练模型">2.1 训练模型</h3>
<p>训练模型步骤:</p>
<ol>
<li>构建语料</li>
<li>训练模型</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 以下参数在大邓的 Mac(96G 内存、12 核)上运行</span>
<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;留言板.txt&#39;</span><span class="p">,</span>
                  <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                  <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                  <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                  <span class="n">chunksize</span><span class="o">=</span><span class="mi">100000</span><span class="p">,</span>
                  <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/renmin_board_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/renmin_board_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2692 s.
Output Saved To: output/留言板-Word2Vec.200.15.bin
</code></pre></div><br>
<p>cntext2.x 训练模型的教程可参考</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec/">使用 1 亿 B 站用户签名训练 word2vec 词向量</a></li>
<li><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">使用 1985 年-2025 年专利申请摘要训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">使用 MD&amp;A2001-2023 语料训练 Word2Vec/GloVe 模型</a></li>
<li><a href="https://textdata.cn/blog/2025-04-17-training-a-glove-model-using-china-judgements-corpus/">使用裁判文书语料训练 GloVe 词向量</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">使用人民网领导留言板语料训练 Word2Vec 模型</a></li>
</ul>
<br>
<h3 id="22-评估模型">2.2 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">文档</a></p>
<br>
<p><strong>近义测试</strong></p>
<p>cntext2.x 内置 537 条近义实验数据， 可直接使用。</p>
<p><img loading="lazy" src="img/01-similar.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   426    |    111     |            0.45            |
+----------+------------+----------------------------+
</code></pre></div><p>Spearman’s Rank Coefficient（斯皮尔曼等级相关系数）取值范围为 [-1, 1]，取值越大，说明模型表现越好。</p>
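<p>为直观理解该指标：Spearman 等级相关即先把两列数值分别转成秩次，再对秩次计算 Pearson 相关。下面是一个纯 Python 的最小示意（仅演示原理，并非 cntext 的实现）：</p>

```python
def rank(values):
    # 把数值转为秩次（并列值取平均秩次），秩次从 1 开始
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman 等级相关 = 秩次序列的 Pearson 相关
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# 人工打分与模型余弦相似度排序完全一致时，系数为 1.0
print(spearman([0.9, 0.5, 0.1], [0.8, 0.6, 0.2]))  # 1.0
```

<p>评估模型时，两列数值分别是人工标注的词对相似度与模型计算的余弦相似度，二者排序越一致，系数越接近 1。</p>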
<br>
<p><strong>类比测试</strong></p>
<ul>
<li>雅典之于希腊，似如巴格达之于伊拉克。</li>
<li>哈尔滨之于黑龙江，似如长沙之于湖南。</li>
<li>国王之于王后，似如男人之于女人。</li>
</ul>
<p><img loading="lazy" src="img/02-analogy-woman.png" alt=""  />
</p>
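<p>类比测试背后是词向量的加减运算：「国王」之于「王后」，似如「男人」之于「？」，即求与向量「王后」-「国王」+「男人」最接近的词。下面用虚构的 4 维玩具向量演示这一机制（数值是假设的，并非真实模型）：</p>

```python
import numpy as np

# 虚构的玩具词向量，仅用于演示类比的向量运算原理
toy = {
    '国王': np.array([1.0, 1.0, 0.0, 0.0]),
    '王后': np.array([1.0, 0.0, 1.0, 0.0]),
    '男人': np.array([0.0, 1.0, 0.0, 1.0]),
    '女人': np.array([0.0, 0.0, 1.0, 1.0]),
}

def analogy(a, b, c, vocab):
    # 求解「a 之于 b，似如 c 之于 ?」：target = b - a + c
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -2.0
    for word, vec in vocab.items():
        if word in (a, b, c):  # 排除题面中已出现的词
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy('国王', '王后', '男人', toy))  # 女人
```
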
<p>cntext2.x 内置 1194 条类比， 格式如下</p>
<p><img loading="lazy" src="img/03-analogy.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   238    |    439     |   19.33    |   2.74   |
|   CityInProvince   |   175    |     0      |   100.00   |   1.01   |
| FamilyRelationship |   272    |     0      |   61.40    |   1.96   |
|   SocialScience    |    10    |     60     |   20.00    |   1.50   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><ul>
<li>CapitalOfCountries 留言板语料在此项表现较差，应该是语料中常见国家首都的提及较少。</li>
<li>CityInProvince 留言板语料在此项表现如此优异，应该是语料中省份、省会地域词经常出现。</li>
<li>FamilyRelationship 留言板中应该少不了家长里短， 所以此项准确率还可以。 以<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">年报 MD&amp;A</a>为例，此处准确率只有 10%, 而<a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">豆瓣影评</a>该处准确率高达 92.65%。</li>
<li>SocialScience 留言板语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。</li>
</ul>
<p>整体而言，该语料训练的模型效果很不错，抓住了数据场景的独特语义。</p>
<p><br><br></p>
<h2 id="三模型使用">三、模型使用</h2>
<h3 id="31-读取模型">3.1 读取模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/留言板-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;维度数:&#39;</span><span class="p">,</span> <span class="n">w2v</span><span class="o">.</span><span class="n">vector_size</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;词汇量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">w2v</span><span class="p">))</span>
<span class="n">w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/留言板-Word2Vec.200.15.bin...
维度数: 200
词汇量:  1050245
&lt;gensim.models.keyedvectors.KeyedVectors at 0x328d737a0&gt;
</code></pre></div><br>
<h3 id="32-keyedvectors-的操作方法或属性">3.2 KeyedVectors 的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>KeyedVectors.index_to_key</em></strong></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.key_to_index</em></strong></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.vector_size</em></strong></td>
<td>获取模型中词向量的维度。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.get_vector(word)</em></strong></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_word(word, topn=10)</em></strong></td>
<td>获取某词语最相似的 10 个近义词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_vector(vector, topn=10)</em></strong></td>
<td>获取词向量最相似的 10 个近义词。</td>
</tr>
<tr>
<td>&hellip;</td>
<td>&hellip;</td>
</tr>
</tbody>
</table>
<br>
<h3 id="33-查看词表">3.3 查看词表</h3>
<p>因为词表有 <strong><em>1050245</em></strong> 个词， 为了方便，这里只显示前 20 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># 词表是有序的（按词频从高到低排列）
list(w2v.index_to_key)[:20]
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;问题&#39;,
 &#39;进行&#39;,
 &#39;您好&#39;,
 &#39;工作&#39;,
 &#39;小区&#39;,
 &#39;反映&#39;,
 &#39;领导&#39;,
 &#39;情况&#39;,
 &#39;相关&#39;,
 &#39;留言&#39;,
 &#39;没有&#39;,
 &#39;感谢您&#39;,
 &#39;网友&#39;,
 &#39;业主&#39;,
 &#39;办理&#39;,
 &#39;公司&#39;,
 &#39;建设&#39;,
 &#39;回复&#39;,
 &#39;支持&#39;,
 &#39;部门&#39;]
</code></pre></div><br>
<p>查看词表映射</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">w2v.key_to_index
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;问题&#39;: 0,
 &#39;进行&#39;: 1,
 &#39;您好&#39;: 2,
 &#39;工作&#39;: 3,
 &#39;小区&#39;: 4,
 &#39;反映&#39;: 5,
 &#39;领导&#39;: 6,
 ...
  &#39;连续&#39;: 995,
 &#39;稳定&#39;: 996,
 &#39;市住建局&#39;: 997,
 &#39;降低&#39;: 998,
 &#39;会同&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="34-获取某词的向量">3.4 获取某词的向量</h3>
<p>查找某词对应的词向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># w2v[&#39;问题&#39;]</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-6.2813835 ,  1.5916584 , -0.48086444, -2.6446412 , 10.031776  ,
       -0.11915778, -5.039283  , -2.1107564 ,  1.1351422 , -2.881387  ,
        4.2890835 , -1.1337206 ,  3.7850847 , -3.640467  , -0.96282107,
        ...
        ...
        1.1314462 , -2.5386178 , -2.3993561 , -2.0407596 ,  0.95457   ,
        3.03732   , -2.033116  , -0.20390491,  3.5368073 ,  6.5452943 ,
        2.1186016 ,  0.79572505,  2.5855987 ,  0.88565165, -1.812104  ],
      dtype=float32)
</code></pre></div><p>受限于篇幅，这里显示词向量的部分数值。</p>
<br>
<p>需要注意，如果查询的词不存在于模型词表，则会出现报错。例如</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">word = &#39;这是一个不存在的词&#39;
w2v.get_vector(word)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[130], line 2
      1 word = &#39;这是一个不存在的词&#39;
----&gt; 2 w2v.wv.get_vector(word)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:446, in KeyedVectors.get_vector(self, key, norm)
    422 def get_vector(self, key, norm=False):
    423     &#34;&#34;&#34;Get the key&#39;s vector, as a 1D numpy array.
    424
    425     Parameters
   (...)
    444
    445     &#34;&#34;&#34;
--&gt; 446     index = self.get_index(key)
    447     if norm:
    448         self.fill_norms()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:420, in KeyedVectors.get_index(self, key, default)
    418     return default
    419 else:
--&gt; 420     raise KeyError(f&#34;Key &#39;{key}&#39; not present&#34;)

KeyError: &#34;Key &#39;这是一个不存在的词&#39; not present&#34;

</code></pre></div><br>
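<p>为避免 KeyError，查询前可先通过 key_to_index（或 in 运算符）判断词语是否在词表中。下面是这种防御性写法的示意，其中 StubKV 是一个极简替身对象，仅为让示例可独立运行；实际使用时直接传入 gensim 的 KeyedVectors 即可：</p>

```python
import numpy as np

def safe_get_vector(kv, word, default=None):
    # 词在词表中则返回其向量，否则返回 default（如零向量或 None）
    if word in kv.key_to_index:
        return kv.get_vector(word)
    return default

# 极简替身(stub)，接口与 gensim KeyedVectors 的这两个成员一致
class StubKV:
    def __init__(self, vectors):
        self.key_to_index = {w: i for i, w in enumerate(vectors)}
        self._vectors = vectors
    def get_vector(self, word):
        return self._vectors[word]

kv = StubKV({'问题': np.array([0.1, 0.2]), '小区': np.array([0.3, 0.4])})
print(safe_get_vector(kv, '问题'))                           # [0.1 0.2]
print(safe_get_vector(kv, '这是一个不存在的词', np.zeros(2)))  # [0. 0.]
```
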
<h3 id="35-近义词">3.5 近义词</h3>
<p>根据词语查询近义词，返回最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;情况&#39;, 0.6178732514381409),
 (&#39;现象&#39;, 0.5385990142822266),
 (&#39;此类情况&#39;, 0.418301522731781),
 (&#39;留言&#39;, 0.4179410934448242),
 (&#39;一事&#39;, 0.40703579783439636),
 (&#39;事项&#39;, 0.39551448822021484),
 (&#39;事情&#39;, 0.3860214948654175),
 (&#39;情形&#39;, 0.38478103280067444),
 (&#39;事件&#39;, 0.36725184321403503),
 (&#39;现像&#39;, 0.3665226995944977)]
</code></pre></div><br>
<p>根据语义向量查询近义词，返回最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">question_vector</span> <span class="o">=</span> <span class="n">w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">)</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">question_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;问题&#39;, 1.0),
 (&#39;情况&#39;, 0.6178732514381409),
 (&#39;现象&#39;, 0.5385990142822266),
 (&#39;此类情况&#39;, 0.4183014929294586),
 (&#39;留言&#39;, 0.4179410934448242),
 (&#39;一事&#39;, 0.40703579783439636),
 (&#39;事项&#39;, 0.39551448822021484),
 (&#39;事情&#39;, 0.3860214948654175),
 (&#39;情形&#39;, 0.38478103280067444),
 (&#39;事件&#39;, 0.36725184321403503)]
</code></pre></div><br>
<h3 id="36-计算多个词的中心向量">3.6 计算多个词的中心向量</h3>
<p>我们可以计算「经济」、「建设」、「发展」的中心向量 eco_vector。 并试图寻找中心向量 eco_vector 的最相似的 10 个词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">eco_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span>
                                  <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;经济&#39;</span><span class="p">,</span> <span class="s1">&#39;建设&#39;</span><span class="p">,</span> <span class="s1">&#39;发展&#39;</span><span class="p">])</span>


<span class="c1"># 寻找 eco_vector 语义最相似的10个词</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">eco_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;发展&#39;, 0.8317984938621521),
 (&#39;建设&#39;, 0.7508440613746643),
 (&#39;经济&#39;, 0.6406075954437256),
 (&#39;经济社会发展&#39;, 0.6385446786880493),
 (&#39;发展壮大&#39;, 0.6317417621612549),
 (&#39;化发展&#39;, 0.5961641073226929),
 (&#39;大力发展&#39;, 0.585274338722229),
 (&#39;经济腾飞&#39;, 0.5823679566383362),
 (&#39;产业&#39;, 0.5820372700691223),
 (&#39;高质量发展&#39;, 0.5803337097167969)]
</code></pre></div><p>语义捕捉的很准。</p>
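<p>semantic_centroid 的基本思路可以理解为：先对每个词向量做 L2 归一化，再取均值。以下是一个用玩具向量演示该思路的示意（假设性实现，具体算法细节以 cntext 源码为准）：</p>

```python
import numpy as np

def centroid(vectors):
    # 先对每个向量做 L2 归一化再取均值，避免模长大的向量主导结果
    normed = [v / np.linalg.norm(v) for v in vectors]
    return np.mean(normed, axis=0)

# 玩具向量演示：两个正交单位向量的中心向量
v_economy = np.array([1.0, 0.0])
v_develop = np.array([0.0, 1.0])
c = centroid([v_economy, v_develop])
print(c)  # [0.5 0.5]
```
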
<br>
<h3 id="37-概念轴">3.7 概念轴</h3>
<p>男性概念向量由多个男性词的向量加总求均值得到，女性概念向量算法类似。当性质或方向明显相反的两个概念向量相减， 得到的新的向量，我们可以称之为<strong><em>概念轴向量 Concept Axis</em></strong>。</p>
<p>将几个城市词的词向量在「冷热」概念轴向量上进行投影，得到的数值越大，表示越接近 poswords（寒冷）一端，即越寒冷。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 数值越大，表示越接近 poswords（寒冷）一端，越寒冷。</span>
<span class="n">ct</span><span class="o">.</span><span class="n">sematic_projection</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span>
                     <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;杭州&#39;</span><span class="p">,</span> <span class="s1">&#39;哈尔滨&#39;</span><span class="p">,</span> <span class="s1">&#39;广州&#39;</span><span class="p">,</span> <span class="s1">&#39;潍坊&#39;</span><span class="p">],</span>
                     <span class="n">poswords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;寒冷&#39;</span><span class="p">,</span> <span class="s1">&#39;冰雪&#39;</span><span class="p">],</span>
                     <span class="n">negwords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;炎热&#39;</span><span class="p">,</span> <span class="s1">&#39;酷暑&#39;</span><span class="p">],</span>
                     <span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;杭州&#39;, -2.52), (&#39;广州&#39;, -2.06), (&#39;潍坊&#39;, 2.18), (&#39;哈尔滨&#39;, 2.78)]
</code></pre></div><br>
<p>投影体现出城市的冷热， 体现了语言模型中蕴含着人类的认知(文化、偏见、记忆)。 类似的概念轴，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 尺寸(大, 小)
- 湿度(干燥,潮湿)
- 财富(富裕, 贫穷)
- 性别(男, 女)
- 等
</code></pre></div><p>其实任意概念的向量也可看作概念轴，即该概念向量与零向量相减。只不过由两组性质、方向相反的词构造出的概念轴，在语义上更稳定。</p>
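<p>概念轴投影的计算原理大致为：轴向量 = 正极词质心 - 负极词质心，归一化后与目标词向量做点积。以下用玩具向量演示该原理（假设性实现，非 cntext 源码）：</p>

```python
import numpy as np

def project(word_vec, pos_vecs, neg_vecs):
    # 概念轴 = 正极词质心 - 负极词质心；投影值 = 词向量与单位轴向量的点积
    axis = np.mean(pos_vecs, axis=0) - np.mean(neg_vecs, axis=0)
    axis = axis / np.linalg.norm(axis)
    return float(word_vec @ axis)

cold = [np.array([1.0, 0.0])]   # 「寒冷」一侧的玩具向量
hot = [np.array([-1.0, 0.0])]   # 「炎热」一侧的玩具向量
harbin = np.array([0.9, 0.3])   # 假想的「哈尔滨」向量
print(project(harbin, cold, hot))  # 0.9，投影值越大越接近「寒冷」一端
```
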
<p><br><br></p>
<h2 id="相关资料">相关资料</h2>
<ul>
<li><a href="https://textdata.cn/blog/management_python_course/">视频课 | Python 实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用 Ollama 与大模型将文本数据转化为结构化数据</a></li>
<li><a href="https://textdata.cn/blog/">https://textdata.cn/</a></li>
</ul>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>教程 | 使用大模型将文本数据转化为结构化数据(阿里云百炼)</title>
      <link>https://textdata.cn/blog/2025-09-12-text-analysis-with-qwen-and-cntext/</link>
      <pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-09-12-text-analysis-with-qwen-and-cntext/</guid>
      <description>实验数据为外卖评论， 今天咱们做个有难度的文本分析任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)文本分析（也称为文本挖掘或自然语言处理，NLP）是指使用计算机算法和技术从大量文本数据中提取有价值信息的过程。文本分析的目标是从非结构化的文本数据中识别模式、提取关键信息、理解语义，并将其转化为结构化数据以便进一步分析和应用。</description>
      <content:encoded><![CDATA[<h2 id="一任务">一、任务</h2>
<p>之前分享了本地部署的文本编码教程</p>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本编码为结构化数据(本地Ollama篇)</a></li>
<li><a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/">教程 | 使用大模型将文本编码为结构化数据(本地LM-Studio篇)</a></li>
</ul>
<p>经过实验，发现本地编码速度实在感人（约 3 秒一条）。cntext2.x 此前未做优化，只能同步地依次编码每条文本，分析 1000 条至少需要 3000 秒，速度实在太慢。</p>
<p>经过这几天打磨，ct.llm 已内置异步处理机制，调用云端服务（以阿里云百炼模型平台为例），编码 1000 条仅耗时约 20 秒。今天将实验代码分享给大家。</p>
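<p>异步加速的思路大致是：用 asyncio 并发发起多条请求，并用 Semaphore 限制同时在途的请求数，避免触发平台限流。以下是一个独立的示意（与 cntext 内部实现无关，用 sleep 模拟 API 调用的网络延迟）：</p>

```python
import asyncio

async def encode_one(sem, text):
    # 真实场景中这里是一次大模型 API 调用；此处用 sleep 模拟网络延迟
    async with sem:
        await asyncio.sleep(0.01)
        return {'text': text, 'score': 0.0}

async def encode_all(texts, concurrency=10):
    # Semaphore 限制并发量；gather 并发等待所有任务完成
    sem = asyncio.Semaphore(concurrency)
    tasks = [encode_one(sem, t) for t in texts]
    return await asyncio.gather(*tasks)

# 并发编码 100 条文本，总耗时约为 串行耗时/并发量
results = asyncio.run(encode_all([f'评论{i}' for i in range(100)]))
print(len(results))  # 100
```
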
<p><br><br></p>
<h2 id="二配置环境">二、配置环境</h2>
<h3 id="21-安装cntext">2.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip install cntext --upgrade
</code></pre></div><blockquote>
<p>cntext2.1.7版本llm支持异步处理多条文本。</p>
</blockquote>
<br>
<h3 id="22-阿里云">2.2 阿里云</h3>
<h4 id="221-平台介绍">2.2.1 平台介绍</h4>
<p>使用阿里云百炼平台，只需几行Python代码即可轻松调用通义千问Qwen大模型。它提供简洁API接口，支持快速集成到应用中，实现高效文本生成与对话能力。无需复杂配置，适合快速原型开发与轻量级AI应用部署。</p>
<p>初次使用</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="n">api_key</span><span class="o">=</span> <span class="s1">&#39;你自己的api_key&#39;</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
    <span class="c1"># 如何获取API Key：https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key</span>
    <span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">,</span> 
    <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;https://dashscope.aliyuncs.com/compatible-mode/v1&#34;</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">completion</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="c1"># 模型列表：https://help.aliyun.com/zh/model-studio/getting-started/models</span>
    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;qwen-plus&#34;</span><span class="p">,</span>  
    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;你是文本编码的专家, 现在需要你将收到的文本编码情感值，情感值范围为-1.0~1.0，-1.0表示负面情感，1.0表示正面情感，0.0表示中性情感。结果返回JSON格式，格式为{&#34;score&#34;: 0.0}&#39;</span><span class="p">},</span>
        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;今天很开心啊， 我很喜欢这个国家&#39;</span><span class="p">}</span>
    <span class="p">]</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">completion</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#34;score&#34;: 0.8}
CPU times: user 31.8 ms, sys: 5.82 ms, total: 37.6 ms
Wall time: 554 ms
</code></pre></div><p>使用qwen-plus 单次编码的时间是 554毫秒。我整理了通义千问目前的模型定位、速度与价格。</p>
<table>
<thead>
<tr>
<th style="text-align:left">模型名</th>
<th style="text-align:left">定位</th>
<th style="text-align:left">输入成本(每千token)</th>
<th style="text-align:left">输出成本(每千token)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">qwen-max</td>
<td style="text-align:left">最强综合能力</td>
<td style="text-align:left">0.0024</td>
<td style="text-align:left">0.0096</td>
</tr>
<tr>
<td style="text-align:left">qwen-plus</td>
<td style="text-align:left">平衡性能与成本</td>
<td style="text-align:left">0.0008</td>
<td style="text-align:left">0.002</td>
</tr>
<tr>
<td style="text-align:left">qwen-turbo</td>
<td style="text-align:left">快速响应</td>
<td style="text-align:left">0.0003</td>
<td style="text-align:left">0.003</td>
</tr>
<tr>
<td style="text-align:left">qwen-flash</td>
<td style="text-align:left">极致速度与低成本</td>
<td style="text-align:left">0.00015</td>
<td style="text-align:left">0.0015</td>
</tr>
</tbody>
</table>
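<p>按上表单价可以粗略估算一次批量编码任务的费用。下面的小函数按每千 token 单价计算，单价取自上文表格，实际价格以阿里云官网为准；每条文本的 token 数也需按实际统计：</p>

```python
# 每千 token 单价（元），取自上文表格，实际以阿里云官网为准
PRICES = {
    'qwen-plus': {'input': 0.0008, 'output': 0.002},
    'qwen-flash': {'input': 0.00015, 'output': 0.0015},
}

def estimate_cost(model, input_tokens, output_tokens):
    # 费用 = 输入 token 数/1000 * 输入单价 + 输出 token 数/1000 * 输出单价
    p = PRICES[model]
    return input_tokens / 1000 * p['input'] + output_tokens / 1000 * p['output']

# 假设编码 1000 条评论，平均每条输入 200 token、输出 30 token
cost = estimate_cost('qwen-plus', 1000 * 200, 1000 * 30)
print(round(cost, 4))  # 约 0.22 元
```
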
<br>
<h4 id="222-如何配置阿里云">2.2.2 如何配置阿里云</h4>
<p>配置起来应该不难，大致有充值、申请api-key、选择一个模型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- api-key https://bailian.console.aliyun.com/?tab=model#/api-key
- 账单详情 https://billing-cost.console.aliyun.com/finance/expense-report/expense-detail-by-instance
- 充值 https://billing-cost.console.aliyun.com/fortune/fund-management/recharge
- 模型列表 https://bailian.console.aliyun.com/?tab=model#/model-market
</code></pre></div><p><img loading="lazy" src="img/01-api-key.png" alt=""  />

<img loading="lazy" src="img/02-model-list.png" alt=""  />

<img loading="lazy" src="img/03-billings.png" alt=""  />
</p>
<br>
<br>
<h2 id="三cntext2x介绍">三、cntext2x介绍</h2>
<h3 id="31-内置提示词模板">3.1 内置提示词模板</h3>
<p>cntext2.x 内置的提示词模板不仅支持 sentiment，还支持分类、实体识别等其他任务。具体如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;sentiment&#39;,
 &#39;emotion&#39;,
 &#39;classify&#39;,
 &#39;intent&#39;,
 &#39;keywords&#39;,
 &#39;entities&#39;,
 &#39;summarize&#39;,
 &#39;rewrite&#39;,
 &#39;quality&#39;,
 &#39;similarity&#39;]
</code></pre></div><br>
<p>查看模板内容</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_get</span><span class="p">(</span><span class="s1">&#39;sentiment&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{
  &#34;prompt&#34;: &#34;分析评论的情感倾向：返回情感类别 label（pos 表示正面，neg 表示负面，neutral 表示中性）和情感分值 score（取值范围 -1~1，负数为负面）。结果返回JSON格式，格式为{&#39;label&#39;: &#39;pos&#39;, &#39;score&#39;: 0.5}&#34;,
  &#34;output_format&#34;: {&#34;label&#34;: &#34;str&#34;, &#34;score&#34;: &#34;float&#34;}
}
</code></pre></div><p>内置模板是通用型设计，不够聚焦具体场景。各位可根据自己的研究问题和数据场景，设计适合自己的提示词。</p>
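<p>output_format 的作用可以理解为：约束并校验大模型返回的 JSON 的字段名与类型。其校验逻辑大致如下（示意性实现，非 cntext 源码）：</p>

```python
import json

def validate_output(raw, output_format):
    # 解析模型返回的 JSON，检查每个字段是否存在且类型符合 output_format 声明
    type_map = {'str': str, 'float': (int, float), 'int': int}
    data = json.loads(raw)
    for key, tname in output_format.items():
        if key not in data or not isinstance(data[key], type_map[tname]):
            raise ValueError(f'字段 {key} 缺失或类型不是 {tname}')
    return data

result = validate_output('{"label": "pos", "score": 0.5}',
                         {'label': 'str', 'score': 'float'})
print(result)  # {'label': 'pos', 'score': 0.5}
```
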
<br>
<h3 id="32-自定义提示词模板">3.2 自定义提示词模板</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;&#39;&#39;
</span><span class="s1">请从以下三个维度对外卖评论进行分析，每个维度打分范围为 0.0 ~ 1.0：
</span><span class="s1">- 口味（taste）：食物的味道、质量如何。
</span><span class="s1">- 速度（speed）：配送是否及时，出餐快不快。
</span><span class="s1">- 服务（service）：骑手态度、商家沟通、售后等服务体验。
</span><span class="s1">
</span><span class="s1">请返回 JSON 格式结果，形如：
</span><span class="s1">{&#34;taste&#34;: 0.8, &#34;speed&#34;: 0.5, &#34;service&#34;: 0.9}
</span><span class="s1">&#39;&#39;&#39;</span>

<span class="n">OUTPUT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">}</span>


<span class="n">score</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s2">&#34;服务很棒&#34;</span><span class="p">,</span>
               <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span> <span class="p">,</span>
               <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT</span><span class="p">,</span>
               <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;https://dashscope.aliyuncs.com/compatible-mode/v1&#34;</span><span class="p">,</span> <span class="c1">#阿里云百炼base_url</span>
               <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;你自己的api_key&#34;</span><span class="p">,</span> <span class="c1">#更改自己的阿里云百炼的api_key</span>
               <span class="n">model_name</span> <span class="o">=</span> <span class="s1">&#39;qwen-plus&#39;</span><span class="p">)</span>

<span class="n">score</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 41.1 ms, sys: 4.74 ms, total: 45.8 ms
Wall time: 1.26 s

{&#39;taste&#39;: 0.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 1.0}
</code></pre></div><p>使用不同的模型，结果会有不同， 建议使用qwen-plus模型，该模型兼顾了性能与成本，速度也不慢。</p>
<br>
<h3 id="33-批量处理">3.3 批量处理</h3>
<p>ct.llm()支持处理单条文本，也支持异步批处理多条文本。在上一节已经展示了单条处理能力，接下来介绍如何批量处理多条文本。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#构造实验数据， 或者直接读取data.csv(含text字段)</span>

<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">&#34;这家的红烧肉简直入口即化，米饭香喷喷的，就是外卖盒没盖紧，汤洒了一半，心痛！&#34;</span><span class="p">,</span>
    <span class="s2">&#34;从下单到敲门只用了18分钟，骑手小哥飞毛腿啊！不过面有点坨了，汤也不够热。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;连续下雨天都点它家，每次送来的饭都是热的，还贴心地放了暖宝宝，服务太暖了，菜也好吃！&#34;</span><span class="p">,</span>
    <span class="s2">&#34;等了快一个小时，客服说骑手摔了，挺担心的，但最后饭送来时都凉透了，味道还行，就是体验差。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;螺蛳粉味道正宗，酸笋够味，还多给了一份炸蛋！就是包装没封好，电梯里一股味儿，社死现场……&#34;</span><span class="p">,</span>
    <span class="s2">&#34;披萨送到时歪了，芝士都滑到一边，颜值崩塌！但味道确实香，骑手态度也好，下次希望改进包装。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;深夜加班点的馄饨，没想到25分钟就到了，感动哭了！热汤下肚瞬间回血，五星好评送温暖。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;咖喱鸡里的土豆没煮熟，硬的！反馈给商家，人家说‘下次注意’就完了？这服务也太敷衍了吧。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;味道普普通通，不难吃也不惊艳，但配送准时，包装严实，骑手还发消息说放门口了，很省心。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;生日点的蛋糕，商家写了祝福语还送了小蜡烛，仪式感拉满！就是配送稍慢，路上奶油有点化了。&#34;</span>
<span class="p">]</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">texts</span><span class="p">})</span>
<span class="c1">#df = pd.read_csv(&#39;data.csv&#39;)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 定义提示词：要求模型从三个维度打分</span>
<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;&#39;&#39;
</span><span class="s1">请从以下三个维度对外卖评论进行分析，每个维度打分范围为 0.0 ~ 1.0：
</span><span class="s1">- 口味（taste）：食物的味道、质量如何。
</span><span class="s1">- 速度（speed）：配送是否及时，出餐快不快。
</span><span class="s1">- 服务（service）：骑手态度、商家沟通、售后等服务体验。
</span><span class="s1">
</span><span class="s1">请返回 JSON 格式结果，形如：
</span><span class="s1">{&#34;taste&#34;: 0.8, &#34;speed&#34;: 0.5, &#34;service&#34;: 0.9}
</span><span class="s1">&#39;&#39;&#39;</span>

<span class="c1"># 定义期望的输出格式</span>
<span class="n">OUTPUT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="s2">&#34;float&#34;</span><span class="p">}</span>

<span class="n">result_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span>   <span class="c1">#传入文本列表</span>
                   <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span> <span class="p">,</span>
                   <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT</span><span class="p">,</span>
                   <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;https://dashscope.aliyuncs.com/compatible-mode/v1&#34;</span><span class="p">,</span> <span class="c1">#阿里云百炼base_url</span>
                   <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;你自己的api_key&#34;</span><span class="p">,</span> <span class="c1">#更改自己的阿里云百炼的api_key</span>
                   <span class="n">model_name</span> <span class="o">=</span> <span class="s1">&#39;qwen-plus&#39;</span><span class="p">,</span> 
                   <span class="c1">#rate_limit参数， 控制访问速度。注意整数和浮点数含义不同。 整数是一分钟访问次数， 浮点数是一秒钟访问次数。</span>
                   <span class="c1">#rate_limit=10.0,  #控制速度，一秒10次请求。</span>
                   <span class="n">return_df</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="c1">#结果返回dataframe</span>

<span class="c1">#结果中加入原始数据text字段</span>
<span class="c1">#注意，一定要有 .tolist()</span>
<span class="n">result_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>

<span class="c1">#保存结果。</span>
<span class="n">result_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">result_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 95.1 ms, sys: 20.3 ms, total: 115 ms
Wall time: 2.69 s
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="34-耗时对比">3.4 耗时对比</h3>
<p>编码 10 条耗时 2.69 秒，速度还是很快的。那么编码 1000 条需要多久？</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="n">result_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span><span class="o">*</span><span class="mi">100</span><span class="p">,</span>   <span class="c1">#传入1000条文本(重复100次的列表)</span>
                   <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span> <span class="p">,</span>
                   <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT</span><span class="p">,</span>
                   <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;https://dashscope.aliyuncs.com/compatible-mode/v1&#34;</span><span class="p">,</span> <span class="c1">#阿里云百炼base_url</span>
                   <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;你自己的api_key&#34;</span><span class="p">,</span> <span class="c1">#更改自己的阿里云百炼的api_key</span>
                   <span class="n">model_name</span> <span class="o">=</span> <span class="s1">&#39;qwen-plus&#39;</span><span class="p">,</span> 
                   <span class="c1">#rate_limit参数， 控制访问速度。注意整数和浮点数含义不同。 整数是一分钟访问次数， 浮点数是一秒钟访问次数。</span>
                   <span class="c1">#rate_limit=10.0,  #控制速度，一秒10次请求。</span>
                   <span class="n">return_df</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="c1">#结果返回dataframe</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 6.01 s, sys: 1.18 s, total: 7.19 s
Wall time: 20.9 s
</code></pre></div><p>可以看到，1000 条编码耗时 20.9 秒，约为 10 条耗时(2.69 秒)的 8 倍，而不是 100 倍。之所以如此快，得益于 ct.llm 内部支持异步处理，可以同时发起多条文本的请求，提高编码效率。</p>
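<p>异步并发的提速原理可以用一个极简的 asyncio 示意来理解(以下仅为示意代码，fake_llm_call 是假设的模拟函数，并非 cntext 内部实现)：并发发起请求时，总耗时约等于最慢的一次请求，而不是所有请求耗时之和。</p>

```python
import asyncio
import time

async def fake_llm_call(text: str) -> str:
    """模拟一次大模型请求(假设单次耗时 0.1 秒)。"""
    await asyncio.sleep(0.1)
    return f"coded:{text}"

async def batch_code(texts):
    # 并发发起所有请求；总耗时约等于最慢的一次请求
    return await asyncio.gather(*[fake_llm_call(t) for t in texts])

start = time.perf_counter()
results = asyncio.run(batch_code([f"评论{i}" for i in range(100)]))
elapsed = time.perf_counter() - start
# 顺序执行 100 次约需 10 秒；并发执行远小于 1 秒
print(len(results), round(elapsed, 2))
```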
<p><br><br></p>
<h2 id="四实战代码模板">四、实战代码模板</h2>
<p>如果需要处理的数据量特别大，可以采用以下处理技巧:</p>
<ol>
<li>先用少量数据测试，确认所选模型的速度与编码质量。</li>
<li>分批次编码并保存结果，避免断网或服务器异常导致数据丢失。</li>
</ol>
<p>假设 data.csv 含字段 reviewid、rating、text，分析结果 csv 中也要保留这三个字段。以下是分批次处理、依次保存编码结果的代码。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">time</span>


<span class="c1">#读取提示工程</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;你设计的PROMPT&#39;</span>
<span class="n">OUTPUT_FORMAT</span> <span class="o">=</span> <span class="s1">&#39;你设计的output_format&#39;</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s2">&#34;https://dashscope.aliyuncs.com/compatible-mode/v1&#34;</span> <span class="c1">#选择服务器，这里是阿里云百炼的base_url</span>
<span class="n">API_KEY</span> <span class="o">=</span> <span class="s1">&#39;你的api_key&#39;</span>  <span class="c1">#你的阿里云百炼的api_key</span>
<span class="n">MAX_RETRY</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1">#重复次数(编码失败后的异常处理)</span>
<span class="n">TEMPERATURE</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1">#数字越小，回答越靠谱(随机性小)。</span>
<span class="n">MODEL</span> <span class="o">=</span> <span class="s1">&#39;qwen-plus&#39;</span> <span class="c1">#选择兼顾质量、速度的模型</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">🚀 开始处理模型: </span><span class="si">{</span><span class="n">MODEL</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>


<span class="c1">#每批次100条。使用chunksize后，可以不用设置rate_limit。</span>
<span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data.csv&#39;</span><span class="p">,</span> <span class="n">chunksize</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chunk_dfs</span><span class="p">):</span>
    <span class="n">batch_texts</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
    <span class="n">batch_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">batch_texts</span><span class="p">,</span> 
                      <span class="n">prompt</span> <span class="o">=</span> <span class="n">PROMPT</span><span class="p">,</span> 
                      <span class="n">output_format</span> <span class="o">=</span> <span class="n">OUTPUT_FORMAT</span><span class="p">,</span> 
                      <span class="n">base_url</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="p">,</span> 
                      <span class="n">api_key</span> <span class="o">=</span> <span class="n">API_KEY</span><span class="p">,</span> 
                      <span class="n">model_name</span> <span class="o">=</span> <span class="n">MODEL</span><span class="p">,</span> 
                      <span class="n">temperature</span> <span class="o">=</span> <span class="n">TEMPERATURE</span><span class="p">,</span> 
                      <span class="n">max_retries</span> <span class="o">=</span> <span class="n">MAX_RETRY</span><span class="p">,</span> 
                      <span class="n">return_df</span> <span class="o">=</span> <span class="kc">True</span><span class="p">,</span> 
                      <span class="n">verbose</span><span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
    <span class="c1">#保存原始信息reviewid、rating</span>
    <span class="n">batch_df</span><span class="p">[</span><span class="s1">&#39;reviewid&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;reviewid&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
    <span class="n">batch_df</span><span class="p">[</span><span class="s1">&#39;rating&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;rating&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
    <span class="n">batch_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>

    <span class="c1">#分批次存入新的csv</span>
    <span class="k">if</span> <span class="n">idx</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
        <span class="n">header</span><span class="o">=</span><span class="kc">True</span>
        <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;w&#39;</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">header</span><span class="o">=</span><span class="kc">False</span>
        <span class="n">mode</span> <span class="o">=</span> <span class="s1">&#39;a&#39;</span>
    <span class="n">batch_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">MODEL</span><span class="si">}</span><span class="s1">-result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="n">mode</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="n">header</span><span class="p">)</span>
        
<span class="n">now_time</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">duration</span> <span class="o">=</span> <span class="nb">round</span><span class="p">((</span><span class="n">now_time</span><span class="o">-</span><span class="n">start_time</span><span class="p">)</span><span class="o">/</span><span class="mi">60</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;✅ </span><span class="si">{</span><span class="n">MODEL</span><span class="si">}</span><span class="s2"> 处理完成，耗时 </span><span class="si">{</span><span class="n">duration</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2"> 分钟&#34;</span><span class="p">)</span>
</code></pre></div><br>
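<p>分批追加写入还有一个好处：脚本中断后可以断点续跑。下面是一个示意(done_rows 是本文虚构的辅助函数，并假设结果文件与 data.csv 的行序一致)：重启时先统计结果文件已编码多少行，再跳过已完成的批次。</p>

```python
import csv
import os

def done_rows(result_path: str) -> int:
    """统计结果 csv 已编码的行数(扣除表头)；文件不存在则返回 0。"""
    if not os.path.exists(result_path):
        return 0
    with open(result_path, newline='', encoding='utf-8') as f:
        n = sum(1 for _ in csv.reader(f))
    return max(n - 1, 0)

# 用法示意：重启脚本时跳过已完成的批次(每批 100 条)
# skip = done_rows('qwen-plus-result.csv')
# for idx, chunk_df in enumerate(pd.read_csv('data.csv', chunksize=100)):
#     if (idx + 1) * 100 <= skip:
#         continue  # 该批次已处理过
#     ...
```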
<p>我用上面的代码在 10000 条在线评论数据上做了实验，分别选用 qwen-flash、qwen-turbo、qwen-plus、qwen-max 4 个模型。耗时统计如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">🚀 开始处理模型: qwen-flash
✅ qwen-flash 处理完成，耗时 13.72 分钟

🚀 开始处理模型: qwen-turbo
✅ qwen-turbo 处理完成，耗时 17.23 分钟

🚀 开始处理模型: qwen-plus
✅ qwen-plus 处理完成，耗时 34.29 分钟

🚀 开始处理模型: qwen-max
✅ qwen-max 处理完成，耗时 48.79 分钟
</code></pre></div><p>标注质量方面，qwen-max 最好，qwen-plus 其次，qwen-flash、qwen-turbo 的质量都一般；综合质量、速度和成本，推荐使用 qwen-plus。</p>
<table>
<thead>
<tr>
<th style="text-align:left">模型名</th>
<th style="text-align:left">定位</th>
<th style="text-align:left">输入成本(每千token)</th>
<th style="text-align:left">输出成本(每千token)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">qwen-max</td>
<td style="text-align:left">最强综合能力</td>
<td style="text-align:left">0.0024</td>
<td style="text-align:left">0.0096</td>
</tr>
<tr>
<td style="text-align:left"><strong>qwen-plus</strong></td>
<td style="text-align:left">平衡性能与成本</td>
<td style="text-align:left">0.0008</td>
<td style="text-align:left">0.002</td>
</tr>
<tr>
<td style="text-align:left">qwen-turbo</td>
<td style="text-align:left">快速响应</td>
<td style="text-align:left">0.0003</td>
<td style="text-align:left">0.003</td>
</tr>
<tr>
<td style="text-align:left">qwen-flash</td>
<td style="text-align:left">极致速度与低成本</td>
<td style="text-align:left">0.00015</td>
<td style="text-align:left">0.0015</td>
</tr>
</tbody>
</table>
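<p>按上表单价还可以粗估一次编码任务的成本。下面的小脚本仅为示意：其中每条评论平均输入 400 token(含提示词)、输出 50 token 是假设值，实际应以自己的 token 用量为准。</p>

```python
# 单价来自上表，单位：每千 token
PRICES = {
    'qwen-max':   (0.0024, 0.0096),
    'qwen-plus':  (0.0008, 0.002),
    'qwen-turbo': (0.0003, 0.003),
    'qwen-flash': (0.00015, 0.0015),
}

def estimate_cost(model: str, n_texts: int, in_tok: int = 400, out_tok: int = 50) -> float:
    """估算编码 n_texts 条文本的成本(假设每条输入 in_tok、输出 out_tok 个 token)。"""
    p_in, p_out = PRICES[model]
    return n_texts * (in_tok / 1000 * p_in + out_tok / 1000 * p_out)

for m in PRICES:
    print(m, round(estimate_cost(m, 10000), 2))
```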
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>PNAS | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext2.x 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用 Ollama 本地大模型 DIY 制作单词书教案 PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用 scrapegraph-ai(大模型方案)自动采集网页数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext2.x 使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python 实证指标构建与文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a></p>
<br>
<br>
</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>教程 | 使用大模型将文本数据转化为结构化数据(本地LM-Studio篇)</title>
      <link>https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/</link>
      <pubDate>Tue, 09 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/</guid>
      <description>实验数据为外卖评论， 今天咱们做个有难度的文本分析任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)。文本分析（也称为文本挖掘或自然语言处理，NLP）是指使用计算机算法和技术从大量文本数据中提取有价值信息的过程。文本分析的目标是从非结构化的文本数据中识别模式、提取关键信息、理解语义，并将其转化为结构化数据以便进一步分析和应用。</description>
      <content:encoded><![CDATA[<h2 id="一任务">一、任务</h2>
<p>之前分享的 <a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本编码为结构化数据(本地Ollama篇)</a> 是基于 Ollama 的大模型标注教程。今天咱们换个新工具 <a href="https://lmstudio.ai/">LM Studio</a>。</p>
<p>使用 LM Studio 和 cntext 进行文本分析。</p>
<p><br><br></p>
<h2 id="二配置环境">二、配置环境</h2>
<h3 id="21-安装cntext">2.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip install cntext --upgrade
</code></pre></div><br>
<h3 id="22-安装lm-studio">2.2 安装LM Studio</h3>
<p>在官网 <a href="https://lmstudio.ai/">LM Studio</a> 点击下载， 该软件支持Win、Mac。</p>
<p><img loading="lazy" src="img/01-cover.png" alt=""  />
</p>
<br>
<h3 id="23-配置lms命令行工具">2.3 配置lms命令行工具</h3>
<p>为方便调试，需要配置lms命令行工具。</p>
<ul>
<li>Windows 打开 cmd， 执行命令
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap
</code></pre></div></li>
<li>mac打开terminal， 执行命令
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">~/.lmstudio/bin/lms bootstrap
</code></pre></div></li>
</ul>
<br>
<h3 id="24-安装模型">2.4 安装模型</h3>
<p>模型可以选择点击操作安装，也可通过命令行安装。</p>
<p><img loading="lazy" src="img/02-model-install.png" alt=""  />
</p>
<p>打开<a href="https://lmstudio.ai/models/">LM Studio模型列表</a>， 选择个小的模型进行安装。 可以找到 <strong>qwen3-4b</strong>。打开命令行cmd (Mac打开terminal)， 执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">lms get qwen/qwen3-4b
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">
   🡇 To download: model qwen/qwen3-4b - 170.89 KB
   └─ 🡇 To download: Qwen3 4B 4BIT [MLX] - 2.28 GB

About to download 2.28 GB.
Continue? (Y/N): Y
⠦ [▏                     ] 0.00% |  707 B / 2.28 GB | 837.68 B/s | ETA 755:46:24
⠏ [██████████████████████] 99.87% | 2.28 GB / 2.28 GB | 7.31 MB/s | ETA 00:00
Finalizing download...
Download completed.
</code></pre></div><p>大概 10 分钟安装完成， 模型体积约 2.2GB。</p>
<br>
<p>当然了也可选择图形化安装，如图</p>
<p><img loading="lazy" src="img/03-ui-install.png" alt=""  />
</p>
<br>
<h3 id="25-查看已安装模型">2.5 查看已安装模型</h3>
<p>查看电脑内已安装的模型，打开cmd(Mac打开terminal)， 执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">lms ls
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">You have 5 models, taking up 26.42 GB of disk space.

LLM               PARAMS    ARCH     SIZE                 
qwen/qwen3-4b               qwen3    2.28 GB              
qwen3-1.7b-mlx              qwen3    984.01 MB            
qwen3-32b-mlx               qwen3    18.45 GB             
qwen3-8b-mlx                qwen3    4.62 GB      ✓ LOADED

EMBEDDING                               PARAMS    ARCH          SIZE        
text-embedding-nomic-embed-text-v1.5              Nomic BERT    84.11 MB  
</code></pre></div><p>图形化查看已经安装的模型</p>
<p><img loading="lazy" src="img/04-model-ls.png" alt=""  />
</p>
<br>
<br>
<h2 id="三使用lm-studio">三、使用LM Studio</h2>
<h3 id="31-ui界面尝试">3.1 UI界面尝试</h3>
<p><img loading="lazy" src="img/05-ui-chat.png" alt=""  />

<img loading="lazy" src="img/06ui-chat.png" alt=""  />
</p>
<br>
<h3 id="32-启动lm-studio服务">3.2 启动LM Studio服务</h3>
<p>启动服务，方便Python调用LM Studio。 打开cmd(Mac打开terminal)， 启动服务</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">lms server start
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Success! Server is now running on port 1234
</code></pre></div><p>注意: 后续如果想关闭服务， 执行命令 <code>lms server stop</code></p>
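<p>如果不确定服务是否已启动，可以先用一小段 Python 探测本地端口再调用。以下为示意代码：/v1/models 是 OpenAI 兼容接口中常见的模型列表端点，端口 1234 为 LM Studio 的默认端口。</p>

```python
import urllib.error
import urllib.request

def server_alive(url: str = "http://localhost:1234/v1/models", timeout: float = 2.0) -> bool:
    """探测本地 LM Studio 服务是否可访问。"""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

print(server_alive())  # 服务已启动时输出 True
```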
<br>
<h3 id="33-初次尝试">3.3 初次尝试</h3>
<p>使用 cntext 内置的 sentiment 提示词模板，启动 LM Studio 服务后，调用 qwen/qwen3-4b 模型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;服务很棒！&#34;</span><span class="p">,</span> <span class="s2">&#34;服务一般！&#34;</span><span class="p">,</span> <span class="s2">&#34;服务很差！&#34;</span><span class="p">]</span>
<span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span>
    <span class="n">score</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> 
                   <span class="n">task</span><span class="o">=</span><span class="s2">&#34;sentiment&#34;</span><span class="p">,</span> 
                   <span class="n">backend</span><span class="o">=</span><span class="s2">&#34;lmstudio&#34;</span><span class="p">,</span>  
                   <span class="n">model_name</span><span class="o">=</span><span class="s2">&#34;qwen/qwen3-4b&#34;</span><span class="p">)</span>

    <span class="nb">print</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[cntext2x] ✅ 连接模型服务: http://localhost:1234/v1

服务很棒！ {&#39;label&#39;: &#39;pos&#39;, &#39;score&#39;: 0.9}
服务一般！ {&#39;label&#39;: &#39;neutral&#39;, &#39;score&#39;: 0.0}
服务很差！ {&#39;label&#39;: &#39;neg&#39;, &#39;score&#39;: -0.95}

CPU times: user 110 ms, sys: 8.3 ms, total: 119 ms
Wall time: 9.14 s
</code></pre></div><br>
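<p>逐条返回的是 Python 字典，很方便直接做汇总统计。以下用上面三条标注结果做个小示意，仅用标准库，不依赖 pandas：</p>

```python
from collections import Counter
from statistics import mean

# 上面三条评论的标注结果
results = [
    {'label': 'pos', 'score': 0.9},
    {'label': 'neutral', 'score': 0.0},
    {'label': 'neg', 'score': -0.95},
]

label_counts = Counter(r['label'] for r in results)
avg_score = mean(r['score'] for r in results)
print(label_counts)         # 各情感类别的数量
print(round(avg_score, 4))  # 平均情感分值
```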
<h3 id="34-内置提示词模板">3.4 内置提示词模板</h3>
<p>cntext2.x 内置的提示词模板不止支持 sentiment，还覆盖分类、实体识别等其他任务。具体如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;sentiment&#39;,
 &#39;emotion&#39;,
 &#39;classify&#39;,
 &#39;intent&#39;,
 &#39;keywords&#39;,
 &#39;entities&#39;,
 &#39;summarize&#39;,
 &#39;rewrite&#39;,
 &#39;quality&#39;,
 &#39;similarity&#39;]
</code></pre></div><br>
<p>查看模板内容</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_get</span><span class="p">(</span><span class="s1">&#39;emotion&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;prompt&#39;: &#39;识别文本中的主要情绪类型：从 [开心, 愤怒, 悲伤, 惊讶, 厌恶, 恐惧, 中性] 中选择最匹配的一项，返回情绪类型 emotion 和置信度 confidence（0~1）&#39;,
 &#39;output_format&#39;: {&#39;emotion&#39;: &#39;str&#39;, &#39;confidence&#39;: &#39;float&#39;}}
</code></pre></div><br>
<h3 id="35-自定义提示词模板">3.5 自定义提示词模板</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span>
<span class="n">OUTPUT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>

<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;服务很棒！&#34;</span><span class="p">,</span> <span class="s2">&#34;服务一般！&#34;</span><span class="p">,</span> <span class="s2">&#34;服务很差！&#34;</span><span class="p">]</span>
<span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span>
    <span class="n">score</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
           <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span> <span class="p">,</span>
           <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT</span><span class="p">,</span>
           <span class="n">backend</span><span class="o">=</span><span class="s2">&#34;lmstudio&#34;</span><span class="p">,</span>  
           <span class="n">model_name</span><span class="o">=</span><span class="s2">&#34;qwen/qwen3-4b&#34;</span><span class="p">)</span>
    <span class="n">score</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">text</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">score</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;taste&#39;: 0.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 1.0, &#39;text&#39;: &#39;服务很棒！&#39;}
{&#39;taste&#39;: 0.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 1.0, &#39;text&#39;: &#39;服务一般！&#39;}
{&#39;taste&#39;: 0.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 1.0, &#39;text&#39;: &#39;服务很差！&#39;}
CPU times: user 114 ms, sys: 8.12 ms, total: 122 ms
Wall time: 8.79 s
</code></pre></div><p>速度有点慢。如果想加速，可以考虑改用 qwen3-1.7b 等更小的模型(b 前面的数字代表参数量，数字越小，模型运行越快，但标注质量也越差)。</p>
<br>
<br>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>PNAS | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext2.x 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用 Ollama 本地大模型 DIY 制作单词书教案 PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用 scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext2.x 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python 实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 2012年-2025年港股ESG报告数据集</title>
      <link>https://textdata.cn/blog/2024-06-26-hongkong-environmental-social-governance-dataset/</link>
      <pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-26-hongkong-environmental-social-governance-dataset/</guid>
      <description>&lt;p&gt;ESG的全称是环境（Environmental）、社会（Social）、和公司治理（Governance）。这是一个框架，用于评估企业运营对环境的影响、企业与社会的关系，以及企业的内部治理结构和流程。ESG概念广泛应用于可持续投资领域，帮助投资者理解企业在非财务指标上的表现，从而做出更加全面的投资决策。&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一esg概况&#34;&gt;一、ESG概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名称: 港股ESG报告数据集
语言类型: 中文
记录数量: 10350
数据格式: TXT/PDF/CSV
数据体积: 67 G
会计年度: 2012 ~ 2025
发布日期: 2013-03-11 ~ 2025-06-13

声明:   科研用途； 如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;p&gt;TXT、PDF都是单个的文件，每个文件对应一家公司某年度的ESG报告。而 CSV 则是汇总数据文件， 一个文件内含有所有TXT的信息。&lt;/p&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;香港ESG(中文).csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-发布日期&#34;&gt;2.2 发布日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;报告发布日期起: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;strftime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;%Y-%m-&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;%d&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;报告发布日期止: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;strftime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;%Y-%m-&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;%d&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;报告发布日期起:  2013-03-11
报告发布日期止:  2025-06-13
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-统计年度报告量&#34;&gt;2.3 Report Counts by Year&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 文泉驿微米黑.ttf (WenQuanYi Micro Hei font file) must sit in the same folder as this script&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;grey&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;港股中文ESG报告发布数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;报告数&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三声明&#34;&gt;3. Notice&lt;/h2&gt;
&lt;p&gt;For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>ESG stands for Environmental, Social, and Governance. It is a framework for assessing a company's environmental impact, its relationship with society, and its internal governance structures and processes. ESG is widely used in sustainable investing, helping investors evaluate corporate performance on non-financial dimensions and make better-informed investment decisions.</p>
<br>
<h2 id="一esg概况">1. ESG Dataset Overview</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset name:      港股ESG报告数据集 (Hong Kong stock ESG reports)
Language:          Chinese
Records:           10350
Formats:           TXT/PDF/CSV
Size:              67 GB
Fiscal years:      2012 ~ 2025
Publication dates: 2013-03-11 ~ 2025-06-13

Notice: For research use only; for questions, add WeChat 372335839 with the note 「Name-University-Major」
</code></pre></div><p><img loading="lazy" src="img/01-screen.png" alt=""  />
</p>
<br>
<br>
<h2 id="二查看数据">2. Exploring the Data</h2>
<p>The TXT and PDF files are individual files, each corresponding to one company's ESG report for a given year. The CSV is an aggregated data file: a single file containing the contents of all the TXT files.</p>
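<p>If you would rather work from the individual TXT files than the aggregated CSV, they can be collected into a DataFrame. The sketch below is illustrative only: the <code>code_year.txt</code> filename pattern is an assumption, so adapt the parsing to the actual file names in the dataset.</p>

```python
import pandas as pd
from pathlib import Path

def collect_txt_reports(folder):
    """Gather txt reports named like '00700_2023.txt' into a DataFrame.
    The 'code_year' naming pattern is an assumption; adapt as needed."""
    records = []
    for f in sorted(Path(folder).glob('*.txt')):
        code, year = f.stem.split('_')
        records.append({'code': code,
                        'year': int(year),
                        'text': f.read_text(encoding='utf-8')})
    return pd.DataFrame(records)
```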
<h3 id="21-读取数据">2.1 Loading the Data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;香港ESG(中文).csv.gz&#39;</span><span class="p">)</span>

<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="22-发布日期">2.2 Publication Dates</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;报告发布日期起: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;报告发布日期止: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">报告发布日期起:  2013-03-11
报告发布日期止:  2025-06-13
</code></pre></div><br>
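<p>If any pub_date value were malformed, pd.to_datetime would raise an error by default; passing errors='coerce' turns unparseable values into NaT, which min() and max() then ignore. A self-contained sketch with made-up dates:</p>

```python
import pandas as pd

# Made-up sample; in the dataset this would be df['pub_date']
pub = pd.to_datetime(pd.Series(['2013-03-11', '2025-06-13', 'not-a-date']),
                     errors='coerce')  # unparseable values become NaT

print(pub.min().strftime('%Y-%m-%d'))  # 2013-03-11
print(pub.max().strftime('%Y-%m-%d'))  # 2025-06-13
print(int(pub.isna().sum()))           # 1 unparseable row
```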
<h3 id="23-统计年度报告量">2.3 Report Counts by Year</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1"># 文泉驿微米黑.ttf (WenQuanYi Micro Hei font file) must sit in the same folder as this script</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_col</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;港股中文ESG报告发布数量&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;报告数&#39;</span><span class="p">)</span>
<span class="p">)</span>
  
</code></pre></div><p><img loading="lazy" src="img/04-plot.png" alt=""  />
</p>
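<p>If plotnine is unavailable, matplotlib alone can draw the same labeled bar chart. A minimal sketch: the counts below are illustrative, not the dataset's actual numbers, and the labels are kept in ASCII here to sidestep the CJK font setup shown above.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; lets the script run without a display
import matplotlib.pyplot as plt

# Illustrative counts; in practice use df['year'].value_counts().sort_index()
counts = {2021: 900, 2022: 1100, 2023: 1300}

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar([str(y) for y in counts], list(counts.values()))
for i, v in enumerate(counts.values()):
    ax.text(i, v, str(v), ha='center', va='bottom', color='grey')
ax.set_xlabel('Fiscal year')
ax.set_ylabel('Reports')
ax.set_title('ESG reports per year (illustrative)')
fig.savefig('yearly_counts.png')
```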
<p><br><br></p>
<h2 id="三声明">3. Notice</h2>
<p>For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Dataset | 2001-2024 A-Share Listed Company Annual Reports &amp; MD&amp;A</title>
      <link>https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/</link>
      <pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/</guid>
      <description>&lt;h2 id=&#34;一数据集介绍&#34;&gt;1. Dataset Overview&lt;/h2&gt;
&lt;p&gt;The 2001-2024 A-share annual report dataset contains 4 files, about 16 GB in total.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 管理层讨论与分析txt.zip
- 年报txt.zip
- A01-24.csv.gz
- mda01-24.csv.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/a-mda.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;Note&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The zip archives hold the raw data; once extracted, they contain individual txt files.&lt;/li&gt;
&lt;li&gt;The gz files are aggregated data; once decompressed, each yields a csv file.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;声明&#34;&gt;Notice&lt;/h3&gt;
&lt;p&gt;For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二年报数据&#34;&gt;2. Annual Report Data&lt;/h2&gt;
&lt;p&gt;Annual report data for 2001-2024. The data contain only three fields: year, code, and text. To add information such as company short name or industry, merge with &lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;&lt;strong&gt;数据集 | A股上市公司基本信息&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;A01-24.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;Number of annual report records&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;67384
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Number of unique listed companies&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;5728
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三mda数据&#34;&gt;3. MD&amp;amp;A Data&lt;/h2&gt;
&lt;p&gt;MD&amp;amp;A data for 2001-2024. The data contain only three fields: year, code, and text. To add information such as company short name or industry, merge with &lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;&lt;strong&gt;数据集 | A股上市公司基本信息&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-24.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;65456
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Number of unique listed companies&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;5706
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四说明&#34;&gt;4. Notes&lt;/h2&gt;
&lt;p&gt;As the counts above show, there are fewer MD&amp;amp;A records than annual report records. This is because mda01-24.csv.gz is generated from A01-24.csv.gz: annual reports are not produced from a single template, layouts differ across companies, and even a single company changes its layout across years. Since no finite set of extraction rules can cover every layout, programmatic MD&amp;amp;A extraction yields slightly fewer samples than the annual reports. The extraction tool is the &lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/&#34;&gt;cntext 2.1.6 library developed by Dadeng&lt;/a&gt;, via its built-in function &lt;code&gt;mda=ct.extract_mda(text)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We do not show the extraction process here; we only report, per year, the ratio of MD&amp;amp;A records to annual report records.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;查看每年mda记录量与年报记录量之比&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2001&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2025&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;mda_record_num&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;anual_report_record_num&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;anual_report_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; :&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mda_record_num&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;anual_report_record_num&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;查看每年mda记录量与年报记录量之比
2001 : 0.6546700942587832
2002 : 0.8569105691056911
2003 : 0.9287925696594427
2004 : 0.9550398839738942
2005 : 0.9707602339181286
2006 : 0.9745879120879121
2007 : 0.9821882951653944
2008 : 0.9846153846153847
2009 : 0.9859075535512966
2010 : 0.9868544600938968
2011 : 0.9894291754756871
2012 : 0.9891696750902527
2013 : 0.9901458415451321
2014 : 0.9905767056162834
2015 : 0.9922616953921913
2016 : 0.9926681542875359
2017 : 0.9934528892684316
2018 : 0.9892384105960265
2019 : 0.9639227642276422
2020 : 0.9642857142857143
2021 : 0.9310064935064936
2022 : 0.9838492597577388
2023 : 0.9901137847416527
2024 : 1.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五相关内容&#34;&gt;5. Related Content&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/&#34;&gt;&lt;strong&gt;数据集 | 港股年报文本数据集(2007 ~ 2023.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/&#34;&gt;&lt;strong&gt;数据集(付费) | 三板上市公司年报2002-2023.12&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/&#34;&gt;&lt;strong&gt;数据集 | 美股年报10-K、20-F数据(2000-2023.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;&lt;strong&gt;词向量 | 使用MD&amp;amp;A2001-2024语料训练Word2Vec模型&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-01-06-mda_informative_content/&#34;&gt;中国工业经济 | MD&amp;amp;A信息含量指标构建代码实现&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/&#34;&gt;金融研究 | 使用Python构建「关键审计事项信息含量」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/&#34;&gt;中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/&#34;&gt;代码 | 使用 MD&amp;amp;A文本测量「企业不确定性感知FEPU」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;&lt;strong&gt;数据集 | A股上市公司基本信息&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集介绍">1. Dataset Overview</h2>
<p>The 2001-2024 A-share annual report dataset contains 4 files, about 16 GB in total.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 管理层讨论与分析txt.zip
- 年报txt.zip
- A01-24.csv.gz
- mda01-24.csv.gz
</code></pre></div><p><img loading="lazy" src="img/a-mda.png" alt=""  />
</p>
<br>
<p>Note</p>
<ul>
<li>The zip archives hold the raw data; once extracted, they contain individual txt files.</li>
<li>The gz files are aggregated data; once decompressed, each yields a csv file.</li>
</ul>
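<p>The zip archives do not have to be fully extracted: Python's standard zipfile module can read individual txt members in place. A minimal sketch; the archive and member names are placeholders:</p>

```python
import zipfile

def read_member(zip_path, member, encoding='utf-8'):
    """Read one txt member out of a zip archive without extracting it all."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return f.read().decode(encoding)
```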
<h3 id="声明">Notice</h3>
<p>For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.</p>
<p><br><br></p>
<h2 id="二年报数据">2. Annual Report Data</h2>
<p>Annual report data for 2001-2024. The data contain only three fields: year, code, and text. To add information such as company short name or industry, merge with <a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/"><strong>数据集 | A股上市公司基本信息</strong></a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">anual_report_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;A01-24.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">anual_report_df</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
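<p>The merge with a company-information table is a plain left join on the stock code. A minimal sketch — the name and industry columns here are hypothetical stand-ins for whatever fields the information dataset actually provides:</p>

```python
import pandas as pd

# Toy report frame with the three fields the dataset provides
reports = pd.DataFrame({'year': [2023, 2023],
                        'code': ['000001', '600000'],
                        'text': ['...', '...']})
# Hypothetical company-information table
info = pd.DataFrame({'code': ['000001', '600000'],
                     'name': ['平安银行', '浦发银行'],
                     'industry': ['银行', '银行']})

# Left join keeps every report row, attaching company attributes
merged = reports.merge(info, on='code', how='left')
print(merged[['year', 'code', 'name', 'industry']])
```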
<p>Number of annual report records</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">anual_report_df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">67384
</code></pre></div><br>
<p>Number of unique listed companies</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">anual_report_df</span><span class="o">.</span><span class="n">code</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5728
</code></pre></div><br>
<br>
<h2 id="三mda数据">三、MD&amp;A数据</h2>
<p>2001-2024年MD&amp;A数据， 数据中只有year、code、text三个字段， 如果想增加诸如公司简称、行业等信息， 可以使用 <a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/"><strong>数据集 | A股上市公司基本信息</strong></a>   进行并表。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mda_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda01-24.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">mda_df</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">mda_df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">65456
</code></pre></div><br>
<p>上市公司总数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mda_df</span><span class="o">.</span><span class="n">code</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5706
</code></pre></div><p><br><br></p>
<h2 id="四说明">四、说明</h2>
<p>从代码运行结果可以发现，md&amp;a 记录量少于年报记录量。原因在于 mda01-24.csv.gz 是从 A01-24.csv.gz 中提取生成的：上市公司年报并非出自同一套模板，不同公司模板各异，甚至同一公司前后年度报告的排版也会变化。程序化提取 md&amp;a 时，排版规则无法穷举，因此 md&amp;a 样本量略小于年报样本量。提取 md&amp;a 的工具是 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/">大邓开发的cntext2.1.6库</a>，使用其内置函数 <code>mda = ct.extract_mda(text)</code>。</p>
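<p>「排版规则提取」的思路可以用下面的极简正则示意（仅为原理演示，并非 cntext 的实际实现；真实年报的排版远比示例复杂，这也正是部分样本提取失败的原因）：</p>

```python
import re

def extract_mda_sketch(text):
    # 原理示意：定位「管理层讨论与分析」标题，截取到下一个「第X节」标题为止；
    # 模板不同的年报需要不同的规则，规则无法穷举时该样本即提取失败
    pattern = r'管理层讨论与分析(.*?)(?=第[一二三四五六七八九十]+节|$)'
    m = re.search(pattern, text, re.S)
    return m.group(1).strip() if m else ''

report = '第三节 管理层讨论与分析\n报告期内公司经营情况良好。\n第四节 重要事项'
print(extract_mda_sketch(report))  # 报告期内公司经营情况良好。
```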
<p>我们这里不展示提取过程，仅展示每年 md&amp;a 记录量与年报记录量之比。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">anual_report_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">anual_report_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;查看每年mda记录量与年报记录量之比&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2001</span><span class="p">,</span> <span class="mi">2025</span><span class="p">):</span>
    <span class="n">mda_record_num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">mda_df</span><span class="p">[</span><span class="n">mda_df</span><span class="o">.</span><span class="n">year</span><span class="o">==</span><span class="n">year</span><span class="p">])</span>
    <span class="n">anual_report_record_num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">anual_report_df</span><span class="p">[</span><span class="n">anual_report_df</span><span class="o">.</span><span class="n">year</span><span class="o">==</span><span class="n">year</span><span class="p">])</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1"> :&#39;</span><span class="p">,</span> <span class="n">mda_record_num</span><span class="o">/</span><span class="n">anual_report_record_num</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">查看每年mda记录量与年报记录量之比
2001 : 0.6546700942587832
2002 : 0.8569105691056911
2003 : 0.9287925696594427
2004 : 0.9550398839738942
2005 : 0.9707602339181286
2006 : 0.9745879120879121
2007 : 0.9821882951653944
2008 : 0.9846153846153847
2009 : 0.9859075535512966
2010 : 0.9868544600938968
2011 : 0.9894291754756871
2012 : 0.9891696750902527
2013 : 0.9901458415451321
2014 : 0.9905767056162834
2015 : 0.9922616953921913
2016 : 0.9926681542875359
2017 : 0.9934528892684316
2018 : 0.9892384105960265
2019 : 0.9639227642276422
2020 : 0.9642857142857143
2021 : 0.9310064935064936
2022 : 0.9838492597577388
2023 : 0.9901137847416527
2024 : 1.0
</code></pre></div><p><br><br></p>
<h2 id="五相关内容">五、相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/"><strong>数据集 | 港股年报文本数据集(2007 ~ 2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/"><strong>数据集(付费) | 三板上市公司年报2002-2023.12</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/"><strong>数据集 | 美股年报10-K、20-F数据(2000-2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/"><strong>词向量 | 使用MD&amp;A2001-2024语料训练Word2Vec模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | MD&amp;A信息含量指标构建代码实现</a></li>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/">中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息</a></li>
<li><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">代码 | 使用 MD&amp;A文本测量「企业不确定性感知FEPU」</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/"><strong>数据集 | A股上市公司基本信息</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>词向量 | 使用 MD&amp;A2001-2024 语料训练 Word2Vec/GloVe 模型</title>
      <link>https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/</link>
      <pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/</guid>
      <description>&lt;h2 id=&#34;一数据集&#34;&gt;一、数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-数据概况&#34;&gt;1.1 数据概况&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;&lt;strong&gt;数据集 | 2001-2024 年 A 股上市公司年报&amp;amp;管理层讨论与分析&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据名称: 2001-2024年A股上市公司年报&amp;amp;管理层讨论与分析
数据来源: 上海证券交易所、深圳证券交易所
数据格式: csv、txt
公司数量: 5706
MD&amp;amp;A数量: 65519
会计年度: 2001-2024
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;12-读取-mda-数据&#34;&gt;1.2 读取 md&amp;amp;a 数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 读取全部 md&amp;amp;a 数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-24.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# gz解压后读取csv&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# df = pd.read_csv(&amp;#39;mda01-24.csv&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;65519
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;二训练-word2vec--glove-模型&#34;&gt;二、训练 Word2Vec &amp;amp; GloVe 模型&lt;/h2&gt;
&lt;h3 id=&#34;21-准备语料&#34;&gt;2.1 准备语料&lt;/h3&gt;
&lt;p&gt;从 &lt;strong&gt;mda01-24.csv.gz&lt;/strong&gt; 数据中抽取出所有文本，写入到 &lt;strong&gt;mda01-24.txt&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-24.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;最终得到 3.34G 的语料文件。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-配置-cntext-环境&#34;&gt;2.2 配置 cntext 环境&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install cntext --upgrade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-开始训练&#34;&gt;2.3 开始训练&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-24.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 语料文件&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;             &lt;span class=&#34;c1&#34;&gt;# 中文语料&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;            &lt;span class=&#34;c1&#34;&gt;# 嵌入的维度数&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;             &lt;span class=&#34;c1&#34;&gt;# 词语上下文的窗口大小&lt;/span&gt;



&lt;span class=&#34;n&#34;&gt;glove_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GloVe&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-24.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/mda01-24_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|█| 27404772/27404772 [04:38&amp;lt;00:00, 9
Reading Preprocessed Corpus from output/mda01-24_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 1625 s. 
Output Saved To: output/mda01-24-Word2Vec.200.15.bin


Mac(Linux) System, Enable Parallel Processing
Cache output/mda01-24_cache.txt Found, Skip Preprocessing Corpus
Start Training GloVe
BUILDING VOCABULARY
Using vocabulary of size 536863.

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 353975745 lines.

Using random seed 1746091798
SHUFFLING COOCCURRENCES
Merging temp files: processed 353975745 lines.

TRAINING MODEL
Read 353975745 lines.
Using random seed 1746091864
05/01/25 - 05:32.08PM, iter: 001, cost: 0.115862
05/01/25 - 05:33.04PM, iter: 002, cost: 0.082325
05/01/25 - 05:34.00PM, iter: 003, cost: 0.070848
......
......
05/01/25 - 05:43.23PM, iter: 013, cost: 0.050617
05/01/25 - 05:44.19PM, iter: 014, cost: 0.050079
05/01/25 - 05:45.16PM, iter: 015, cost: 0.049582

GloVe Training Cost 1366 s. 
Output Saved To: output/mda01-24-GloVe.200.15.bin
CPU times: user 1h 28min 19s, sys: 2min 6s, total: 1h 30min 26s
Wall time: 49min 55s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;经过约 50 分钟(CPU 时间约 1.5 小时)，训练得到中国 A 股管理层讨论与分析语料的 Word2Vec 和 GloVe 词向量模型(如下截图)。模型可广泛用于经济管理等领域概念(情感)词典的构建或扩展。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;mda01-24_cache.txt&lt;/strong&gt; 缓存文件&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mda01-24-Word2Vec.200.15.bin&lt;/strong&gt; Word2Vec 模型文件&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mda01-24-GloVe.200.15.bin&lt;/strong&gt; GloVe 模型文件&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/pretained-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三使用模型&#34;&gt;三、使用模型&lt;/h2&gt;
&lt;h3 id=&#34;31-导入模型&#34;&gt;3.1 导入模型&lt;/h3&gt;
&lt;p&gt;使用 &lt;strong&gt;&lt;em&gt;ct.load_w2v(w2v_path)&lt;/em&gt;&lt;/strong&gt; 来导入刚刚训练好的模型 &lt;strong&gt;&lt;em&gt;mda01-24-GloVe.200.15.bin&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;__version__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/mda01-24-Word2Vec.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;glove_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/mda01-24-GloVe.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2.1.6
Loading output/mda01-24-Word2Vec.200.15.bin...
Loading output/mda01-24-GloVe.200.15.bin...
&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x633060fe0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-评估模型&#34;&gt;3.2 评估模型&lt;/h3&gt;
&lt;p&gt;使用近义法和类比法， 判断模型的表现。详情可查看&lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;文档&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;以 word2vec 为例，评估模型表现&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   425    |    112     |            0.42            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&amp;lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   455    |    222     |   31.21    |   4.30   |
|   CityInProvince   |   175    |     0      |   97.71    |   1.26   |
| FamilyRelationship |    90    |    182     |   10.00    |   5.89   |
|   SocialScience    |    9     |     61     |   44.44    |   4.50   |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;近义测试&lt;/strong&gt;: Spearman&amp;rsquo;s Rank Coefficient 系数取值[-1, 1]，取值越大，说明模型表现越好。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;类比测试&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries 中文 md&amp;amp;a 语料在此项表现较差，应该是语料中各国首都的提及较少，也从侧面反映出多数企业国际化程度不高。盲猜美股年报的 CapitalOfCountries 表现应该好于 A 股。&lt;/li&gt;
&lt;li&gt;CityInProvince 中文 md&amp;amp;a 语料在此项表现十分优异，说明 A 股多数企业扎根于中国大地，年报 md&amp;amp;a 中省市地名提及频繁。&lt;/li&gt;
&lt;li&gt;FamilyRelationship 中文 md&amp;amp;a 语料主要体现公司组织层面的内容，较少提及家庭关系词语，此类别表现一般也就不难理解。&lt;/li&gt;
&lt;li&gt;SocialScience 中文 md&amp;amp;a 语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;整体而言，语料训练的效果很不错，抓住了数据场景的独特性语义。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-keyedvectors-的操作方法或属性&#34;&gt;3.3 KeyedVectors 的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.index_to_key&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.key_to_index&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.vector_size&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取 GloVe 模型中任意词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.get_vector(word)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.get_mean_vector(words)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取多个单词词向量的均值向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;34-查看词汇量维度数&#34;&gt;3.4 查看词汇量&amp;amp;维度数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 词汇量&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word2Vec词汇量: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GloVe词汇量: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;glove_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word2Vec维度数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GloVe维度数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;glove_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Word2Vec词汇量:  902666
GloVe词汇量:     536864
Word2Vec维度数:  200
GloVe维度数:     200
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;35-词表&#34;&gt;3.5 词表&lt;/h3&gt;
&lt;p&gt;查看词表&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index_to_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;公司&amp;#39;,
 &amp;#39;适用&amp;#39;,
 &amp;#39;情况&amp;#39;,
 &amp;#39;项目&amp;#39;,
 &amp;#39;产品&amp;#39;,
 ...
 &amp;#39;比上&amp;#39;,
 &amp;#39;境内&amp;#39;,
 &amp;#39;最终&amp;#39;,
 &amp;#39;启动&amp;#39;,
 ...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;查看词汇映射表&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;公司&amp;#39;: 0,
 &amp;#39;适用&amp;#39;: 1,
 &amp;#39;情况&amp;#39;: 2,
 &amp;#39;项目&amp;#39;: 3,
 &amp;#39;产品&amp;#39;: 4,
 ......
 &amp;#39;比上&amp;#39;: 996,
 &amp;#39;境内&amp;#39;: 997,
 &amp;#39;最终&amp;#39;: 998,
 &amp;#39;启动&amp;#39;: 999,
 ...}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;36-查看词向量&#34;&gt;3.6 查看词向量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查询某词的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 2.2048666 ,  1.7392347 ,  3.8569732 ,  2.181498  ,  0.49182096,
       -2.0054908 ,  0.55133677,  0.97385484,  3.6563325 , -2.1495004 ,
       -4.8804154 ,  2.8375697 ,  2.071349  ,  3.0867636 , -1.3978149 ,
       -0.38058507, -2.379905  , -1.8974878 ,  3.596266  ,  0.44742537,
        ......
        0.13521506, -0.78970003, -0.8154422 ,  1.015166  ,  0.30753416,
       -6.1991196 , -2.2295246 ,  0.797445  , -0.21968505,  1.6549479 ,
       -1.1522037 , -1.5377268 , -3.4639692 , -3.3877385 ,  3.5285642 ,
        0.9497059 , -2.6022844 ,  1.6192312 , -0.39254257, -0.5094183 ],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查询多个词的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_mean_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 0.01322632,  0.01596442,  0.08699574,  0.06786569,  0.00441768,
       -0.04059787,  0.01970061,  0.02050735,  0.04548474, -0.01610814,
       -0.10554063,  0.08021796,  0.10255495,  0.06383747, -0.07158516,
        0.00185056, -0.02854855, -0.09506228,  0.1032301 , -0.05448814,
       ......
       -0.01035122, -0.02931183, -0.03785197,  0.04421834,  0.04357708,
       -0.15989086, -0.05572033,  0.02324059, -0.08414906,  0.02760434,
        0.01254621, -0.02324901, -0.05535778, -0.06064604,  0.0409652 ,
       -0.04119795, -0.08222105,  0.03998823, -0.03626942, -0.01975589],
      dtype=float32)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;37-近义词&#34;&gt;3.7 近义词&lt;/h3&gt;
&lt;p&gt;根据词语查找最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;技术创新&amp;#39;, 0.6993309855461121),
 (&amp;#39;不断创新&amp;#39;, 0.6758015155792236),
 (&amp;#39;创新型&amp;#39;, 0.636788547039032),
 (&amp;#39;创新能力&amp;#39;, 0.6053606271743774),
 (&amp;#39;引领&amp;#39;, 0.604947566986084),
 (&amp;#39;硬核&amp;#39;, 0.5690070986747742),
 (&amp;#39;前沿&amp;#39;, 0.5627986788749695),
 (&amp;#39;赋能&amp;#39;, 0.5582684278488159),
 (&amp;#39;创新性&amp;#39;, 0.5509947538375854),
 (&amp;#39;革新&amp;#39;, 0.5494255423545837)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;根据某词的词向量查询最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;creativeness_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;creativeness_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;创新&amp;#39;, 1.0),
 (&amp;#39;技术创新&amp;#39;, 0.6993309855461121),
 (&amp;#39;不断创新&amp;#39;, 0.6758015155792236),
 (&amp;#39;创新型&amp;#39;, 0.636788547039032),
 (&amp;#39;创新能力&amp;#39;, 0.6053606271743774),
 (&amp;#39;引领&amp;#39;, 0.6049476265907288),
 (&amp;#39;硬核&amp;#39;, 0.5690070986747742),
 (&amp;#39;前沿&amp;#39;, 0.5627986788749695),
 (&amp;#39;赋能&amp;#39;, 0.5582684278488159),
 (&amp;#39;创新性&amp;#39;, 0.5509947538375854)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;多个词求得均值向量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;AI_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_mean_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ai&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;机器学习&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;自然语言处理&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;AI_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;ai&amp;#39;, 0.9074109792709351),
 (&amp;#39;机器学习&amp;#39;, 0.8809980750083923),
 (&amp;#39;自然语言处理&amp;#39;, 0.8750396966934204),
 (&amp;#39;ai模型&amp;#39;, 0.8575210571289062),
 (&amp;#39;人工智能&amp;#39;, 0.8506893515586853),
 (&amp;#39;nlp&amp;#39;, 0.8240388035774231),
 (&amp;#39;语言模型&amp;#39;, 0.8206671476364136),
 (&amp;#39;模态模型&amp;#39;, 0.8144882917404175),
 (&amp;#39;深度学习&amp;#39;, 0.7912176847457886),
 (&amp;#39;生成式&amp;#39;, 0.7850476503372192),
 (&amp;#39;自然语言&amp;#39;, 0.7846022248268127),
 (&amp;#39;llm&amp;#39;, 0.7809537649154663),
 (&amp;#39;大模&amp;#39;, 0.7670232653617859),
 (&amp;#39;gpt&amp;#39;, 0.7638874053955078),
 (&amp;#39;自然语言理解&amp;#39;, 0.7441188097000122),
 (&amp;#39;知识图谱&amp;#39;, 0.7421959638595581),
 (&amp;#39;生成式ai&amp;#39;, 0.7387682199478149),
 (&amp;#39;aigc&amp;#39;, 0.7381091117858887),
 (&amp;#39;ai算法&amp;#39;, 0.7311530709266663),
 (&amp;#39;语音识别&amp;#39;, 0.7257674932479858)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;短视主义词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;short_term_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_mean_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;尽快&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;年内&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;马上&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;short_term_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;尽快&amp;#39;, 0.7294592261314392),
 (&amp;#39;年内&amp;#39;, 0.7279667854309082),
 (&amp;#39;尽早&amp;#39;, 0.6742831468582153),
 (&amp;#39;马上&amp;#39;, 0.6565427184104919),
 (&amp;#39;即将&amp;#39;, 0.61030113697052),
 (&amp;#39;早日&amp;#39;, 0.6024956107139587),
 (&amp;#39;争取早日&amp;#39;, 0.5442042946815491),
 (&amp;#39;争取尽早&amp;#39;, 0.5283723473548889),
 (&amp;#39;抓紧&amp;#39;, 0.5254929661750793),
 (&amp;#39;争取&amp;#39;, 0.5205905437469482),
 (&amp;#39;短时间&amp;#39;, 0.5205082297325134),
 (&amp;#39;争取尽快&amp;#39;, 0.5160724520683289),
 (&amp;#39;按期&amp;#39;, 0.51212477684021),
 (&amp;#39;后续&amp;#39;, 0.5105950236320496),
 (&amp;#39;力争早日&amp;#39;, 0.5102716684341431),
 (&amp;#39;提前&amp;#39;, 0.5060917139053345),
 (&amp;#39;力争&amp;#39;, 0.4955942928791046),
 (&amp;#39;力争尽早&amp;#39;, 0.4942554235458374),
 (&amp;#39;最后&amp;#39;, 0.4882470369338989),
 (&amp;#39;立即&amp;#39;, 0.4858567416667938)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四扩展词典&#34;&gt;四、扩展词典&lt;/h2&gt;
&lt;p&gt;做词典法的文本分析，最重要的是有自己的领域词典。之前受限于技术难度，文科生的我也一直在用形容词的通用情感词典。现在依托 word2vec 技术，可以提升人工构建词典的准确率和效率。&lt;/p&gt;
&lt;p&gt;下面是在 &lt;strong&gt;&lt;em&gt;mda01-24-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt; 上做的词典扩展测试，函数 &lt;strong&gt;&lt;em&gt;ct.expand_dictionary(wv, seeddict, topn=100)&lt;/em&gt;&lt;/strong&gt; 会根据种子词选出语义最接近的 topn 个候选词。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;wv&lt;/em&gt;&lt;/strong&gt; 预训练模型，数据类型为 gensim.models.keyedvectors.KeyedVectors。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;seeddict&lt;/em&gt;&lt;/strong&gt; 种子词词典，格式为 Python 字典。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;topn&lt;/em&gt;&lt;/strong&gt; 返回 topn 个语义最接近 seeddict 的词，默认 100。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;假设现在有种子词 seeddicts， 内含我构建的 &lt;strong&gt;&lt;em&gt;短视词&lt;/em&gt;&lt;/strong&gt;、 &lt;strong&gt;&lt;em&gt;创新词&lt;/em&gt;&lt;/strong&gt;、 &lt;strong&gt;&lt;em&gt;竞争词&lt;/em&gt;&lt;/strong&gt;， 我希望生成最终各含 30 个词的候选词表 txt 文件。&lt;/p&gt;
&lt;p&gt;可以使用 &lt;strong&gt;&lt;em&gt;ct.expand_dictionary&lt;/em&gt;&lt;/strong&gt; 进行如下操作&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;seeddicts&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;短视词&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;抓紧&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;立刻&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;月底&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;年底&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;年终&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;争取&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;力争&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;创新词&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;科技&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;技术&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;标准&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;竞争词&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;竞争&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;竞争力&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;expand_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;seeddict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;seeddicts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;30&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Finish! 短视词 candidates saved to output/短视词.txt
Finish! 创新词 candidates saved to output/创新词.txt
Finish! 竞争词 candidates saved to output/竞争词.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-expand.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
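&lt;p&gt;ct.expand_dictionary 的核心思路并不复杂：对每类种子词求均值向量，再按余弦相似度对全词表排序、剔除种子词后取前 topn。下面用纯 Python 写一个最小示意（非 cntext 源码，函数名为自拟，仅演示原理）：&lt;/p&gt;

```python
import math

# 示意代码(非 cntext 源码): 复现"种子词均值向量 + 余弦相似度排序"的扩展思路
def expand_seed_words(vocab, seeds, topn=5):
    # vocab: {词: 向量列表}; seeds: 种子词列表
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    dim = len(next(iter(vocab.values())))
    # 1) 种子词均值向量
    mean = [sum(vocab[w][i] for w in seeds) / len(seeds) for i in range(dim)]
    # 2) 剔除种子词, 按余弦相似度从高到低取 topn 个候选词
    cands = [(w, cosine(mean, v)) for w, v in vocab.items() if w not in seeds]
    return sorted(cands, key=lambda x: x[1], reverse=True)[:topn]
```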
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;六获取模型&#34;&gt;六、获取模型&lt;/h2&gt;
&lt;p&gt;内容创作不易，本文为付费内容，以下模型文件免费提供下载。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 免费     mda01-24-Word2Vec.200.15.bin   链接: https://pan.baidu.com/s/1Gke4UKOnswpctp8vsZ0koQ?pwd=dpry

- 免费     mda01-24-GloVe.200.15.bin 链接: https://pan.baidu.com/s/1TqoA4TqMAhLzpIp0ZvrQEA?pwd=ajjw

- 更多免费词向量      https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;p&gt;相关文献&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[0]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[1]冉雅璇,李志强,刘佳妮,张逸石.大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用[J].南开管理评论:1-27.
[3]胡楠,薛付婧,王昊楠.管理者短视主义影响企业长期投资吗？——基于文本分析和机器学习[J].管理世界,2021,37(05):139-156+11+19-21.
[4]Kai Li, Feng Mai, Rui Shen, Xinyan Yan, Measuring Corporate Culture Using Machine Learning, *The Review of Financial Studies*,2020
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)文本挖掘文献汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/text_analysis_code_list_about_ms/&#34;&gt;LIST | 文本分析代码汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/&#34;&gt;词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;Python 实证指标构建与文本分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库 cntext2.x 使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/&#34;&gt;使用 5000w 专利申请数据集按年份(按省份)训练词向量&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/&#34;&gt;词向量 | 使用1985年-2025年专利申请摘要训练 Word2Vec 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/&#34;&gt;词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/&#34;&gt;实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/&#34;&gt;可视化 | 人民日报语料反映七十年文化演变&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集">一、数据集</h2>
<h3 id="11-数据概况">1.1 数据概况</h3>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/"><strong>数据集 | 2001-2024 年 A 股上市公司年报&amp;管理层讨论与分析</strong></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据名称: 2001-2024年A股上市公司年报&amp;管理层讨论与分析
数据来源: 上海证券交易所、深圳证券交易所
数据格式: csv、txt
公司数量: 5706
MD&amp;A数量: 65519
会计年度: 2001-2024
</code></pre></div><h3 id="12-读取-mda-数据">1.2 读取 md&amp;a 数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># 读取全部数据(如仅预览, 可加参数 nrows=5)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda01-24.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1"># gz解压后读取csv</span>
<span class="c1"># df = pd.read_csv(&#39;mda01-24.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">65519
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h2 id="二训练-word2vec--glove-模型">二、训练 Word2Vec &amp; GloVe 模型</h2>
<h3 id="21-准备语料">2.1 准备语料</h3>
<p>从 <strong>mda01-24.csv.gz</strong> 数据中抽取出所有文本，写入到 <strong>mda01-24.txt</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;mda01-24.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">))</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>最终得到 3.34G 的语料文件。</p>
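<p>上面的写法会把整列文本一次性拼成约 3.34G 的大字符串；若内存有限，也可以用标准库逐行流式导出（示意代码，函数名为自拟）：</p>

```python
import csv
import gzip

def export_text_column(csv_path, txt_path, column='text'):
    """流式读取 csv(.gz), 把 text 列逐行写入 txt, 避免整表载入内存(示意代码)。"""
    opener = gzip.open if csv_path.endswith('.gz') else open
    with opener(csv_path, 'rt', encoding='utf-8') as fin, \
         open(txt_path, 'w', encoding='utf-8') as fout:
        for row in csv.DictReader(fin):
            # 缺失值写为空行, 与 fillna('') 的效果一致
            fout.write((row.get(column) or '') + '\n')
```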
<br>
<h3 id="22-配置-cntext-环境">2.2 配置 cntext 环境</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="23-开始训练">2.3 开始训练</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;mda01-24.txt&#39;</span><span class="p">,</span> <span class="c1"># 语料文件</span>
                        <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>             <span class="c1"># 中文语料</span>
                        <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>            <span class="c1"># 嵌入的维度数</span>
                        <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>             <span class="c1"># 词语上下文的窗口大小</span>



<span class="n">glove_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;mda01-24.txt&#39;</span><span class="p">,</span>
                       <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                       <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                       <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/mda01-24_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|█| 27404772/27404772 [04:38&lt;00:00, 9
Reading Preprocessed Corpus from output/mda01-24_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 1625 s. 
Output Saved To: output/mda01-24-Word2Vec.200.15.bin


Mac(Linux) System, Enable Parallel Processing
Cache output/mda01-24_cache.txt Found, Skip Preprocessing Corpus
Start Training GloVe
BUILDING VOCABULARY
Using vocabulary of size 536863.

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 353975745 lines.

Using random seed 1746091798
SHUFFLING COOCCURRENCES
Merging temp files: processed 353975745 lines.

TRAINING MODEL
Read 353975745 lines.
Using random seed 1746091864
05/01/25 - 05:32.08PM, iter: 001, cost: 0.115862
05/01/25 - 05:33.04PM, iter: 002, cost: 0.082325
05/01/25 - 05:34.00PM, iter: 003, cost: 0.070848
......
......
05/01/25 - 05:43.23PM, iter: 013, cost: 0.050617
05/01/25 - 05:44.19PM, iter: 014, cost: 0.050079
05/01/25 - 05:45.16PM, iter: 015, cost: 0.049582

GloVe Training Cost 1366 s. 
Output Saved To: output/mda01-24-GloVe.200.15.bin
CPU times: user 1h 28min 19s, sys: 2min 6s, total: 1h 30min 26s
Wall time: 49min 55s
</code></pre></div><p>经过约 1.5 小时，训练得到基于中国 A 股管理层讨论与分析语料的 Word2Vec 和 GloVe 词向量模型(如下截图)，可广泛用于经济管理等领域概念(情感)词典的构建或扩展。</p>
<ul>
<li><strong>mda01-24_cache.txt</strong> 缓存文件</li>
<li><strong>mda01-24-Word2Vec.200.15.bin</strong> Word2Vec 模型文件</li>
<li><strong>mda01-24-GloVe.200.15.bin</strong> GloVe 模型文件</li>
</ul>
<p><img loading="lazy" src="img/pretained-screen.png" alt=""  />
</p>
<br>
<br>
<h2 id="三使用模型">三、使用模型</h2>
<h3 id="31-导入模型">3.1 导入模型</h3>
<p>使用 <strong><em>ct.load_w2v(w2v_path)</em></strong> 分别导入刚刚训练好的 <strong><em>mda01-24-Word2Vec.200.15.bin</em></strong> 和 <strong><em>mda01-24-GloVe.200.15.bin</em></strong> 模型</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>

<span class="n">w2v_model</span>   <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/mda01-24-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">glove_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/mda01-24-GloVe.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">w2v_model</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.6
Loading output/mda01-24-Word2Vec.200.15.bin...
Loading output/mda01-24-GloVe.200.15.bin...
&lt;gensim.models.keyedvectors.KeyedVectors at 0x633060fe0&gt;
</code></pre></div><br>
<h3 id="32-评估模型">3.2 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<p>以 word2vec 为例，评估模型表现</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   425    |    112     |            0.42            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   455    |    222     |   31.21    |   4.30   |
|   CityInProvince   |   175    |     0      |   97.71    |   1.26   |
| FamilyRelationship |    90    |    182     |   10.00    |   5.89   |
|   SocialScience    |    9     |     61     |   44.44    |   4.50   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型表现越好。</p>
<br>
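<p>近义测试的原理是：对人工标注的词对相似度与模型给出的余弦相似度分别取秩，再计算秩相关。无并列秩时，Spearman 系数为 rho = 1 - 6*Σd²/(n(n²-1))，用纯 Python 示意如下（演示用实现，非 cntext 源码）：</p>

```python
def spearman_rho(xs, ys):
    """无并列值时的 Spearman 秩相关: rho = 1 - 6*sum(d^2) / (n*(n^2-1))"""
    def ranks(vals):
        # 按取值从小到大给每个位置赋秩(1 起)
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```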
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries 中文 md&amp;a 语料在此项表现较差，应该是语料中常见国家首都的提及较少，也侧面体现了多数企业国际化程度不高。盲猜美股语料的 CapitalOfCountries 表现应该好于 A 股。</li>
<li>CityInProvince 中文 md&amp;a 语料在此项表现如此优异，说明 A 股多数企业扎根于中国大地，年报 md&amp;a 中相关提及次数很多。</li>
<li>FamilyRelationship 中文 md&amp;a 语料主要体现公司组织层面的内容，较少提及家庭关系词语，所以该类别表现一般也很容易理解。</li>
<li>SocialScience 中文 md&amp;a 语料在此项表现一般，应该是语料中常见的社会科学词语提及较少。</li>
</ul>
<p>整体而言，语料训练的效果很不错，抓住了数据场景的独特性语义。</p>
<br>
<h3 id="33-keyedvectors-的操作方法或属性">3.3 KeyedVectors 的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>KeyedVectors.index_to_key</em></strong></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.key_to_index</em></strong></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.vector_size</em></strong></td>
<td>获取模型中词向量的维度数。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.get_vector(word)</em></strong></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_word(word, topn=10)</em></strong></td>
<td>获取某词语最相似的 10 个近义词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_vector(vector, topn=10)</em></strong></td>
<td>获取词向量最相似的 10 个近义词。</td>
</tr>
</tbody>
</table>
<h3 id="34-查看词汇量维度数">3.4 查看词汇量&amp;维度数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 词汇量</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Word2Vec词汇量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;GloVe词汇量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">glove_model</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Word2Vec维度数: &#39;</span><span class="p">,</span> <span class="n">w2v_model</span><span class="o">.</span><span class="n">vector_size</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;GloVe维度数: &#39;</span><span class="p">,</span> <span class="n">glove_model</span><span class="o">.</span><span class="n">vector_size</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Word2Vec词汇量:  902666
GloVe词汇量:     536864
Word2Vec维度数:  200
GloVe维度数:     200
</code></pre></div><br>
<h3 id="35-词表">3.5 词表</h3>
<p>查看词表</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v_model</span><span class="o">.</span><span class="n">index_to_key</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;公司&#39;,
 &#39;适用&#39;,
 &#39;情况&#39;,
 &#39;项目&#39;,
 &#39;产品&#39;,
 ...
 &#39;比上&#39;,
 &#39;境内&#39;,
 &#39;最终&#39;,
 &#39;启动&#39;,
 ...]
</code></pre></div><br>
<p>查看词汇映射表</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v_model</span><span class="o">.</span><span class="n">key_to_index</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;公司&#39;: 0,
 &#39;适用&#39;: 1,
 &#39;情况&#39;: 2,
 &#39;项目&#39;: 3,
 &#39;产品&#39;: 4,
 ......
 &#39;比上&#39;: 996,
 &#39;境内&#39;: 997,
 &#39;最终&#39;: 998,
 &#39;启动&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="36-查看词向量">3.6 查看词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 查询某词的词向量</span>
<span class="n">w2v_model</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;创新&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 2.2048666 ,  1.7392347 ,  3.8569732 ,  2.181498  ,  0.49182096,
       -2.0054908 ,  0.55133677,  0.97385484,  3.6563325 , -2.1495004 ,
       -4.8804154 ,  2.8375697 ,  2.071349  ,  3.0867636 , -1.3978149 ,
       -0.38058507, -2.379905  , -1.8974878 ,  3.596266  ,  0.44742537,
        ......
        0.13521506, -0.78970003, -0.8154422 ,  1.015166  ,  0.30753416,
       -6.1991196 , -2.2295246 ,  0.797445  , -0.21968505,  1.6549479 ,
       -1.1522037 , -1.5377268 , -3.4639692 , -3.3877385 ,  3.5285642 ,
        0.9497059 , -2.6022844 ,  1.6192312 , -0.39254257, -0.5094183 ],
      dtype=float32)
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 查询多个词的词向量</span>
<span class="n">w2v_model</span><span class="o">.</span><span class="n">get_mean_vector</span><span class="p">([</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 0.01322632,  0.01596442,  0.08699574,  0.06786569,  0.00441768,
       -0.04059787,  0.01970061,  0.02050735,  0.04548474, -0.01610814,
       -0.10554063,  0.08021796,  0.10255495,  0.06383747, -0.07158516,
        0.00185056, -0.02854855, -0.09506228,  0.1032301 , -0.05448814,
       ......
       -0.01035122, -0.02931183, -0.03785197,  0.04421834,  0.04357708,
       -0.15989086, -0.05572033,  0.02324059, -0.08414906,  0.02760434,
        0.01254621, -0.02324901, -0.05535778, -0.06064604,  0.0409652 ,
       -0.04119795, -0.08222105,  0.03998823, -0.03626942, -0.01975589],
      dtype=float32)

</code></pre></div><br>
<h3 id="37-近义词">3.7 近义词</h3>
<p>根据词语查找最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v_model</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;技术创新&#39;, 0.6993309855461121),
 (&#39;不断创新&#39;, 0.6758015155792236),
 (&#39;创新型&#39;, 0.636788547039032),
 (&#39;创新能力&#39;, 0.6053606271743774),
 (&#39;引领&#39;, 0.604947566986084),
 (&#39;硬核&#39;, 0.5690070986747742),
 (&#39;前沿&#39;, 0.5627986788749695),
 (&#39;赋能&#39;, 0.5582684278488159),
 (&#39;创新性&#39;, 0.5509947538375854),
 (&#39;革新&#39;, 0.5494255423545837)]
</code></pre></div><br>
<p>根据某词的词向量查询最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">creativeness_vector</span> <span class="o">=</span> <span class="n">w2v_model</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;创新&#39;</span><span class="p">)</span>
<span class="n">w2v_model</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">creativeness_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;创新&#39;, 1.0),
 (&#39;技术创新&#39;, 0.6993309855461121),
 (&#39;不断创新&#39;, 0.6758015155792236),
 (&#39;创新型&#39;, 0.636788547039032),
 (&#39;创新能力&#39;, 0.6053606271743774),
 (&#39;引领&#39;, 0.6049476265907288),
 (&#39;硬核&#39;, 0.5690070986747742),
 (&#39;前沿&#39;, 0.5627986788749695),
 (&#39;赋能&#39;, 0.5582684278488159),
 (&#39;创新性&#39;, 0.5509947538375854)]
</code></pre></div><br>
<p>多个词求得均值向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">AI_vector</span> <span class="o">=</span> <span class="n">w2v_model</span><span class="o">.</span><span class="n">get_mean_vector</span><span class="p">([</span><span class="s1">&#39;ai&#39;</span><span class="p">,</span>  <span class="s1">&#39;机器学习&#39;</span><span class="p">,</span> <span class="s1">&#39;人工智能&#39;</span><span class="p">,</span> <span class="s1">&#39;自然语言处理&#39;</span><span class="p">])</span>
<span class="n">w2v_model</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">AI_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;ai&#39;, 0.9074109792709351),
 (&#39;机器学习&#39;, 0.8809980750083923),
 (&#39;自然语言处理&#39;, 0.8750396966934204),
 (&#39;ai模型&#39;, 0.8575210571289062),
 (&#39;人工智能&#39;, 0.8506893515586853),
 (&#39;nlp&#39;, 0.8240388035774231),
 (&#39;语言模型&#39;, 0.8206671476364136),
 (&#39;模态模型&#39;, 0.8144882917404175),
 (&#39;深度学习&#39;, 0.7912176847457886),
 (&#39;生成式&#39;, 0.7850476503372192),
 (&#39;自然语言&#39;, 0.7846022248268127),
 (&#39;llm&#39;, 0.7809537649154663),
 (&#39;大模&#39;, 0.7670232653617859),
 (&#39;gpt&#39;, 0.7638874053955078),
 (&#39;自然语言理解&#39;, 0.7441188097000122),
 (&#39;知识图谱&#39;, 0.7421959638595581),
 (&#39;生成式ai&#39;, 0.7387682199478149),
 (&#39;aigc&#39;, 0.7381091117858887),
 (&#39;ai算法&#39;, 0.7311530709266663),
 (&#39;语音识别&#39;, 0.7257674932479858)]
</code></pre></div><br>
<p>短视主义词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">short_term_vector</span> <span class="o">=</span> <span class="n">w2v_model</span><span class="o">.</span><span class="n">get_mean_vector</span><span class="p">([</span><span class="s1">&#39;尽快&#39;</span><span class="p">,</span>  <span class="s1">&#39;年内&#39;</span><span class="p">,</span> <span class="s1">&#39;马上&#39;</span><span class="p">])</span>
<span class="n">w2v_model</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">short_term_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;尽快&#39;, 0.7294592261314392),
 (&#39;年内&#39;, 0.7279667854309082),
 (&#39;尽早&#39;, 0.6742831468582153),
 (&#39;马上&#39;, 0.6565427184104919),
 (&#39;即将&#39;, 0.61030113697052),
 (&#39;早日&#39;, 0.6024956107139587),
 (&#39;争取早日&#39;, 0.5442042946815491),
 (&#39;争取尽早&#39;, 0.5283723473548889),
 (&#39;抓紧&#39;, 0.5254929661750793),
 (&#39;争取&#39;, 0.5205905437469482),
 (&#39;短时间&#39;, 0.5205082297325134),
 (&#39;争取尽快&#39;, 0.5160724520683289),
 (&#39;按期&#39;, 0.51212477684021),
 (&#39;后续&#39;, 0.5105950236320496),
 (&#39;力争早日&#39;, 0.5102716684341431),
 (&#39;提前&#39;, 0.5060917139053345),
 (&#39;力争&#39;, 0.4955942928791046),
 (&#39;力争尽早&#39;, 0.4942554235458374),
 (&#39;最后&#39;, 0.4882470369338989),
 (&#39;立即&#39;, 0.4858567416667938)]
</code></pre></div><p><br><br></p>
<h2 id="四扩展词典">四、扩展词典</h2>
<p>做词典法的文本分析，最重要的是有自己的领域词典。之前受限于技术难度，文科生的我也一直在用以形容词为主的通用情感词典。现在依托 word2vec 技术，可以提升人工构建领域词典的效率和准确率。</p>
<p>下面是在 <strong><em>mda01-24-Word2Vec.200.15.bin</em></strong> 上做的词典扩展测试，函数 <strong><em>ct.expand_dictionary(wv, seeddict, topn=100)</em></strong> 会根据种子词选取语义最接近的 topn 个候选词。</p>
<ul>
<li><strong><em>wv</em></strong> 预训练模型，数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong><em>seeddict</em></strong> 种子词词典；格式为 Python 字典；</li>
<li><strong><em>topn</em></strong> 返回 topn 个语义最接近 seeddict 的词，默认 100。</li>
</ul>
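<p>从原理上看，<strong><em>expand_dictionary</em></strong> 的内部逻辑大致相当于：对每组种子词求均值向量，再按余弦相似度在词表中检索最接近的 topn 个非种子词。下面用 numpy 写一个极简示意（玩具词向量，仅为说明算法思路，并非 cntext 源码）：</p>

```python
import numpy as np

# 玩具词向量模型: 词 -> 向量 (实际为 gensim KeyedVectors, 维度通常为 200)
toy_wv = {
    '创新': np.array([1.0, 0.1]),
    '研发': np.array([0.9, 0.2]),
    '科技': np.array([0.8, 0.0]),
    '竞争': np.array([-0.9, 0.5]),
}

def expand(seeds, topn=2):
    """对种子词取均值向量, 按余弦相似度返回最接近的 topn 个非种子词"""
    mean = np.mean([toy_wv[w] for w in seeds], axis=0)
    def cos(v):
        return np.dot(mean, v) / (np.linalg.norm(mean) * np.linalg.norm(v))
    # 全词表按与均值向量的余弦相似度从高到低排序, 剔除种子词本身
    ranked = sorted(toy_wv, key=lambda w: cos(toy_wv[w]), reverse=True)
    return [w for w in ranked if w not in seeds][:topn]

candidates = expand(['创新'], topn=2)
print(candidates)
```

<p>在真实模型上，检索范围是全部词汇表（本例近 28 万词），因此种子词的质量直接决定候选词表的质量。</p>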
<br>
<p>假设现在有种子词 seeddicts， 内含我构建的 <strong><em>短视词</em></strong>、 <strong><em>创新词</em></strong>、 <strong><em>竞争词</em></strong>， 我希望生成最终各含 30 个词的候选词表 txt 文件。</p>
<p>可以使用 <strong><em>ct.expand_dictionary</em></strong> 进行如下操作</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">seeddicts</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;短视词&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;抓紧&#39;</span><span class="p">,</span> <span class="s1">&#39;立刻&#39;</span><span class="p">,</span> <span class="s1">&#39;月底&#39;</span><span class="p">,</span> <span class="s1">&#39;年底&#39;</span><span class="p">,</span> <span class="s1">&#39;年终&#39;</span><span class="p">,</span> <span class="s1">&#39;争取&#39;</span><span class="p">,</span> <span class="s1">&#39;力争&#39;</span><span class="p">],</span>
    <span class="s1">&#39;创新词&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;科技&#39;</span><span class="p">,</span>  <span class="s1">&#39;研发&#39;</span><span class="p">,</span>  <span class="s1">&#39;技术&#39;</span><span class="p">,</span> <span class="s1">&#39;标准&#39;</span><span class="p">],</span>
    <span class="s1">&#39;竞争词&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;竞争&#39;</span><span class="p">,</span> <span class="s1">&#39;竞争力&#39;</span><span class="p">],</span>
    <span class="p">}</span>

<span class="n">ct</span><span class="o">.</span><span class="n">expand_dictionary</span><span class="p">(</span><span class="n">wv</span> <span class="o">=</span> <span class="n">w2v_model</span><span class="p">,</span>
                     <span class="n">seeddict</span> <span class="o">=</span> <span class="n">seeddicts</span><span class="p">,</span>
                     <span class="n">topn</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Finish! 短视词 candidates saved to output/短视词.txt
Finish! 创新词 candidates saved to output/创新词.txt
Finish! 竞争词 candidates saved to output/竞争词.txt
</code></pre></div><p><img loading="lazy" src="img/03-expand.jpg" alt=""  />
</p>
<p><br><br></p>
<h2 id="六获取模型">六、获取模型</h2>
<p>内容创作不易，本文为付费内容，以下资源免费提供：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费     mda01-24-Word2Vec.200.15.bin   链接: https://pan.baidu.com/s/1Gke4UKOnswpctp8vsZ0koQ?pwd=dpry

- 免费     mda01-24-GloVe.200.15.bin 链接: https://pan.baidu.com/s/1TqoA4TqMAhLzpIp0ZvrQEA?pwd=ajjw

- 更多免费词向量      https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
</code></pre></div><p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<p>相关文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[0]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[1]冉雅璇,李志强,刘佳妮,张逸石.大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用[J].南开管理评论:1-27.
[3]胡楠,薛付婧,王昊楠.管理者短视主义影响企业长期投资吗？——基于文本分析和机器学习[J].管理世界,2021,37(05):139-156+11+19-21.
[4]Kai Li, Feng Mai, Rui Shen, Xinyan Yan, Measuring Corporate Culture Using Machine Learning, *The Review of Financial Studies*,2020
</code></pre></div><br>
<ul>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)文本挖掘文献汇总</a></li>
<li><a href="https://textdata.cn/blog/text_analysis_code_list_about_ms/">LIST | 文本分析代码汇总</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总</a></li>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">Python 实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext2.x 使用手册</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">词向量 | 使用1985年-2025年专利申请摘要训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>文化几何学：通过词嵌入分析反映文本背后的社会文化(变迁)</title>
      <link>https://textdata.cn/blog/2025-04-23-word-embedding-reflect-human-attitude/</link>
      <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-04-23-word-embedding-reflect-human-attitude/</guid>
      <description>中文语料预训练模型列表， 使用 cntext2.x 训练出的预训练语言模型， 主要分 GloVe 和 Word2Vec 两种。</description>
<content:encoded><![CDATA[<p>人类在留下语言、文字的过程中，也留下了偏见、态度等主观认知信息。词嵌入作为一种词向量模型，隐含着上下文的情景信息，态度及偏见很容易保留在词向量的某些维度中。通过测算词向量间的距离，就可以间接测得不同群体对某概念(组织、群体、品牌、地域等)的态度偏见。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载 | 大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
</ul>
<p>虽然现在LLM(大语言模型)很火，但其底层架构Transformer正是基于词向量(分布式语义表示)发展而来。LLM虽然能生成流畅的文本，但其&quot;黑箱&quot;特性使得我们难以直接分析其中蕴含的社会偏见。相比之下，传统的词嵌入模型(维度通常在50-300之间)虽然维度较高难以直观理解，但通过线性代数等数学工具，我们可以精确测量和分析词向量空间中的文化偏见和态度倾向。</p>
<p>如下图所示，在大众点评语料的词向量中蕴含着一些文化(态度或刻板印象)。如提起<strong>旅行</strong>这件事，大家脑海里首先想到的是一群年轻女性探索有趣的世界，世界那么大我想去看看。而<strong>高尔夫球</strong>，在大家认知里是一群男性老板通过该活动社交谈生意。</p>
<p><img loading="lazy" src="img/04-hobby.png" alt=""  />
</p>
<br>
<h2 id="一文献">一、文献</h2>
<p>这篇文献发表较早，但其算法思路至今仍很有启发。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Kozlowski, Austin C., Matt Taddy, and James A. Evans. &#34;The geometry of culture: Analyzing the meanings of class through word embeddings.&#34; American Sociological Review 84, no. 5 (2019): 905-949.
</code></pre></div><h3 id="摘要">摘要</h3>
<p>本文论证词嵌入模型是研究文化的有效工具，并以历史语料中对社会阶级的共同理解作为实证案例。词嵌入模型通过将语义关系表示为高维空间中的向量关系，提供了一种与当代文化理论一致的关系模型。在这些空间中，词差异（如“富裕-贫穷”）所诱导的维度对应于文化意义的维度，而词在这些维度上的投影反映了广泛共享的文化关联——我们通过调查验证了这一点。通过分析过去一百年来出版的数百万本书籍的文本，我们发现阶级的标志在经济转型中不断变化，但阶级的基本文化维度保持显著稳定。值得注意的是，教育成为与富裕紧密相关的因素，独立于其与高雅品味的关联。</p>
<br>
<h3 id="研究目的">研究目的</h3>
<p>验证词嵌入模型是否能够准确捕捉文化维度（如富裕、性别、种族等），并通过与人类评估的文化关联数据进行对比，证明其在社会学分析中的有效性。</p>
<br>
<h3 id="研究设计">研究设计</h3>
<ol>
<li><strong>语料准备</strong>:  基于Google Ngram语料库训练，包含1900-2012年间出版的数百万本书籍的文本数据。</li>
<li><strong>词嵌入模型</strong>: 基于语料训练词嵌入模型(GloVe 或 Word2Vec)，这篇文章使用的是 Word2Vec。词嵌入算法将文本中的单词表示为高维空间中的向量，这些向量基于单词在文本中的上下文关系，共享相似上下文的单词在空间中位置相近。</li>
<li><strong>文化维度的识别</strong>: 通过计算多个反义词对差向量的平均值来识别文化维度。例如，对“rich”“poor”等多组反义词对求差向量并取平均，构建“富裕-贫穷”这一文化维度。</li>
<li><strong>词向量投影</strong>​​: 将单词对应的词向量投影到特定的概念轴（文化维度，如性别、财富）向量上，计算其在该维度上的关联强度。 投影值通过余弦相似度衡量，正值表示与某一文化维度（如富裕）的正向关联，负值表示负向关联。</li>
<li><strong>验证方法</strong>​​: 要求受访者对59个词汇（如“banker”“jazz”“nurse”）在三个维度（阶级、种族、性别）上进行评分(0~100)。例如“在0到100的范围内，您认为‘芭蕾舞’在多大程度上属于上层阶级？”； 通过调查数据验证词嵌入模型在捕捉文化关联方面的有效性。比较词嵌入模型与人类评估的文化关联数据，计算Pearson相关系数。</li>
<li>​<strong>静态​结果可视化​</strong>​: 通过图表展示运动词在不同文化维度(性别维度、财富维度)上的投影结果。</li>
</ol>
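<p>其中第 5 步的验证思路可以用一个极简示意来复现：把各词在某文化维度上的投影值，与受访者对同一批词的打分做 Pearson 相关。以下为随意构造的玩具数据，并非论文原始数据：</p>

```python
import numpy as np

# 玩具数据: 假设 5 个词在“富裕-贫穷”维度上的投影值
embedding_proj = np.array([0.31, -0.12, 0.05, 0.44, -0.30])
# 受访者对同一批词的打分 (0~100, 越高越“上层阶级”)
survey_score = np.array([78.0, 35.0, 52.0, 90.0, 20.0])

# Pearson 相关系数, 衡量词嵌入关联与人类评估的一致程度
r = np.corrcoef(embedding_proj, survey_score)[0, 1]
print(round(r, 3))
```

<p>原文报告的相关系数在 0.53 到 0.90 之间，说明词嵌入捕捉到的文化关联与人类直觉高度一致。</p>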
<p>以上步骤证明了词嵌入投影算法捕捉人类社会文化(文本中蕴含的文化线索)的有效性。接下来按每10年构建一个语料(1900~2010)，训练出不同年代的词向量，考察财富维度与六种阶级维度(教育、培养、地位、道德、职业、性别)的余弦相似度关系。下图是六个维度的正反义词对。</p>
<p><img loading="lazy" src="img/01-word-pairs.png" alt=""  />
</p>
<br>
<h3 id="研究结果">研究结果</h3>
<ol>
<li><strong>​文化维度的有效性​</strong>​: 词嵌入模型在捕捉文化关联方面表现出色，与人类评估的相关系数在0.53到0.90之间。性别维度的关联最强，种族维度的关联较弱。</li>
<li><strong>多维度的阶级结构</strong>​​: 阶级的文化维度形成了一个复杂但稳定的语义结构，包括财富、地位、教育、道德等维度。这些维度在高维空间中相互关联，无法通过低维空间准确表示。</li>
<li><strong>社会阶级的文化维度演变​</strong>​: 分析结果显示，社会阶级的文化维度在二十世纪保持稳定，但具体的文化标记（如职业名称）发生了显著变化。教育和富裕之间的关联逐渐增强，成为阶级划分的重要标志。
<img loading="lazy" src="img/02-apa-proj.png" alt=""  />

<img loading="lazy" src="img/03-evolution.png" alt=""  />
</li>
</ol>
<p><br><br></p>
<h2 id="二实验准备">二、实验准备</h2>
<h3 id="21-训练模型">2.1 训练模型</h3>
<p>本文使用 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.x</a> 训练预训练语言模型，具体训练过程可参考以下文章。</p>
<p>不考虑时间(语义演变)，只训练一个模型：</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec/">词向量 | 使用1亿B站用户签名训练word2vec词向量</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">词向量 | 使用Stanford Glove代码训练中文语料的Glove模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">词向量 | 使用1985年-2025年专利申请摘要训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用 MD&amp;A2001-2023 语料训练 Word2Vec/GloVe 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型</a></li>
</ul>
<br>
<p>考虑时间因素， 按某个时间间隔(如每10年)，训练一个年代向量</p>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a></p>
<br>
<p>如果觉得训练太麻烦， 大邓将已经训练好的模型免费提供给大家。
<a href="https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings">免费资源 | cntext2.x 训练出的免费公开词向量</a></p>
<h3 id="22-读取模型">2.2 读取模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext --upgrade</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 模型下载地址</span>
<span class="c1"># https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings</span>
<span class="n">wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;大众点评-评论-GloVe.200.15.bin&#39;</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="23-获取词向量">2.3 获取词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;模型词汇量: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">wv</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">模型词汇量: 278565
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">wv</span><span class="p">[</span><span class="s1">&#39;富有&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">wv</span><span class="p">[</span><span class="s1">&#39;富有&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(200,)
array([-0.52744 , -0.108866, -0.119827, -0.644396, -0.342953, -0.503506,
       -0.453796, -0.213651,  0.041335,  0.345231,  0.4752  , -0.026904,
       -0.026971, -0.249429, -1.115758,  0.351041, -0.304552,  0.40272 ,
       ......
       ......
       -0.061966,  0.384454,  0.280508, -0.005171, -0.236791,  0.171627,
        0.151691, -0.295215,  0.233423, -0.146419, -0.210322, -0.338783,
        0.214728, -0.101312,  0.489487, -0.257294,  0.732999,  0.057721,
       -0.286473,  0.394552], dtype=float32)
</code></pre></div><p>词向量的维度是200，即每个词的语义是由200个数字组成的向量所表示。</p>
<br>
<h3 id="24-计算概念轴向量">2.4 计算概念轴向量</h3>
<p>概念轴向量如何计算呢？以性别为例：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1. 先找出性别(男、女)的正反义词对儿
2. 分别计算正词的多个向量、负词的多个词向量
3. 求得正均值向量、负均值向量
4. 两者相减、归一化处理后得到性别概念向量。  
</code></pre></div><p>大邓将这些步骤封装到了cntext2.x中，只需要将词语传入即可。</p>
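<p>上述四个步骤用 numpy 手工实现大致如下（示意代码，词向量为随意构造的玩具数据；实际应从预训练模型中取 200 维词向量）：</p>

```python
import numpy as np

# 玩具词向量 (实际为预训练模型中的 200 维向量)
wv_toy = {
    '男人': np.array([1.0, 0.2]), '丈夫': np.array([0.9, 0.1]),
    '女人': np.array([-1.0, 0.3]), '妻子': np.array([-0.8, 0.2]),
}

def concept_axis(wv, poswords, negwords):
    # 步骤1&2: 取正词、负词的词向量; 步骤3: 分别求均值向量
    pos_mean = np.mean([wv[w] for w in poswords], axis=0)
    neg_mean = np.mean([wv[w] for w in negwords], axis=0)
    # 步骤4: 两者相减并归一化, 得到单位长度的概念轴向量
    axis = pos_mean - neg_mean
    return axis / np.linalg.norm(axis)

gender_axis = concept_axis(wv_toy, ['男人', '丈夫'], ['女人', '妻子'])
print(gender_axis.shape)
```

<p>归一化保证概念轴是单位向量，后续投影值的大小只反映被投影词本身，便于跨维度比较。</p>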
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 性别概念轴向量</span>
<span class="n">gender_poss</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;男性&#39;</span><span class="p">,</span> <span class="s1">&#39;丈夫&#39;</span><span class="p">,</span> <span class="s1">&#39;他&#39;</span><span class="p">,</span> <span class="s1">&#39;爷爷&#39;</span><span class="p">,</span> <span class="s1">&#39;祖父&#39;</span><span class="p">,</span> <span class="s1">&#39;爸爸&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">,</span> <span class="s1">&#39;儿子&#39;</span><span class="p">,</span> <span class="s1">&#39;兄弟&#39;</span><span class="p">]</span>
<span class="n">gender_negs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;女性&#39;</span><span class="p">,</span> <span class="s1">&#39;妻子&#39;</span><span class="p">,</span> <span class="s1">&#39;她&#39;</span><span class="p">,</span> <span class="s1">&#39;奶奶&#39;</span><span class="p">,</span> <span class="s1">&#39;祖母&#39;</span><span class="p">,</span> <span class="s1">&#39;妈妈&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">,</span> <span class="s1">&#39;女儿&#39;</span><span class="p">,</span> <span class="s1">&#39;姐妹&#39;</span><span class="p">]</span>
<span class="n">gender_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span> 
                                         <span class="n">poswords</span><span class="o">=</span><span class="n">gender_poss</span><span class="p">,</span> 
                                         <span class="n">negwords</span><span class="o">=</span><span class="n">gender_negs</span><span class="p">)</span>


<span class="c1"># 财富概念轴向量</span>
<span class="n">affluence_poss</span> <span class="o">=</span><span class="p">[</span><span class="s1">&#39;富有&#39;</span><span class="p">,</span> <span class="s1">&#39;有钱&#39;</span><span class="p">,</span> <span class="s1">&#39;成功&#39;</span><span class="p">,</span> <span class="s1">&#39;发达&#39;</span><span class="p">,</span> <span class="s1">&#39;富裕&#39;</span><span class="p">,</span> <span class="s1">&#39;优势&#39;</span><span class="p">,</span> <span class="s1">&#39;高贵&#39;</span><span class="p">,</span> <span class="s1">&#39;高端&#39;</span><span class="p">,</span> <span class="s1">&#39;昂贵&#39;</span><span class="p">,</span> <span class="s1">&#39;华丽&#39;</span><span class="p">,</span> <span class="s1">&#39;精致&#39;</span><span class="p">,</span> <span class="s1">&#39;奢侈&#39;</span><span class="p">,</span> <span class="s1">&#39;奢华&#39;</span><span class="p">,</span> <span class="s1">&#39;充裕&#39;</span><span class="p">,</span> <span class="s1">&#39;豪华&#39;</span><span class="p">]</span>
<span class="n">affluence_negs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;贫穷&#39;</span><span class="p">,</span> <span class="s1">&#39;没钱&#39;</span><span class="p">,</span> <span class="s1">&#39;失败&#39;</span><span class="p">,</span> <span class="s1">&#39;落后&#39;</span><span class="p">,</span> <span class="s1">&#39;贫困&#39;</span><span class="p">,</span> <span class="s1">&#39;劣势&#39;</span><span class="p">,</span> <span class="s1">&#39;卑贱&#39;</span><span class="p">,</span> <span class="s1">&#39;低端&#39;</span><span class="p">,</span> <span class="s1">&#39;廉价&#39;</span><span class="p">,</span> <span class="s1">&#39;朴素&#39;</span><span class="p">,</span> <span class="s1">&#39;粗糙&#39;</span><span class="p">,</span> <span class="s1">&#39;廉价&#39;</span><span class="p">,</span> <span class="s1">&#39;节俭&#39;</span><span class="p">,</span> <span class="s1">&#39;匮乏&#39;</span><span class="p">,</span> <span class="s1">&#39;破旧&#39;</span><span class="p">]</span>
<span class="n">affluence_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span> 
                                            <span class="n">poswords</span><span class="o">=</span><span class="n">affluence_poss</span><span class="p">,</span> 
                                            <span class="n">negwords</span><span class="o">=</span><span class="n">affluence_negs</span><span class="p">)</span>

<span class="c1"># 查看性别概念轴向量</span>
<span class="nb">print</span><span class="p">(</span><span class="n">gender_vector</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">gender_vector</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(200,)
[-0.11656909 -0.19618881 -0.01077267  0.04915987  0.00569247  0.05462526
 -0.009799    0.00770712  0.05658354  0.04547084  0.03688154 -0.02133968
 -0.0706896   0.08739712  0.11174724 -0.02057768  0.03183764  0.01165388
  ......
  ......
  0.0101583   0.09426635 -0.09078085 -0.13099451 -0.02234778  0.03765206
  0.1083525   0.07751778  0.04983377  0.03304265 -0.05442946  0.11609897
 -0.10463558  0.00224418  0.00210647 -0.04888193  0.01931083  0.07366373
 -0.01534469  0.06682201]
</code></pre></div><p>注意:</p>
<ol>
<li>词向量、概念轴向量的维度是相同的，在本文案例中都是200。</li>
<li>注意概念正反义词对方向的确定，方向决定了对计算结果正负号的解读。例如性别概念轴，将男性词确定为正向词，任意词的词向量与性别概念轴计算投影(或余弦相似度)时，数值越大，说明该词与男性的相关性越大。</li>
</ol>
<h3 id="25-计算投影">2.5 计算投影</h3>
<p>cntext2.x封装了投影计算，只需要传入词语或词向量即可。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">project_word</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">cosine</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p>在向量空间中， 计算词语 a 在词语 b 上的投影。</p>
<ul>
<li><strong>wv</strong> 预训练词向量模型，数据类型为 gensim.models.keyedvectors.KeyedVectors</li>
<li><strong>a</strong> 词语，字符串或词语列表</li>
<li><strong>b</strong> 词语字符串、词语列表、或某概念向量</li>
<li><strong>cosine</strong> 是否使用余弦相似度， 默认为False， 函数计算结果为a在b上的投影值。 如果为True， 函数计算结果为a与b的余弦相似度。</li>
</ul>
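<p>投影与余弦相似度的数学含义可以写成一个极简函数（示意实现，并非 cntext 源码）：投影值为 a·b/|b|，余弦相似度为 a·b/(|a||b|)：</p>

```python
import numpy as np

def project(a_vec, b_vec, cosine=False):
    # cosine=False: 返回 a 在 b 方向上的投影长度 a·b/|b|
    # cosine=True : 返回余弦相似度 a·b/(|a||b|), 取值范围 [-1, 1]
    dot = float(np.dot(a_vec, b_vec))
    if cosine:
        return dot / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec))
    return dot / float(np.linalg.norm(b_vec))

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])
print(project(a, b))               # 3.0
print(project(a, b, cosine=True))  # 0.6
```

<p>两者的区别在于是否对 a 的长度归一化：投影值会随 a 的模长变化，余弦相似度只反映方向上的接近程度。</p>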
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 获取词向量文件</span>
<span class="c1"># https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings</span>
<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;douban-movie-1000w-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>

<span class="n">b</span><span class="o">=</span><span class="s1">&#39;苗条&#39;</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;性感&#39;</span><span class="p">,</span><span class="s1">&#39;美丽&#39;</span><span class="p">,</span> <span class="s1">&#39;可爱&#39;</span><span class="p">,</span> <span class="s1">&#39;丑陋&#39;</span><span class="p">]:</span>
    <span class="n">proj</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s1">]在[</span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s1">]投影值: </span><span class="si">{</span><span class="n">proj</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>


<span class="n">b</span><span class="o">=</span><span class="s1">&#39;修长&#39;</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;性感&#39;</span><span class="p">,</span><span class="s1">&#39;美丽&#39;</span><span class="p">,</span> <span class="s1">&#39;可爱&#39;</span><span class="p">,</span> <span class="s1">&#39;丑陋&#39;</span><span class="p">]:</span>
    <span class="n">proj</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s1">]在[</span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s1">]投影值: </span><span class="si">{</span><span class="n">proj</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[性感]在[苗条]投影值: 14.172947883605957
[美丽]在[苗条]投影值: 7.0944623947143555
[可爱]在[苗条]投影值: 6.935092926025391
[丑陋]在[苗条]投影值: 1.235807180404663

[性感]在[修长]投影值: 14.599699974060059
[美丽]在[修长]投影值: 9.360642433166504
[可爱]在[修长]投影值: 4.740543842315674
[丑陋]在[修长]投影值: 4.010622501373291
</code></pre></div><p>可以看到，在豆瓣电影语料中，在[苗条、修长]两个维度的认知上都认为</p>
<ul>
<li>[性感]意味着身材最瘦长</li>
<li>[美丽]次之、[可爱]略显不那么修长苗条</li>
<li>[丑陋]意味着基本与[苗条、修长]无关，数值最小。</li>
</ul>
<br>
<h2 id="三实验可视化">三、实验可视化</h2>
<h3 id="31-静态可视化">3.1 静态可视化</h3>
<p>不考虑时间因素，将所有语料训练得出一个词向量， 在这个词向量基础上进行语义投影可视化。</p>
<p>这里用大众点评评论语料训练出的词向量为例，进行爱好词、品牌词、美食词在性别维度、财富维度的投影。看看这些词（爱好词、品牌词、美食词）是否体现出性别差异、财富差异。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="n">ct</span><span class="o">.</span><span class="n">matplotlib_chinese</span><span class="p">()</span> <span class="c1"># 确保中文显示</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s1">&#39;ggplot&#39;</span><span class="p">)</span>  <span class="c1"># 使用内置的 ggplot 风格作为基础</span>



<span class="c1"># ====== 用户已经完成的数据准备部分（假设已运行）======</span>
<span class="n">wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;大众点评-评论-GloVe.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">gender_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span>
                                          <span class="n">poswords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;男性&#39;</span><span class="p">,</span> <span class="s1">&#39;丈夫&#39;</span><span class="p">,</span> <span class="s1">&#39;他&#39;</span><span class="p">,</span> <span class="s1">&#39;爷爷&#39;</span><span class="p">,</span> <span class="s1">&#39;祖父&#39;</span><span class="p">,</span> <span class="s1">&#39;爸爸&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">,</span> <span class="s1">&#39;儿子&#39;</span><span class="p">,</span> <span class="s1">&#39;兄弟&#39;</span><span class="p">],</span>
                                          <span class="n">negwords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;女性&#39;</span><span class="p">,</span> <span class="s1">&#39;妻子&#39;</span><span class="p">,</span> <span class="s1">&#39;她&#39;</span><span class="p">,</span> <span class="s1">&#39;奶奶&#39;</span><span class="p">,</span> <span class="s1">&#39;祖母&#39;</span><span class="p">,</span> <span class="s1">&#39;妈妈&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">,</span> <span class="s1">&#39;女儿&#39;</span><span class="p">,</span> <span class="s1">&#39;姐妹&#39;</span><span class="p">])</span>
<span class="n">affluence_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span>
                                          <span class="n">poswords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;富有&#39;</span><span class="p">,</span> <span class="s1">&#39;有钱&#39;</span><span class="p">,</span> <span class="s1">&#39;成功&#39;</span><span class="p">,</span> <span class="s1">&#39;发达&#39;</span><span class="p">,</span> <span class="s1">&#39;富裕&#39;</span><span class="p">,</span> <span class="s1">&#39;优势&#39;</span><span class="p">,</span> <span class="s1">&#39;高贵&#39;</span><span class="p">,</span> <span class="s1">&#39;高端&#39;</span><span class="p">,</span> <span class="s1">&#39;昂贵&#39;</span><span class="p">,</span> <span class="s1">&#39;华丽&#39;</span><span class="p">,</span> <span class="s1">&#39;精致&#39;</span><span class="p">,</span> <span class="s1">&#39;奢侈&#39;</span><span class="p">,</span> <span class="s1">&#39;奢华&#39;</span><span class="p">,</span> <span class="s1">&#39;充裕&#39;</span><span class="p">,</span> <span class="s1">&#39;豪华&#39;</span><span class="p">],</span>
                                          <span class="n">negwords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;贫穷&#39;</span><span class="p">,</span> <span class="s1">&#39;没钱&#39;</span><span class="p">,</span> <span class="s1">&#39;失败&#39;</span><span class="p">,</span> <span class="s1">&#39;落后&#39;</span><span class="p">,</span> <span class="s1">&#39;贫困&#39;</span><span class="p">,</span> <span class="s1">&#39;劣势&#39;</span><span class="p">,</span> <span class="s1">&#39;卑贱&#39;</span><span class="p">,</span> <span class="s1">&#39;低端&#39;</span><span class="p">,</span> <span class="s1">&#39;廉价&#39;</span><span class="p">,</span> <span class="s1">&#39;朴素&#39;</span><span class="p">,</span> <span class="s1">&#39;粗糙&#39;</span><span class="p">,</span> <span class="s1">&#39;节俭&#39;</span><span class="p">,</span> <span class="s1">&#39;匮乏&#39;</span><span class="p">,</span> <span class="s1">&#39;破旧&#39;</span><span class="p">])</span>


<span class="n">words</span> <span class="o">=</span>  <span class="p">[</span><span class="s2">&#34;象棋&#34;</span><span class="p">,</span> <span class="s2">&#34;麻将&#34;</span><span class="p">,</span> <span class="s2">&#34;围棋&#34;</span><span class="p">,</span> <span class="s2">&#34;高尔夫&#34;</span><span class="p">,</span> <span class="s2">&#34;武术&#34;</span><span class="p">,</span> <span class="s2">&#34;潜水&#34;</span><span class="p">,</span> <span class="s2">&#34;书法&#34;</span><span class="p">,</span> <span class="s2">&#34;瑜伽&#34;</span><span class="p">,</span> <span class="s2">&#34;羽毛球&#34;</span><span class="p">,</span> <span class="s2">&#34;马术&#34;</span><span class="p">,</span> <span class="s2">&#34;网球&#34;</span><span class="p">,</span> <span class="s2">&#34;美妆&#34;</span><span class="p">,</span> <span class="s2">&#34;旅行&#34;</span><span class="p">]</span>
<span class="c1"># words =  [&#34;烧烤&#34;, &#34;寿司&#34;, &#34;牛排&#34;, &#34;白酒&#34;, &#34;啤酒&#34;, &#34;麻辣烫&#34;, &#34;汉堡&#34;, &#34;煎饼&#34;, &#34;包子&#34;, &#34;小米粥&#34;, &#34;沙拉&#34;, &#34;披萨&#34;]</span>
<span class="c1"># words =  [&#34;阿玛尼&#34;, &#34;coach&#34;, &#34;lv&#34;, &#34;耐克&#34;, &#34;阿迪&#34;, &#34;爱马仕&#34;, &#34;优衣库&#34;, &#34;海澜&#34;]</span>

<span class="n">gender_proj</span> <span class="o">=</span> <span class="p">[</span><span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="n">word</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">gender_vector</span><span class="p">)</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
<span class="n">affluence_proj</span> <span class="o">=</span> <span class="p">[</span><span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="n">word</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">affluence_vector</span><span class="p">)</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
<span class="c1"># ========================================================</span>


<span class="c1"># ====== Plotting ======</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span> <span class="c1"># grab the current Axes for later calls, especially for setting limits</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;爱好的(性别-财富)刻板印象&#39;</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;axes.titlesize&#39;</span><span class="p">])</span> <span class="c1"># use the title size and padding defined in the style</span>

<span class="c1"># Set display limits slightly wider than the data range, leaving room for axis labels and arrows</span>
<span class="c1"># First compute a reasonable range from the data, then adjust as needed</span>
<span class="n">x_data_min</span><span class="p">,</span> <span class="n">x_data_max</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">affluence_proj</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">affluence_proj</span><span class="p">)</span>
<span class="n">y_data_min</span><span class="p">,</span> <span class="n">y_data_max</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">gender_proj</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">gender_proj</span><span class="p">)</span>
<span class="n">x_range</span> <span class="o">=</span> <span class="n">x_data_max</span> <span class="o">-</span> <span class="n">x_data_min</span>
<span class="n">y_range</span> <span class="o">=</span> <span class="n">y_data_max</span> <span class="o">-</span> <span class="n">y_data_min</span>

<span class="c1"># Either use a fixed range, or compute one dynamically from the data range</span>
<span class="n">x_lims</span> <span class="o">=</span> <span class="p">(</span><span class="n">x_data_min</span> <span class="o">-</span> <span class="n">x_range</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">,</span> <span class="n">x_data_max</span> <span class="o">+</span> <span class="n">x_range</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">)</span>
<span class="n">y_lims</span> <span class="o">=</span> <span class="p">(</span><span class="n">y_data_min</span> <span class="o">-</span> <span class="n">y_range</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">,</span> <span class="n">y_data_max</span> <span class="o">+</span> <span class="n">y_range</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">)</span>
<span class="c1"># Or use a fixed, symmetric range, for example:</span>
<span class="n">max_abs_x</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">x_data_min</span><span class="p">),</span> <span class="nb">abs</span><span class="p">(</span><span class="n">x_data_max</span><span class="p">))</span>
<span class="n">max_abs_y</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">y_data_min</span><span class="p">),</span> <span class="nb">abs</span><span class="p">(</span><span class="n">y_data_max</span><span class="p">))</span>
<span class="n">plot_lim</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_abs_x</span><span class="p">,</span> <span class="n">max_abs_y</span><span class="p">)</span> <span class="o">*</span> <span class="mf">1.3</span> <span class="c1"># make sure the range covers every point with some margin</span>
<span class="n">x_lims</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">plot_lim</span><span class="p">,</span> <span class="n">plot_lim</span><span class="p">)</span>
<span class="n">y_lims</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">plot_lim</span><span class="p">,</span> <span class="n">plot_lim</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">x_lims</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">y_lims</span><span class="p">)</span>


<span class="c1"># Draw the scatter points</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">affluence_proj</span><span class="p">,</span> <span class="n">gender_proj</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">&#39;lightgray&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span> <span class="c1"># tweak marker size and colors for readability</span>

<span class="c1"># Draw the center axis lines through (0, 0)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># Background shading grid (uses the updated plot limits)</span>
<span class="n">xx</span><span class="p">,</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">*</span><span class="n">x_lims</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">*</span><span class="n">y_lims</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">contourf</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">,</span> <span class="n">xx</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">yy</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s1">&#39;gray_r&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>



<span class="c1"># Axis end labels and arrows (closer to the style of Fig. 3)</span>
<span class="n">arrow_length_x</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">get_xlim</span><span class="p">()[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.95</span> <span class="c1"># arrow length is 95% of the axis range</span>
<span class="n">arrow_length_y</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">get_ylim</span><span class="p">()[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.95</span>
<span class="n">head_width_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">ax</span><span class="o">.</span><span class="n">get_xlim</span><span class="p">()[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">ax</span><span class="o">.</span><span class="n">get_xlim</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="mf">0.015</span> <span class="c1"># scale the arrow head width to the axis range</span>
<span class="n">head_width_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">ax</span><span class="o">.</span><span class="n">get_ylim</span><span class="p">()[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">ax</span><span class="o">.</span><span class="n">get_ylim</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="mf">0.015</span>

<span class="c1"># Affluence axis (X): poor to rich</span>
<span class="n">plt</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">arrow_length_x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">head_width</span><span class="o">=</span><span class="n">head_width_y</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="n">head_width_x</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">length_includes_head</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="n">arrow_length_x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">head_width</span><span class="o">=</span><span class="n">head_width_y</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="n">head_width_x</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">length_includes_head</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>

<span class="c1"># Gender axis (Y): female to male</span>
<span class="n">plt</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">arrow_length_y</span><span class="p">,</span> <span class="n">head_width</span><span class="o">=</span><span class="n">head_width_x</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="n">head_width_y</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">length_includes_head</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="n">arrow_length_y</span><span class="p">,</span> <span class="n">head_width</span><span class="o">=</span><span class="n">head_width_x</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="n">head_width_y</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">length_includes_head</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>


<span class="c1"># Add the word labels</span>
<span class="c1"># Iterate over the data points and words</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">affluence_proj</span><span class="p">,</span> <span class="n">gender_proj</span><span class="p">,</span> <span class="n">words</span><span class="p">)):</span>
    <span class="c1"># adjust the xytext offset to fine-tune label placement</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">textcoords</span><span class="o">=</span><span class="s2">&#34;offset points&#34;</span><span class="p">,</span> <span class="n">xytext</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;font.size&#39;</span><span class="p">])</span>


<span class="c1"># Axis titles</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Affluence (</span><span class="si">{</span><span class="nb">chr</span><span class="p">(</span><span class="mi">8592</span><span class="p">)</span><span class="si">}</span><span class="s1"> 贫穷 | 富有 </span><span class="si">{</span><span class="nb">chr</span><span class="p">(</span><span class="mi">8594</span><span class="p">)</span><span class="si">}</span><span class="s1">)&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;axes.labelsize&#39;</span><span class="p">])</span> <span class="c1"># arrow glyphs make the direction clearer</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Gender (</span><span class="si">{</span><span class="nb">chr</span><span class="p">(</span><span class="mi">8595</span><span class="p">)</span><span class="si">}</span><span class="s1"> 女性化程度 | 男性化程度 </span><span class="si">{</span><span class="nb">chr</span><span class="p">(</span><span class="mi">8593</span><span class="p">)</span><span class="si">}</span><span class="s1">)&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;axes.labelsize&#39;</span><span class="p">])</span> <span class="c1"># arrow glyphs make the direction clearer</span>


<span class="c1"># The style already enables a grid; uncomment the next line to customize it</span>
<span class="c1">#plt.grid(True, linestyle=&#39;--&#39;, alpha=0.5)</span>

<span class="c1"># Keep the layout tight</span>
<span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span>

<span class="c1"># Show the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/04-hobby.png" alt=""  />

Interpreting the projections of the hobby words onto the gender axis and the affluence axis.</p>
<ul>
<li>On the gender axis, 旅行 (travel), 瑜伽 (yoga), and 美妆 (beauty) are strongly associated with women, while 高尔夫 (golf), 网球 (tennis), 羽毛球 (badminton), 麻将 (mahjong), and 书法 (calligraphy) are strongly associated with men.</li>
<li>On the affluence axis, the wealthy hobbies are 瑜伽 (yoga), 高尔夫 (golf), 旅行 (travel), 网球 (tennis), 麻将 (mahjong), and 羽毛球 (badminton), while 象棋 (Chinese chess), 美妆 (beauty), 武术 (martial arts), and 围棋 (Go) project toward the poor end, hobbies that apparently need not cost much.</li>
</ul>
<p><img loading="lazy" src="img/05-food.png" alt=""  />
</p>
<p>Interpreting the projections of the food words onto the gender axis and the affluence axis. The food projections fall mostly on the right side of the chart.</p>
<ul>
<li>On the gender axis, 啤酒 (beer), 白酒 (baijiu), and 烧烤 (barbecue) read as strongly male, while 披萨 (pizza), 沙拉 (salad), 寿司 (sushi), and 牛排 (steak) read as female.</li>
<li>On the affluence axis, 牛排 (steak) and 寿司 (sushi) lean wealthy and 小米粥 (millet porridge) leans poor; overall, foods are not strongly differentiated on this axis.</li>
</ul>
<p>Overall, among foods, the tasty and expensive items are far more strongly associated with women than with men.</p>
<p><img loading="lazy" src="img/06-brand.png" alt=""  />

Interpreting the projections of the brand words onto the gender axis and the affluence axis.</p>
<ul>
<li>On the gender axis, 耐克 (Nike), 阿迪 (Adidas), 海澜之家 (HLA), and 阿玛尼 (Armani) are more strongly associated with men, while 优衣库 (Uniqlo), lv, 爱马仕 (Hermès), and coach are more strongly associated with women.</li>
<li>On the affluence axis, lv, coach, and 爱马仕 (Hermès) are semantically tied to wealth, while 耐克 (Nike), 阿迪 (Adidas), and 海澜之家 (HLA) sit toward the poor end.</li>
</ul>
<p>In short, this static analysis of the Dianping review corpus surfaces the perceptions, culture, and stereotypes that today's consumer society attaches to brands, foods, and hobbies. With corpora from different eras, we could also trace how the culture changes over time.</p>
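<p>The concept-axis construction and projection used above can be approximated in a few lines of numpy. This is a sketch of the general SemAxis-style idea, not cntext's exact implementation of <code>generate_concept_axis</code> / <code>project_word</code>; the toy 3-d vectors are made up, and the real pole-averaging and normalization details may differ:</p>

```python
import numpy as np

# Toy 3-d "embeddings"; real work would look these words up in the loaded GloVe model
emb = {
    '男人': np.array([0.9, 0.1, 0.0]),
    '女人': np.array([0.1, 0.9, 0.0]),
    '高尔夫': np.array([0.7, 0.2, 0.6]),
    '美妆': np.array([0.1, 0.8, 0.3]),
}

def concept_axis(poswords, negwords):
    # Axis = mean of the positive-pole vectors minus mean of the negative-pole vectors,
    # length-normalized so projections are comparable across axes
    pos = np.mean([emb[w] for w in poswords], axis=0)
    neg = np.mean([emb[w] for w in negwords], axis=0)
    axis = pos - neg
    return axis / np.linalg.norm(axis)

def project(word, axis):
    # Scalar projection onto the axis: > 0 leans toward the positive pole
    return float(emb[word] @ axis)

gender_axis = concept_axis(['男人'], ['女人'])
print(project('高尔夫', gender_axis))  # positive: leans toward the male pole
print(project('美妆', gender_axis))    # negative: leans toward the female pole
```

<p>With a real model, <code>emb[w]</code> would simply be <code>wv[w]</code> from the trained vectors.</p>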
<br>
<h3 id="32-考虑时间因素">3.2 Adding the Time Dimension</h3>
<p>Using the People's Daily corpus as an example, we train one word-embedding model per decade and watch how word meanings shift across the years. For the training code, see <a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">Visualization | Seventy Years of Cultural Change in the People's Daily Corpus</a>. That article characterizes culture through semantic distance, i.e.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">distance = distance(女, 成功) - distance(男, 成功)
</code></pre></div><ul>
<li>If distance is close to 0, 男 (man) and 女 (woman) are semantically equidistant from 成功 (success), so there is no obvious stereotype.</li>
<li>But when distance is clearly greater than 0, the concept of success brings men to mind more readily than women.</li>
</ul>
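<p>The distance comparison above can be sketched with plain cosine distances. The three toy vectors below are hypothetical stand-ins; with a real model trained on a given decade, you would look the words up in that model instead:</p>

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors standing in for wv['女'], wv['男'], wv['成功']
woman   = np.array([0.9, 0.1, 0.2])
man     = np.array([0.2, 0.9, 0.3])
success = np.array([0.3, 0.8, 0.4])

# distance = distance(女, 成功) - distance(男, 成功)
bias = cosine_distance(woman, success) - cosine_distance(man, success)
print(bias)  # > 0 with these toy vectors: "success" sits semantically closer to "man"
```

<p>Computing this per decade, then plotting bias against time, yields curves like the ones below.</p>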
<br>
<p><strong>Gender and achievement</strong></p>
<p><img loading="lazy" src="img/07-gender.png" alt=""  />
</p>
<p>As the chart shows, in the early years of the People's Republic, China's women's liberation movement led the world and achieved remarkable results. Slogans that are still familiar today show how deeply that era's propaganda was etched into Chinese minds, for example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 谁说女子不如男
- 不爱红装爱武装
- 女人撑起半边天
...
</code></pre></div><p>In the early years of the People's Republic, because gender stereotypes were being actively dismantled, talk of the concept of success was more gender-neutral and role models were chosen with gender balance in mind. As the slogan-driven campaigns faded, the inertia of history (the genes of traditional culture) seems to have reawakened: when the concept of success comes up, society again links it more readily with men.</p>
<br>
<p><strong>Gender and responsibility</strong></p>
<p>If achievement is more strongly tied to men, does that reflect a society, built on traditional culture, that demands far more responsibility from men than from women?</p>
<p><img loading="lazy" src="img/08-responsibility.png" alt=""  />
</p>
<p>The chart shows that in most years distance is greater than 0: when the concept of responsibility comes up, society is more likely to think of men than of women.</p>
<p><br><br></p>
<h2 id="相关资料">Related Resources</h2>
<ul>
<li><a href="https://textdata.cn/blog/management_python_course/">Video Course | Building Empirical Measures and Text Analysis with Python</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | Text-Mining Literature for the Social Sciences (Economics &amp; Management)</a></li>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | Datasets Available for Social-Science (Economics &amp; Management) Research</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">Tutorial | Turning Text Data into Structured Data with Ollama and Large Language Models</a></li>
<li><a href="https://textdata.cn/blog/">https://textdata.cn/</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>Case Study | Training GloVe Word Vectors on a Chinese Court Judgements Corpus</title>
      <link>https://textdata.cn/blog/2025-04-17-training-a-glove-model-using-china-judgements-corpus/</link>
      <pubDate>Thu, 17 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-04-17-training-a-glove-model-using-china-judgements-corpus/</guid>
      <description>&lt;p&gt;A while back I shared &lt;a href=&#34;https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/&#34;&gt;Experiment | Training a GloVe Model on a Chinese Corpus with the Stanford GloVe Code&lt;/a&gt;. I have since quietly revised that post and wrapped the C code inside cntext2.x, so the original few dozen lines of training code shrink to just two.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#pip3 install cntext --upgrade&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GloVe&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;语料文件.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一检查数据&#34;&gt;1. Inspect the Data&lt;/h2&gt;
&lt;p&gt;In the judgement-documents dataset, each month is stored in one csv file and each year has a corresponding folder. Below is a screenshot of the 2021 folder:
&lt;img loading=&#34;lazy&#34; src=&#34;img/2021.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;All the csv files share the same schema, so we can pick any one file, read its first 5 rows, and see which columns the data contains.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2013/2013-01.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;subset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文书内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二训练词向量&#34;&gt;2. Train the Word Vectors&lt;/h2&gt;
&lt;h3 id=&#34;21-构造语料&#34;&gt;2.1 Build the Corpus&lt;/h3&gt;
&lt;p&gt;From each csv we keep only the &amp;ldquo;&lt;strong&gt;文书内容&lt;/strong&gt;&amp;rdquo; (document text) column and write it to a corpus txt file. The full judgements dataset runs to roughly 300 GB, but I want to keep the text corpus at around 10 GB.&lt;/p&gt;
&lt;p&gt;The 2010, 2011, and 2012 data are only a few hundred MB each, so they are kept in full. For the remaining years, different sampling ratios are set so that each year's corpus txt comes out at roughly 1 GB.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Size after decompression&lt;/th&gt;
&lt;th&gt;Sampling ratio&lt;/th&gt;
&lt;th&gt;Corpus txt size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;761M&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;684M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2011&lt;/td&gt;
&lt;td&gt;452M&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;396M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;757M&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;665M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;5.13G&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;984M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;23.7G&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;905M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;33.6G&lt;/td&gt;
&lt;td&gt;3%&lt;/td&gt;
&lt;td&gt;968M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;39.9G&lt;/td&gt;
&lt;td&gt;2.4%&lt;/td&gt;
&lt;td&gt;914M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;44.6G&lt;/td&gt;
&lt;td&gt;2.2%&lt;/td&gt;
&lt;td&gt;882M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;24.8G&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;875M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;48.3G&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;833M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;91.2G&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;779M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;32.3G&lt;/td&gt;
&lt;td&gt;3%&lt;/td&gt;
&lt;td&gt;816M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;tqdm&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 年份、抽样比例&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;year_fracs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2010&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2011&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2012&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2013&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2014&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.04&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2015&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.03&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2016&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2017&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.022&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2018&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.04&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2019&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.02&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2020&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.01&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2021&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.03&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;



&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;裁判文书.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frac&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_fracs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;csvfs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;listdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;.csv&amp;#39;&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvfs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;c1&#34;&gt;# 为节省内存开销，&lt;/span&gt;
            &lt;span class=&#34;c1&#34;&gt;# 只读 csv 中的 “文书内容” 一个字段，&lt;/span&gt;
            &lt;span class=&#34;c1&#34;&gt;# 且设置 chunksize 分批次读取&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文书内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;subset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文书内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
                &lt;span class=&#34;c1&#34;&gt;# 按当年抽样比例 frac 对分块抽样，控制语料规模&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frac&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frac&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文书内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-训练-glove&#34;&gt;2.2 训练 GloVe&lt;/h3&gt;
&lt;p&gt;cntext2.x 对训练代码做了优化：预处理几个 G 的语料时，cntext 不会一次性读入全部内容，所以一般情况下不会出现内存溢出问题。&lt;/p&gt;
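&lt;p&gt;“不会一次性读取全部内容”的流式思路，可用下面的纯 Python 草图示意（仅为示意，并非 cntext 源码；文件名、批大小均为假设）：&lt;/p&gt;

```python
# 按批次惰性读取大语料的草图: 任意时刻内存中只保留一个批次
def iter_corpus(path, batch_size=3):
    batch = []
    with open(path, encoding='utf-8') as f:
        for line in f:               # 逐行读取, 不会把整个文件载入内存
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                        # 不足一个批次的尾部也要产出
        yield batch

# 用法示意: 先写一个 7 行的小语料文件, 再分批读取
with open('demo_corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(f'文书 {i}' for i in range(7)))

batches = list(iter_corpus('demo_corpus.txt', batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```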
&lt;p&gt;基于语料 &lt;strong&gt;&lt;em&gt;裁判文书.txt&lt;/em&gt;&lt;/strong&gt; 训练 GloVe 词嵌入语言模型，参数 window_size=15、vector_size=200，结果会自动保存到 output 文件夹内。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-corpus.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;使用 &lt;strong&gt;&lt;em&gt;cntext2.1.6&lt;/em&gt;&lt;/strong&gt;， 代码如下&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GloVe&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;裁判文书.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/裁判文书_cache.txt Not Found, Preprocessing Corpus
Processing Corpus: 100%|██████████████████| 2502938/2502938 [26:37&amp;lt;00:00, 1566.54it/s]
Reading Preprocessed Corpus from output/裁判文书_cache.txt
Start Training GloVe
GloVe Training Cost 1223s.
Output Saved To: output/裁判文书-Word2Vec.200.15.bin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;训练总耗时 1223s， 约 20 分钟。模型保存在 &lt;strong&gt;&lt;em&gt;output/裁判文书-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt;， 该模型文件大小约为 1.58G。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-评估模型&#34;&gt;2.3 评估模型&lt;/h3&gt;
&lt;p&gt;使用近义法和类比法， 判断模型的表现。详情可查看&lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;文档&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt
Processing Similarity Test: 100%|███████████| 537/537 [00:00&amp;lt;00:00, 131978.28it/s]

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   432    |    105     |            0.37            |
+----------+------------+----------------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;近义测试&lt;/strong&gt;: Spearman&amp;rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型表现越好。&lt;br&gt;&lt;/p&gt;
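&lt;p&gt;Spearman 等级相关系数衡量两组打分的名次一致程度，可用如下纯 Python 草图说明其含义（草图假设打分无并列名次，cntext 的实际实现可能不同）：&lt;/p&gt;

```python
# Spearman 等级相关系数的最小实现(无并列名次的情形)
def spearman(xs, ys):
    def ranks(vals):
        # 将数值转为名次(从 1 开始)
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# 人工打分与模型相似度完全同序时系数为 1, 完全反序时为 -1
print(spearman([1, 2, 3, 4], [0.1, 0.4, 0.7, 0.9]))   # 1.0
print(spearman([1, 2, 3, 4], [0.9, 0.7, 0.4, 0.1]))   # -1.0
```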
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|████████████████| 1198/1198 [00:48&amp;lt;00:00, 24.75it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   507    |    170     |    7.69    |   4.38   |
|   CityInProvince   |   175    |     0      |   98.86    |   1.39   |
| FamilyRelationship |   272    |     0      |   73.53    |   1.56   |
|   SocialScience    |    8     |     62     |   25.00    |   7.00   |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;类比测试&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries 裁判文书语料在此项表现较差， 应该是数据库中涉外的案件较少。&lt;/li&gt;
&lt;li&gt;CityInProvince 裁判文书语料在此项表现如此优异，是因为几乎全为国内案件， 而案件描述一般会交待案发的省市等信息。&lt;/li&gt;
&lt;li&gt;FamilyRelationship 裁判文书语料在此项表现较好， 可能是因为很多案件会描述当事人的亲属关系与社会关系。&lt;/li&gt;
&lt;li&gt;SocialScience 裁判文书语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;整体而言，语料训练的效果很不错，模型抓住了裁判文书这一数据场景的独特语义。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三使用-glove&#34;&gt;三、使用 GloVe&lt;/h2&gt;
&lt;h3 id=&#34;31-导入模型&#34;&gt;3.1 导入模型&lt;/h3&gt;
&lt;p&gt;使用 cntext2.1.6 读取很简单， 代码如下&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/裁判文书-Word2Vec.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;模型词汇量: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;lt;class &amp;#39;gensim.models.keyedvectors.KeyedVectors&amp;#39;&amp;gt;
模型词汇量: 2099102
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-keyedvectors-的操作方法或属性&#34;&gt;3.2 KeyedVectors 的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.index_to_key&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.key_to_index&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.vector_size&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取 GloVe 模型中任意词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.get_vector(word)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-词表&#34;&gt;3.3 词表&lt;/h3&gt;
&lt;p&gt;查看词表&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index_to_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;被告&amp;#39;,
 &amp;#39;原告&amp;#39;,
 &amp;#39;本院&amp;#39;,
 &amp;#39;公司&amp;#39;,
 &amp;#39;规定&amp;#39;,
 &amp;#39;执行&amp;#39;,
 ...
]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;p&gt;查看词汇映射表&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;被告&amp;#39;: 0,
 &amp;#39;原告&amp;#39;: 1,
 &amp;#39;本院&amp;#39;: 2,
 &amp;#39;公司&amp;#39;: 3,
 &amp;#39;规定&amp;#39;: 4,
 &amp;#39;执行&amp;#39;: 5,
 ...
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-查看词向量&#34;&gt;3.4 查看词向量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查询某词的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;经济&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 2.909250e-01,  9.074450e-01,  5.231860e-01,  5.381490e-01,
       -2.813620e-01,  2.661690e-01,  1.045510e-01, -4.516240e-01,
       -2.186710e-01,  1.867590e-01, -4.870700e-01, -1.803480e-01,
       -6.361140e-01, -8.739630e-01,  3.418450e-01,  7.470900e-02,
        ......
        ......
        2.636230e-01, -2.538920e-01, -2.442900e-02,  5.847510e-01,
        5.135750e-01, -4.009650e-01, -3.629850e-01,  2.332400e-01,
       -3.069630e-01, -4.182810e-01,  3.937240e-01, -8.510000e-01,
        7.894350e-01,  3.969710e-01,  7.895660e-01,  4.881190e-01],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查询多个词的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_mean_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;经济&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;犯罪&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 0.02923387,  0.04620265,  0.03790346,  0.01160904, -0.02162073,
        0.01537724,  0.02025648, -0.03336571, -0.00447518, -0.00529976,
       -0.02856204,  0.01545951,  0.00780857, -0.05398807,  0.02195465,
        0.03140446, -0.02007412,  0.08278576, -0.027172  , -0.00272319,
       ......
        0.0291778 ,  0.03382879, -0.00913138,  0.04487584,  0.06375133,
        0.032144  , -0.02788475,  0.05068161,  0.0122064 ,  0.01759091,
       -0.05560436,  0.00272704, -0.01176615, -0.08875326,  0.00767812,
       -0.00486504,  0.10119167, -0.01212235,  0.06018812,  0.02998512],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;35-近义词&#34;&gt;3.5 近义词&lt;/h3&gt;
&lt;p&gt;根据词语查找最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;动机&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;主观&amp;#39;, 0.6688777804374695),
 (&amp;#39;意图&amp;#39;, 0.6248725652694702),
 (&amp;#39;恶性&amp;#39;, 0.6005507111549377),
 (&amp;#39;蓄意&amp;#39;, 0.5913136005401611),
 (&amp;#39;卑劣&amp;#39;, 0.5908187627792358),
 (&amp;#39;作案动机&amp;#39;, 0.5703221559524536),
 (&amp;#39;心态&amp;#39;, 0.5640602707862854),
 (&amp;#39;故意&amp;#39;, 0.5533956289291382),
 (&amp;#39;显而易见&amp;#39;, 0.5524264574050903),
 (&amp;#39;恶意&amp;#39;, 0.5509642958641052)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
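&lt;p&gt;similar_by_word 返回的分值是词向量之间的余弦相似度，其计算过程可用如下草图示意：&lt;/p&gt;

```python
import math

# 余弦相似度: 两向量夹角的余弦, 取值范围 [-1, 1]
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0, 正交向量
print(cosine([1.0, 1.0], [2.0, 2.0]))  # 约 1.0, 方向相同
```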
&lt;p&gt;根据某词的词向量查询最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;动机&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;动机&amp;#39;, 0.9999999403953552),
 (&amp;#39;主观&amp;#39;, 0.6688777804374695),
 (&amp;#39;意图&amp;#39;, 0.6248724460601807),
 (&amp;#39;恶性&amp;#39;, 0.600550651550293),
 (&amp;#39;蓄意&amp;#39;, 0.5913134813308716),
 (&amp;#39;卑劣&amp;#39;, 0.5908187627792358),
 (&amp;#39;作案动机&amp;#39;, 0.5703221559524536),
 (&amp;#39;心态&amp;#39;, 0.5640602707862854),
 (&amp;#39;故意&amp;#39;, 0.5533955693244934),
 (&amp;#39;显而易见&amp;#39;, 0.5524263381958008)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;多个词求得均值向量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;purpose_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_mean_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;动机&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;意图&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;目的&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;g_wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;purpose_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;意图&amp;#39;, 0.9032057523727417),
 (&amp;#39;目的&amp;#39;, 0.8639562726020813),
 (&amp;#39;动机&amp;#39;, 0.8277378678321838),
 (&amp;#39;主观&amp;#39;, 0.7455390095710754),
 (&amp;#39;恶意&amp;#39;, 0.7291366457939148),
 (&amp;#39;故意&amp;#39;, 0.7236210107803345),
 (&amp;#39;客观&amp;#39;, 0.7146263122558594),
 (&amp;#39;企图&amp;#39;, 0.7049675583839417),
 (&amp;#39;行为&amp;#39;, 0.6962229609489441),
 (&amp;#39;掩盖&amp;#39;, 0.6917882561683655),
 (&amp;#39;所谓&amp;#39;, 0.6809536218643188),
 (&amp;#39;并非&amp;#39;, 0.667915403842926),
 (&amp;#39;手段&amp;#39;, 0.6663289666175842),
 (&amp;#39;利益&amp;#39;, 0.6568542718887329),
 (&amp;#39;这种&amp;#39;, 0.6558799743652344),
 (&amp;#39;欺骗&amp;#39;, 0.6545097231864929),
 (&amp;#39;违背&amp;#39;, 0.6538694500923157),
 (&amp;#39;真相&amp;#39;, 0.6527130007743835),
 (&amp;#39;显然&amp;#39;, 0.6525647640228271),
 (&amp;#39;实质&amp;#39;, 0.6521809101104736)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四获取模型&#34;&gt;四、获取模型&lt;/h2&gt;
&lt;p&gt;内容创作不易， 本文为付费内容。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 裁判文书-GloVe.200.15.bin   https://pan.baidu.com/s/1a0Fisvnkl8UaQZrHP7olCQ?pwd=8w49

- 更多词向量模型               https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
      <content:encoded><![CDATA[<p>前阵子分享了 <a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型</a>，后来我悄悄修改了这篇技术文，将 C 代码封装进 cntext2.x：原来的训练代码有几十行，现在只需两行就能训练 GloVe 模型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext --upgrade</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">g_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;语料文件.txt&#39;</span><span class="p">,</span> <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="一检查数据">一、检查数据</h2>
<p>裁判文书数据集，每个月份存储到一个 csv， 每个年份有一个对应的文件夹。下图是 2021 年的文件夹截图
<img loading="lazy" src="img/2021.png" alt=""  />
</p>
<br>
<p>各 csv 的字段格式是一致的，我们只需任选一个文件，读取前 5 行，查看数据中有哪些字段。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2013/2013-01.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;文书内容&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二训练词向量">二、训练词向量</h2>
<h3 id="21-构造语料">2.1 构造语料</h3>
<p>From each csv we keep only the &ldquo;<strong>文书内容</strong>&rdquo; (judgment text) field and write it into a corpus txt file. The full dataset runs to about 300 GB, but I want to keep the text corpus around 10 GB.</p>
<p>The 2010/2011/2012 files are only a few hundred MB each, so those three years are kept in full. For the remaining years, different sampling ratios are set so that each year&rsquo;s corpus txt stays around 1 GB.</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Decompressed size</th>
<th>Sampling ratio</th>
<th>Corpus txt size</th>
</tr>
</thead>
<tbody>
<tr>
<td>2010</td>
<td>761M</td>
<td>100%</td>
<td>684M</td>
</tr>
<tr>
<td>2011</td>
<td>452M</td>
<td>100%</td>
<td>396M</td>
</tr>
<tr>
<td>2012</td>
<td>757M</td>
<td>100%</td>
<td>665M</td>
</tr>
<tr>
<td>2013</td>
<td>5.13G</td>
<td>20%</td>
<td>984M</td>
</tr>
<tr>
<td>2014</td>
<td>23.7G</td>
<td>4%</td>
<td>905M</td>
</tr>
<tr>
<td>2015</td>
<td>33.6G</td>
<td>3%</td>
<td>968M</td>
</tr>
<tr>
<td>2016</td>
<td>39.9G</td>
<td>2.4%</td>
<td>914M</td>
</tr>
<tr>
<td>2017</td>
<td>44.6G</td>
<td>2.2%</td>
<td>882M</td>
</tr>
<tr>
<td>2018</td>
<td>24.8G</td>
<td>4%</td>
<td>875M</td>
</tr>
<tr>
<td>2019</td>
<td>48.3G</td>
<td>2%</td>
<td>833M</td>
</tr>
<tr>
<td>2020</td>
<td>91.2G</td>
<td>1%</td>
<td>779M</td>
</tr>
<tr>
<td>2021</td>
<td>32.3G</td>
<td>3%</td>
<td>816M</td>
</tr>
</tbody>
</table>
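<p>The sampling ratios above follow from each year&rsquo;s decompressed size and the roughly 1 GB per-year budget. A minimal sketch (the sizes come from the table; the rounding to final percentages is my own choice):</p>

```python
# Estimate a per-year sampling fraction so each corpus txt lands near a target size.
TARGET_GB = 1.0

# decompressed sizes (GB) for the sampled years, taken from the table above
year_sizes = {'2013': 5.13, '2014': 23.7, '2015': 33.6, '2016': 39.9,
              '2017': 44.6, '2018': 24.8, '2019': 48.3, '2020': 91.2, '2021': 32.3}

def sample_frac(size_gb, target_gb=TARGET_GB):
    """Fraction of rows to keep; small years are kept in full (capped at 1.0)."""
    return min(1.0, target_gb / size_gb)

fracs = {year: round(sample_frac(size), 3) for year, size in year_sizes.items()}
print(fracs['2013'])  # 0.195 -- close to the 20% used above
```

<p>Rounding these estimates to convenient percentages reproduces the ratios in the table.</p>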
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="c1"># (year, sampling fraction) pairs</span>
<span class="n">year_fracs</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s1">&#39;2010&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2011&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2012&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;2013&#39;</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2014&#39;</span><span class="p">,</span> <span class="mf">0.04</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2015&#39;</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;2016&#39;</span><span class="p">,</span> <span class="mf">0.024</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2017&#39;</span><span class="p">,</span> <span class="mf">0.022</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2018&#39;</span><span class="p">,</span> <span class="mf">0.04</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;2019&#39;</span><span class="p">,</span> <span class="mf">0.02</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2020&#39;</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;2021&#39;</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">)</span>
    <span class="p">]</span>



<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;裁判文书.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">corpus_file</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">frac</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">year_fracs</span><span class="p">):</span>
        <span class="n">csvfs</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">/</span><span class="si">{</span><span class="n">csvf</span><span class="si">}</span><span class="s1">&#39;</span> <span class="k">for</span> <span class="n">csvf</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">year</span><span class="p">)</span> <span class="k">if</span> <span class="s1">&#39;.csv&#39;</span> <span class="ow">in</span> <span class="n">csvf</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">csvf</span> <span class="ow">in</span> <span class="n">csvfs</span><span class="p">:</span>
            <span class="c1"># To save memory, read only the</span>
            <span class="c1"># &#39;文书内容&#39; column from the csv,</span>
            <span class="c1"># and read in batches via chunksize</span>
            <span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;文书内容&#39;</span><span class="p">],</span> <span class="n">chunksize</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="n">chunk_dfs</span><span class="p">:</span>
                <span class="n">chunk_df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;文书内容&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
                <span class="c1"># apply the per-year sampling fraction frac</span>
                <span class="n">chunk_df</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="n">frac</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
                <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;文书内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
                <span class="n">corpus_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span> <span class="o">+</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="22-训练-glove">2.2 Training GloVe</h3>
<p>cntext 2.x optimizes this step: during preprocessing, cntext does not load the whole corpus into memory at once, so a corpus of several GB normally will not cause an out-of-memory error.</p>
<p>We train a GloVe word-embedding model on the corpus <strong><em>裁判文书.txt</em></strong> with window_size=15 and vector_size=200; the result is saved automatically to the output folder.</p>
<p><img loading="lazy" src="img/01-corpus.png" alt=""  />
</p>
<p>With <strong><em>cntext2.1.6</em></strong>, the code is as follows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">g_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;裁判文书.txt&#39;</span><span class="p">,</span> <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/裁判文书_cache.txt Not Found, Preprocessing Corpus
Processing Corpus: 100%|██████████████████| 2502938/2502938 [26:37&lt;00:00, 1566.54it/s]
Reading Preprocessed Corpus from output/裁判文书_cache.txt
Start Training GloVe
GloVe Training Cost 1223s.
Output Saved To: output/裁判文书-Word2Vec.200.15.bin
</code></pre></div><p>Training took 1223s in total, about 20 minutes. The model is saved to <strong><em>output/裁判文书-Word2Vec.200.15.bin</em></strong>; the file is about 1.58 GB.</p>
<br>
<h3 id="23-评估模型">2.3 Evaluating the Model</h3>
<p>We judge the model with a similarity (synonym) test and an analogy test. See the <a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">documentation</a> for details.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">g_wv</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt
Processing Similarity Test: 100%|███████████| 537/537 [00:00&lt;00:00, 131978.28it/s]

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   432    |    105     |            0.37            |
+----------+------------+----------------------------+
</code></pre></div><p><strong>Similarity test</strong>: Spearman&rsquo;s rank coefficient lies in [-1, 1]; the larger the value, the better the model performs.<br></p>
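<p>Under the hood, the similarity test computes the model&rsquo;s cosine similarity for each word pair in similarity.txt and correlates those scores with the human ratings via Spearman&rsquo;s rank correlation. A self-contained sketch of the statistic itself (the toy scores below are invented, not the real test data):</p>

```python
def rankdata(xs):
    # 1-based average ranks; tied values share the mean of their positions
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(human_scores, model_scores):
    # Spearman's rho = Pearson correlation of the two rank vectors
    rh, rm = rankdata(human_scores), rankdata(model_scores)
    n = len(rh)
    mh, mm = sum(rh) / n, sum(rm) / n
    cov = sum((a - mh) * (b - mm) for a, b in zip(rh, rm))
    sh = sum((a - mh) ** 2 for a in rh) ** 0.5
    sm = sum((b - mm) ** 2 for b in rm) ** 0.5
    return cov / (sh * sm)

# identical orderings give rho = 1.0; fully reversed orderings give -1.0
print(spearman([0.9, 0.5, 0.1], [0.8, 0.6, 0.2]))  # 1.0
```

<p>A rho of 0.37 therefore means the model&rsquo;s similarity ordering agrees moderately with the human one.</p>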
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">g_wv</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|████████████████| 1198/1198 [00:48&lt;00:00, 24.75it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   507    |    170     |    7.69    |   4.38   |
|   CityInProvince   |   175    |     0      |   98.86    |   1.39   |
| FamilyRelationship |   272    |     0      |   73.53    |   1.56   |
|   SocialScience    |    8     |     62     |   25.00    |   7.00   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>Analogy test</strong>:</p>
<ul>
<li>CapitalOfCountries: the corpus performs poorly here, presumably because the database contains few cases involving foreign countries.</li>
<li>CityInProvince: performance is excellent because nearly all cases are domestic, and case descriptions usually state the province and city where the case took place.</li>
<li>FamilyRelationship: fairly strong, probably because many cases describe the social relationships of the parties involved.</li>
<li>SocialScience: middling, presumably because common social-science terms appear rarely in this corpus.</li>
</ul>
<p>Overall, training on this corpus works well: the model captures the semantics unique to this data domain.</p>
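<p>The analogy test itself relies on vector arithmetic: for a relation a:b :: c:?, it looks for the word whose vector is closest to vec(b) - vec(a) + vec(c). A toy sketch (the tiny vocabulary and 2-d vectors are invented for illustration, not taken from the trained model):</p>

```python
import math

# invented 2-d vectors standing in for the trained KeyedVectors
toy_wv = {
    '中国': [1.0, 0.0], '北京': [1.0, 1.0],
    '法国': [0.0, 0.1], '巴黎': [0.0, 1.1],
    '法院': [5.0, 5.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, wv=toy_wv):
    """Answer a:b :: c:? via the nearest neighbour of vec(b) - vec(a) + vec(c)."""
    target = [wv[b][i] - wv[a][i] + wv[c][i] for i in range(len(wv[a]))]
    candidates = [w for w in wv if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(wv[w], target))

print(analogy('中国', '北京', '法国'))  # 巴黎
```

<p>With gensim KeyedVectors, the equivalent query is most_similar(positive=[b, c], negative=[a]).</p>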
<p><br><br></p>
<h2 id="三使用-glove">3. Using GloVe</h2>
<h3 id="31-导入模型">3.1 Loading the Model</h3>
<p>Loading the model with cntext2.1.6 is straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">g_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/裁判文书-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">g_wv</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;模型词汇量: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">g_wv</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;class &#39;gensim.models.keyedvectors.KeyedVectors&#39;&gt;
模型词汇量: 2099102
</code></pre></div><br>
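<p>The ~1.58 GB file size is consistent with this vocabulary size: each of the 2,099,102 words stores a 200-dimensional float32 vector. A quick back-of-envelope check:</p>

```python
# raw size of the vector matrix: vocab x dims x 4 bytes (float32)
vocab_size, dims, bytes_per_float = 2_099_102, 200, 4
size_gib = vocab_size * dims * bytes_per_float / 1024**3
print(round(size_gib, 2))  # 1.56 -- close to the ~1.58 GB file on disk
```
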
<h3 id="32-keyedvectors-的操作方法或属性">3.2 KeyedVectors Methods (and Attributes)</h3>
<table>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>KeyedVectors.index_to_key</em></strong></td>
<td>Get all words in the vocabulary.</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.key_to_index</em></strong></td>
<td>Get the word-to-index mapping.</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.vector_size</em></strong></td>
<td>Get the dimensionality of the model&rsquo;s word vectors.</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.get_vector(word)</em></strong></td>
<td>Get the vector for a given word.</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_word(word, topn=10)</em></strong></td>
<td>Get the 10 words most similar to a given word.</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_vector(vector, topn=10)</em></strong></td>
<td>Get the 10 words most similar to a given vector.</td>
</tr>
</tbody>
</table>
<br>
<h3 id="33-词表">3.3 Vocabulary</h3>
<p>Inspect the vocabulary:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">g_wv</span><span class="o">.</span><span class="n">index_to_key</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;被告&#39;,
 &#39;原告&#39;,
 &#39;本院&#39;,
 &#39;公司&#39;,
 &#39;规定&#39;,
 &#39;执行&#39;,
 ...
]
</code></pre></div><br>
<br>
<p>Inspect the word-to-index mapping:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">g_wv</span><span class="o">.</span><span class="n">key_to_index</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;被告&#39;: 0,
 &#39;原告&#39;: 1,
 &#39;本院&#39;: 2,
 &#39;公司&#39;: 3,
 &#39;规定&#39;: 4,
 &#39;执行&#39;: 5,
 ...
}
</code></pre></div><br>
<h3 id="34-查看词向量">3.4 Inspecting Word Vectors</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># get the vector of a single word</span>
<span class="n">g_wv</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;经济&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 2.909250e-01,  9.074450e-01,  5.231860e-01,  5.381490e-01,
       -2.813620e-01,  2.661690e-01,  1.045510e-01, -4.516240e-01,
       -2.186710e-01,  1.867590e-01, -4.870700e-01, -1.803480e-01,
       -6.361140e-01, -8.739630e-01,  3.418450e-01,  7.470900e-02,
        ......
        ......
        2.636230e-01, -2.538920e-01, -2.442900e-02,  5.847510e-01,
        5.135750e-01, -4.009650e-01, -3.629850e-01,  2.332400e-01,
       -3.069630e-01, -4.182810e-01,  3.937240e-01, -8.510000e-01,
        7.894350e-01,  3.969710e-01,  7.895660e-01,  4.881190e-01],
      dtype=float32)
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># get the mean vector of multiple words</span>
<span class="n">g_wv</span><span class="o">.</span><span class="n">get_mean_vector</span><span class="p">([</span><span class="s1">&#39;经济&#39;</span><span class="p">,</span> <span class="s1">&#39;犯罪&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 0.02923387,  0.04620265,  0.03790346,  0.01160904, -0.02162073,
        0.01537724,  0.02025648, -0.03336571, -0.00447518, -0.00529976,
       -0.02856204,  0.01545951,  0.00780857, -0.05398807,  0.02195465,
        0.03140446, -0.02007412,  0.08278576, -0.027172  , -0.00272319,
       ......
        0.0291778 ,  0.03382879, -0.00913138,  0.04487584,  0.06375133,
        0.032144  , -0.02788475,  0.05068161,  0.0122064 ,  0.01759091,
       -0.05560436,  0.00272704, -0.01176615, -0.08875326,  0.00767812,
       -0.00486504,  0.10119167, -0.01212235,  0.06018812,  0.02998512],
      dtype=float32)
</code></pre></div><br>
<h3 id="35-近义词">3.5 Similar Words</h3>
<p>Find the 10 words most similar to a given word:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">g_wv</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;动机&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;主观&#39;, 0.6688777804374695),
 (&#39;意图&#39;, 0.6248725652694702),
 (&#39;恶性&#39;, 0.6005507111549377),
 (&#39;蓄意&#39;, 0.5913136005401611),
 (&#39;卑劣&#39;, 0.5908187627792358),
 (&#39;作案动机&#39;, 0.5703221559524536),
 (&#39;心态&#39;, 0.5640602707862854),
 (&#39;故意&#39;, 0.5533956289291382),
 (&#39;显而易见&#39;, 0.5524264574050903),
 (&#39;恶意&#39;, 0.5509642958641052)]
</code></pre></div><br>
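<p>The scores returned by similar_by_word are cosine similarities between word vectors. A minimal sketch of the computation (the two 4-d vectors below are invented stand-ins, not the real 200-d embeddings):</p>

```python
import numpy as np

def cosine_sim(u, v):
    # cos(theta) = u.v / (|u||v|); similar_by_word ranks the vocabulary by this score
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v_a = [0.2, 0.9, 0.5, 0.1]   # hypothetical stand-in for g_wv.get_vector('动机')
v_b = [0.3, 0.8, 0.4, 0.2]   # hypothetical stand-in for g_wv.get_vector('主观')
print(round(cosine_sim(v_a, v_b), 4))  # close to 1.0: the toy vectors point the same way
```
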
<p>Find the 10 words most similar to a word&rsquo;s vector:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">g_wv</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">g_wv</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;动机&#39;</span><span class="p">),</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;动机&#39;, 0.9999999403953552),
 (&#39;主观&#39;, 0.6688777804374695),
 (&#39;意图&#39;, 0.6248724460601807),
 (&#39;恶性&#39;, 0.600550651550293),
 (&#39;蓄意&#39;, 0.5913134813308716),
 (&#39;卑劣&#39;, 0.5908187627792358),
 (&#39;作案动机&#39;, 0.5703221559524536),
 (&#39;心态&#39;, 0.5640602707862854),
 (&#39;故意&#39;, 0.5533955693244934),
 (&#39;显而易见&#39;, 0.5524263381958008)]
</code></pre></div><br>
<p>Compute the mean vector of several words, then query its neighbors:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">purpose_vector</span> <span class="o">=</span> <span class="n">g_wv</span><span class="o">.</span><span class="n">get_mean_vector</span><span class="p">([</span><span class="s1">&#39;动机&#39;</span><span class="p">,</span>  <span class="s1">&#39;意图&#39;</span><span class="p">,</span> <span class="s1">&#39;目的&#39;</span><span class="p">])</span>
<span class="n">g_wv</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">purpose_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;意图&#39;, 0.9032057523727417),
 (&#39;目的&#39;, 0.8639562726020813),
 (&#39;动机&#39;, 0.8277378678321838),
 (&#39;主观&#39;, 0.7455390095710754),
 (&#39;恶意&#39;, 0.7291366457939148),
 (&#39;故意&#39;, 0.7236210107803345),
 (&#39;客观&#39;, 0.7146263122558594),
 (&#39;企图&#39;, 0.7049675583839417),
 (&#39;行为&#39;, 0.6962229609489441),
 (&#39;掩盖&#39;, 0.6917882561683655),
 (&#39;所谓&#39;, 0.6809536218643188),
 (&#39;并非&#39;, 0.667915403842926),
 (&#39;手段&#39;, 0.6663289666175842),
 (&#39;利益&#39;, 0.6568542718887329),
 (&#39;这种&#39;, 0.6558799743652344),
 (&#39;欺骗&#39;, 0.6545097231864929),
 (&#39;违背&#39;, 0.6538694500923157),
 (&#39;真相&#39;, 0.6527130007743835),
 (&#39;显然&#39;, 0.6525647640228271),
 (&#39;实质&#39;, 0.6521809101104736)]
</code></pre></div><p><br><br></p>
<h2 id="四获取模型">4. Getting the Model</h2>
<p>Creating this content takes real effort, so the downloads below are paid:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 裁判文书-GloVe.200.15.bin   https://pan.baidu.com/s/1a0Fisvnkl8UaQZrHP7olCQ?pwd=8w49

- More pretrained embeddings   https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
</code></pre></div>]]></content:encoded>
    </item>
    
    <item>
      <title>Training Word Vectors by Year (or Province) on the 50M Patent Application Dataset</title>
      <link>https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/</link>
      <pubDate>Fri, 04 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-20-word2vec-by-year-by-province/</guid>
      <description>&lt;p&gt;If you want to train word vectors by year (or by province) on the &lt;a href=&#34;https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/&#34;&gt;China patent application dataset&lt;/a&gt;, read this post carefully; it can save you dozens of hours.
&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一检查数据&#34;&gt;1. Inspecting the Data&lt;/h2&gt;
&lt;p&gt;This dataset is large; as the screenshot shows, individual files run to several GB.&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-data-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;As shared before in &lt;a href=&#34;&#34;&gt;&lt;/a&gt;, when working with a huge csv file we should first learn which fields it contains and what they mean, then read only the fields we need. This eases memory pressure and lets you comfortably handle csv files several times larger than RAM.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# take 山东省.csv as an example; read only the first 5 rows&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;山东省.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-shandong_df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;Not all fields are shown above; the complete field list is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Index([&amp;#39;专利名称&amp;#39;, &amp;#39;专利类型&amp;#39;, &amp;#39;申请人&amp;#39;, &amp;#39;申请人类型&amp;#39;, &amp;#39;申请人地址&amp;#39;, &amp;#39;申请人国家&amp;#39;, &amp;#39;申请人省份&amp;#39;, &amp;#39;申请人城市&amp;#39;,
       &amp;#39;申请人区县&amp;#39;, &amp;#39;申请号&amp;#39;, &amp;#39;申请日&amp;#39;, &amp;#39;申请年份&amp;#39;, &amp;#39;公开公告号&amp;#39;, &amp;#39;公开公告日&amp;#39;, &amp;#39;公开公告年份&amp;#39;, &amp;#39;授权公告号&amp;#39;,
       &amp;#39;授权公告日&amp;#39;, &amp;#39;授权公告年份&amp;#39;, &amp;#39;IPC分类号&amp;#39;, &amp;#39;IPC主分类号&amp;#39;, &amp;#39;发明人&amp;#39;, &amp;#39;摘要文本&amp;#39;, &amp;#39;主权项内容&amp;#39;, &amp;#39;当前权利人&amp;#39;,
       &amp;#39;当前专利权人地址&amp;#39;, &amp;#39;专利权人类型&amp;#39;, &amp;#39;统一社会信用代码&amp;#39;, &amp;#39;引证次数&amp;#39;, &amp;#39;被引证次数&amp;#39;, &amp;#39;自引次数&amp;#39;, &amp;#39;他引次数&amp;#39;,
       &amp;#39;被自引次数&amp;#39;, &amp;#39;被他引次数&amp;#39;, &amp;#39;家族引证次数&amp;#39;, &amp;#39;家族被引证次数&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Training word vectors mainly uses text data; in this case the field we need is [&lt;strong&gt;摘要文本&lt;/strong&gt;].&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二构造语料&#34;&gt;2. Building the Corpus&lt;/h2&gt;
&lt;p&gt;Inside the [5000万专利申请全量数据1985-2025年] folder:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;create two folders: [province_corpus] and [year_corpus]&lt;/li&gt;
&lt;li&gt;create [code.ipynb]&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Building the corpus is not demanding on hardware: it runs on virtually any machine, and the time cost is acceptable.&lt;/p&gt;
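&lt;p&gt;With chunked reading, peak memory scales with chunksize times the average row size rather than with the total file size. A rough estimate (the ~2 KB average row size is an assumption, not measured):&lt;/p&gt;

```python
# approximate peak memory per chunk when reading a csv in batches
chunksize = 100_000        # rows per batch
avg_row_bytes = 2_000      # assumed average row size (~2 KB)
peak_mib = chunksize * avg_row_bytes / 1024**2
print(round(peak_mib))     # about 191 MiB per chunk, regardless of total file size
```
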
&lt;br&gt;
&lt;h3 id=&#34;21-文件树结构&#34;&gt;2.1 File Tree&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;5000万专利申请全量数据1985-2025年
  |---中国专利数据库.csv.gz
  |---code.ipynb
  |---province_corpus
     |---安徽省.txt
     |---浙江省.txt
     |---...
  |---year_corpus
     |---2025.txt
     |---2024.txt
     |---...
  |---provin_w2vs
        |---安徽省-Word2Vec.200.15.bin
        |---山东省-Word2Vec.200.15.bin
        |---...
  |---year_w2vs
        |---2025-Word2Vec.200.15.bin
        |---2022-Word2Vec.100.6.bin.syn1neg.npy
        |---2022-Word2Vec.100.6.bin.wv.vectors.npy
        |---...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-构造语料代码&#34;&gt;2.2 Corpus-Building Code&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pathlib&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;tqdm&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# read the csv, keeping only the needed columns&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# read in 100000-row chunks to avoid memory overflow&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中国专利数据库.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                 &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                 &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;申请人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;申请日&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;专利名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;摘要文本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                 &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;申请日&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;申请日&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

    &lt;span class=&#34;c1&#34;&gt;# create the province_corpus and year_corpus folders&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;province_dir&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;province_corpus&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;province_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mkdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parents&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;exist_ok&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;year_dir&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year_corpus&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;year_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mkdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parents&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;exist_ok&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;# build corpora by province and by year&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;申请日&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;YE&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;year_file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_dir&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;.txt&amp;#34;&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;a+&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;yf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;y_text_series&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;摘要文本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;y_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y_text_series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;yf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;申请人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;prov_file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;province_dir&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;.txt&amp;#34;&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;a+&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;prov_text_series&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;摘要文本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;prov_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_text_series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;pf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2025
2024
...


上海市
云南省
...
安徽省

CPU times: total: 27min 55s
Wall time: 39min 10s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;构造语料用时约 40 分钟，得到 province_corpus 和 year_corpus 两个文件夹。
&lt;img loading=&#34;lazy&#34; src=&#34;img/03-province-corpus.png&#34; alt=&#34;&#34;  /&gt;

&lt;img loading=&#34;lazy&#34; src=&#34;img/03-year-corpus.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三训练word2vec&#34;&gt;三、训练word2vec&lt;/h2&gt;
&lt;p&gt;需要注意，训练 word2vec 需要耗费很大的计算能力，训练时间可能长达数小时。本文使用的是 &lt;em&gt;&lt;strong&gt;cntext 2.1.5&lt;/strong&gt;&lt;/em&gt; 版本。&lt;/p&gt;
&lt;h3 id=&#34;31-安装cntext&#34;&gt;3.1 安装cntext&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;cd desktop
pip install cntext --upgrade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-开始训练&#34;&gt;3.2 开始训练&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;glob&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pathlib&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 分年份训练&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_f&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year_corpus/*.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 训练word2vec，自动保存到output文件夹内&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;only_binary&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 将output文件夹重命名为year_w2vs&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;year_w2vs&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 分省份训练&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_f&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;province_corpus/*.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 训练word2vec，自动保存到output文件夹内&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;n&#34;&gt;only_binary&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;span class=&#34;c1&#34;&gt;# 将output文件夹重命名为province_w2vs&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;province_w2vs&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Windows System, Unable Parallel Processing
Cache output\1985_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|████████████████████████████████████████████████████████| 10009/10009 [00:13&amp;lt;00:00, 734.66it/s]
Reading Preprocessed Corpus from output\1985_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 17 s. 
Output Saved To: output\1985-Word2Vec.200.15.bin

......
......

Windows System, Unable Parallel Processing
Cache output\2025_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|████████████████████████████████████████████████████████| 10009/10009 [00:13&amp;lt;00:00, 734.66it/s]
Reading Preprocessed Corpus from output\2025_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 17 s. 
Output Saved To: output\2025-Word2Vec.200.15.bin


Windows System, Unable Parallel Processing
Cache output\上海市_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|██████████████████████████████████████████████████| 2456943/2456943 [03:42&amp;lt;00:00, 11048.35it/s]
Reading Preprocessed Corpus from output\上海市_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 1400 s. 
Output Saved To: output\上海市-Word2Vec.200.15.bin
......
......

Windows System, Unable Parallel Processing
Cache output\黑龙江省_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|█████████████████████████████████████████████████████| 544329/544329 [01:07&amp;lt;00:00, 8114.12it/s]
Reading Preprocessed Corpus from output\黑龙江省_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 320 s. 
Output Saved To: output\黑龙江省-Word2Vec.200.15.bin


CPU times: total: 21354 s
Wall time: 21758 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;整个训练过程大概用了 6 小时，模型文件分别保存在 &lt;em&gt;&lt;strong&gt;province_w2vs&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;year_w2vs&lt;/strong&gt;&lt;/em&gt; 文件夹内。
&lt;img loading=&#34;lazy&#34; src=&#34;img/05-province-w2vs.png&#34; alt=&#34;&#34;  /&gt;

&lt;img loading=&#34;lazy&#34; src=&#34;img/06-years-w2vs.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三使用word2vec&#34;&gt;四、使用word2vec&lt;/h2&gt;
&lt;h3 id=&#34;31-导入模型&#34;&gt;4.1 导入模型&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;province_w2vs&lt;/code&gt; 和 &lt;code&gt;year_w2vs&lt;/code&gt; 内有多个模型，单个模型文件大约几十 M ~ 几百 M。&lt;strong&gt;如内存有限，不建议一次性全部导入&lt;/strong&gt;。大邓的电脑内存 96G，为了省事，就一次性全导入了。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;glob&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;tqdm&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 导入各省份词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;provin_w2vs_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;provin_w2v_fs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;province_w2vs/*.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;provin_w2vs_&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 导入各年份词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;year_w2vs_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;year_w2v_fs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;glob&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year_w2vs/*.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tqdm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;year_w2vs_&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;  3%|██▍                                                                                | 1/34 [00:03&amp;lt;01:57,  3.57s/it]
Loading province_w2vs\上海市-Word2Vec.200.15.bin...
  6%|████▉                                                                              | 2/34 [00:04&amp;lt;01:11,  2.23s/it]
Loading province_w2vs\云南省-Word2Vec.200.15.bin...
......
97%|███████████████████████████████████████████████████████████████████████████████▌  | 33/34 [01:07&amp;lt;00:01,  1.10s/it]
Loading province_w2vs\香港特别行政区-Word2Vec.200.15.bin...
100%|██████████████████████████████████████████████████████████████████████████████████| 34/34 [01:09&amp;lt;00:00,  2.04s/it]
Loading province_w2vs\黑龙江省-Word2Vec.200.15.bin...


  2%|██                                                                                 | 1/41 [00:00&amp;lt;00:05,  7.10it/s]
Loading year_w2vs\1985-Word2Vec.200.15.bin...
Loading year_w2vs\1986-Word2Vec.200.15.bin...
 10%|████████                                                                           | 4/41 [00:00&amp;lt;00:05,  6.80it/s]
 ......
 100%|██████████████████████████████████████████████████████████████████████████████████| 41/41 [01:11&amp;lt;00:00,  1.75s/it]
Loading year_w2vs\2025-Word2Vec.200.15.bin...

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
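&lt;p&gt;如果内存不足以同时容纳所有模型，可以改为按需逐个加载、处理完即释放。下面是一个示意写法（其中 iter_models 是假设的辅助函数，加载函数以参数传入，例如 ct.load_w2v）：&lt;/p&gt;

```python
import glob
from pathlib import Path

def iter_models(pattern, load_fn):
    """逐个产出 (名称, 模型)，避免一次性把所有 .bin 文件载入内存。"""
    for f in sorted(glob.glob(pattern)):
        # 文件名形如 '上海市-Word2Vec.200.15.bin'，取 '-' 前的部分作为名称
        name = Path(f).stem.split('-')[0]
        yield name, load_fn(f)

# 用法示意：
# import cntext as ct
# for prov, w2v in iter_models('province_w2vs/*.bin', ct.load_w2v):
#     ...  # 处理完一个省份后，模型对象即可被垃圾回收
```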
&lt;h3 id=&#34;32-查看词汇量&#34;&gt;4.2 查看词汇量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pathlib&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;省份Word2vec词汇量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2vs_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;province&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stem&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;province&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; 词汇量: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;省份Word2vec词汇量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;上海市&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;640941&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;云南省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;205193&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;内蒙古自治区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;138507&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;北京市&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;783162&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;台湾省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;242630&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;吉林省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;185587&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;四川省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;494241&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;天津市&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;373286&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;宁夏回族自治区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;91592&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;安徽省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;540111&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;山东省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;722886&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;山西省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;188013&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;广东省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1010230&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;广西壮族自治区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;190128&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;新疆维吾尔自治区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;110063&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;江苏省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;983871&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;江西省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;256695&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;河北省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;326042&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;河南省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;415905&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;浙江省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;795041&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;海南省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;74657&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;湖北省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;412827&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;湖南省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;400262&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;澳门特别行政区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;7806&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;甘肃省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;148753&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;福建省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;480456&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;西藏自治区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;23115&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;贵州省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;186345&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;辽宁省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;347563&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;重庆市&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;358991&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;陕西省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;381781&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;青海省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;53325&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;香港特别行政区&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;71947&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;黑龙江省&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;词汇量&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;253129&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;年份word2vec词汇量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2vs_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stem&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index_to_key&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;年份word2vec词汇量
1985: 15494
1986: 17945
1987: 23625
1988: 27740
1989: 27394
1990: 32920
1991: 37584
1992: 45393
1993: 48326
1994: 46725
1995: 46138
1996: 50117
1997: 53625
1998: 57187
1999: 65154
2000: 78368
2001: 95927
2002: 123513
2003: 145087
2004: 158694
2005: 185840
2006: 215856
2007: 240167
2008: 279364
2009: 334179
2010: 382888
2011: 449648
2012: 508506
2013: 621644
2014: 625248
2015: 685487
2016: 732443
2017: 760332
2018: 776968
2019: 789104
2020: 817553
2021: 799388
2022: 734045
2023: 596784
2024: 516263
2025: 21230
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-语义检查-省份&#34;&gt;3.3 Semantic check: provinces&lt;/h3&gt;
&lt;p&gt;Start with the province models: look up the 5 words most similar to [&amp;lsquo;创新&amp;rsquo;, &amp;lsquo;新颖&amp;rsquo;], and use how accurately the semantics are captured to judge, roughly, how well each Word2Vec model was trained.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2vs_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;province&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;provin_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stem&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;wordweigths&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provin_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;新颖&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;p&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;wordweigths&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;province&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;province&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;: NA&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;上海市: 独特 巧妙 创新性 理念 新颖结构合理
云南省: 独特 巧妙 精巧 科学合理 新颖合理
内蒙古自治区: 理念 独特 巧妙 合理使用方便 全新
北京市: 巧妙 独特 全新 借鉴 新颖使用方便
台湾省: 全新 独特 精巧 简洁 巧妙
吉林省: 理念 思路 现代 全新 巧妙
四川省: 巧妙 独特 全新 理念 合理使用方便
天津市: 独特 巧妙 合理使用方便 巧妙使用方便 全新
宁夏回族自治区: 更具 多样 丰富 性价比 市场前景
安徽省: 巧妙 独特 巧妙使用方便 合理 合理结构紧凑
山东省: 巧妙 精巧 新颖结构合理 巧妙结构合理 全新
山西省: 独特 全新 科学 现代 已有
广东省: 巧妙 独特 创新性 合理 精巧
广西壮族自治区: 独特 巧妙 合理使用方便 精巧 合理实用性
新疆维吾尔自治区: 合理使用方便 巧妙 简单合理 科学合理 合理
江苏省: 巧妙 独特 合理 全新 科学
江西省: 独特 科学合理 巧妙 简洁 精巧
河北省: 巧妙 新颖使用方便 精巧 独特 理念
河南省: 巧妙 科学合理 独特 新颖使用方便 巧妙使用方便
浙江省: 独特 巧妙 科学 精巧 合理
海南省: 思路 科学合理 人性化 科学 独特
湖北省: 巧妙 巧妙合理 科学合理 独特 新颖结构合理
湖南省: 巧妙 精巧 独特 新颖独特 巧妙结构合理
澳门特别行政区: 撞击 边坡 溜槽 耐高温 材料制成
甘肃省: 独特 全新 理念 现代 普及
福建省: 巧妙 新颖使用方便 独特 巧妙结构合理 全新
西藏自治区: 既能 十分 疲劳 范围广 更加人性化
贵州省: 巧妙 独特 科学合理 合理使用方便 精巧
辽宁省: 巧妙 巧妙结构合理 新颖结构合理 新颖独特 独创
重庆市: 合理使用方便 科学合理 精巧 全新 新颖使用方便
陕西省: 巧妙 独特 新颖结构合理 合理 合理使用方便
青海省: 突破 织机 现行 经济环保 杆织机
香港特别行政区: 全新 更具 美感 市场 丰富
黑龙江省: 独特 精巧 科学合理 巧妙 构思
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Judging from the output above, apart from a few exceptions (青海省, and arguably 澳门特别行政区 and 西藏自治区), the Word2Vec models of the great majority of provinces capture the semantics of the patent abstracts accurately.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;34-语义检查-年份&#34;&gt;3.4 Semantic check: years&lt;/h3&gt;
&lt;p&gt;Likewise for the yearly models: look up the 5 words most similar to [&amp;lsquo;创新&amp;rsquo;, &amp;lsquo;新颖&amp;rsquo;] to get a rough sense of training quality.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_fs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2vs_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_w2v_f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stem&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;wordweigths&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;新颖&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;p&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;wordweigths&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;: NA&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1985: 公知 专门 特别适用 提出一种 机器人
1986: 植入 理发 135 婴儿车 街道
1987: 落后 儿童智力开发 低档 胶鞋 指甲
1988: 目前市场 低档 捕鱼 证件 普遍使用
1989: 价廉物美 课堂教学 普及型 大众 得心应手
1990: 单纯 精简 多方面 机等 应用领域
1991: 普及型 前途 大众 现代 现代化
1992: 构思 保留传统 不失为 全新 机之
1993: 完美 现代科技 崭新 式样 边墙
1994: 样式 别致 造型新颖 显得 华贵
1995: 多样 结实耐用 独特 别致 高档
1996: 花样 耐冲击浮标 极其 式样新颖 应用范围
1997: 实用美观 室内外装饰 标准化 多变 形象逼真
1998: 现代 改革 开发 高雅 创造
1999: 现代 市场 越来越 款式 大方
2000: 全新 娱乐性趣味性 多样化 多方面 各种各样
2001: 完美 现代 新颖别致 多样化 体现
2002: 全新 现代 实为 科学 大众
2003: 全新 突破传统 多样化 体现 科学
2004: 全新 科技 现代 市场 科学合理
2005: 创意 理念 全新 科学 现代
2006: 全新 构思 理念 新颖性 独特
2007: 全新 突破传统 巧妙 独特 现代
2008: 独特 巧妙 全新 新颖独特 设计理念
2009: 独特 巧妙 全新 科学 新颖独特
2010: 独特 新颖独特 精巧 巧妙 科学合理
2011: 独特 精巧 新颖独特 科学合理 巧妙
2012: 独特 巧妙 新颖独特 精巧 科学合理
2013: 独特 新颖独特 精巧 科学合理 巧妙
2014: 巧妙 独特 科学合理 巧妙合理 精巧
2015: 巧妙 独特 巧妙合理 新颖结构合理 科学合理
2016: 巧妙 结合现在 巧妙合理 独特 全新
2017: 巧妙 科学合理 独特 科学 合理
2018: 巧妙 独特 合理 科学 全新
2019: 巧妙 独特 合理 全新 精巧
2020: 巧妙 独特 合理 精巧 设计
2021: 巧妙 精巧 合理 新颖结构合理 全新
2022: 巧妙 全新 独特 创新性 精巧
2023: 巧妙 独特 全新 创新性 精巧
2024: 巧妙 创新性 独特 精巧 全新
2025: 相结合 双重 独特 多重 优势
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I also tried other query words; the models from 1998 onward appear to capture semantics accurately.&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;四研究潜力-语义变迁研究方法介绍&#34;&gt;四、Research potential: methods for studying semantic change&lt;/h2&gt;
&lt;p&gt;Assuming the semantics are reliable, these models can be used to study &lt;strong&gt;semantic change&lt;/strong&gt; over time and &lt;strong&gt;semantic differences&lt;/strong&gt; across regions. Note, however, that you cannot directly compare the distance between word1 and word2 across two years or two provinces, because each model lives in its own arbitrary coordinate space. To compare provinces, or to track one province over time, you first need an &lt;strong&gt;alignment algorithm&lt;/strong&gt;. The common choice is &lt;strong&gt;orthogonal Procrustes alignment&lt;/strong&gt;, which places the word2vec models of different years for the same province, or of different provinces for the same year, into a shared semantic space.&lt;/p&gt;
&lt;h3 id=&#34;41-正交procrustes算法&#34;&gt;4.1 The orthogonal Procrustes algorithm&lt;/h3&gt;
&lt;p&gt;Orthogonal Procrustes alignment maps the word-vector matrices of two trained models into the same semantic space. Concretely, it computes an orthogonal matrix that linearly transforms one matrix so that the Frobenius norm of the difference between the two matrices is minimized. cntext2.x provides this as &lt;em&gt;&lt;strong&gt;ct.procrustes_align()&lt;/strong&gt;&lt;/em&gt;; for details see the &lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;cntext2.x user manual&lt;/a&gt;&lt;/p&gt;
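&lt;p&gt;As a rough numpy sketch of the idea (this is not the cntext implementation; the function name is illustrative), the optimal rotation comes from an SVD of the cross-covariance of the two matrices, whose rows must hold vectors for the same words in the same order:&lt;/p&gt;

```python
import numpy as np

def procrustes_align(base: np.ndarray, other: np.ndarray) -> np.ndarray:
    """Rotate `other` into the coordinate space of `base`.

    Solves min_R ||other @ R - base||_F over orthogonal matrices R.
    Both matrices must hold row vectors for the same words, same order.
    """
    # SVD of the cross-covariance matrix yields the optimal rotation
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

# Toy check: a randomly rotated copy of a matrix aligns back onto it
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 20))
q, _ = np.linalg.qr(rng.normal(size=(20, 20)))  # random orthogonal matrix
other = base @ q
aligned = procrustes_align(base, other)
print(np.allclose(aligned, base))  # True
```

&lt;p&gt;With real models, first intersect the two vocabularies and stack the shared words&amp;rsquo; vectors in identical order before solving for the rotation.&lt;/p&gt;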
&lt;br&gt;
&lt;h3 id=&#34;42-语义变迁流程图&#34;&gt;4.2 Semantic-change workflow&lt;/h3&gt;
&lt;p&gt;For a reference workflow for studies of semantic change, see &lt;a href=&#34;https://github.com/Living-with-machines/DiachronicEmb-BigHistData&#34;&gt;DiachronicEmb-BigHistData&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/w2v-time-shifting-flowchart.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;In &lt;a href=&#34;https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/&#34;&gt;可视化 | 人民日报语料反映七十年文化演变&lt;/a&gt; I implemented diachronic semantic alignment, showing the shifts in Chinese society&amp;rsquo;s collective cognition across seventy years.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-识别语义变化时间点&#34;&gt;4.3 Detecting semantic-change time points&lt;/h3&gt;
&lt;p&gt;That project trains word vectors on data from 1800 to 1910 in 10-year slices and studies how word meanings change. Taking &lt;em&gt;&lt;strong&gt;railway&lt;/strong&gt;&lt;/em&gt; and &lt;em&gt;&lt;strong&gt;traffic&lt;/strong&gt;&lt;/em&gt; as examples, cosine similarity is first used to detect the time points at which a word&amp;rsquo;s meaning shifted, as shown in the figure below.&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/consine-sim-cpdetection.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
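&lt;p&gt;The detection step itself is simple: once the time slices are aligned into one space, compute the cosine similarity of a word&amp;rsquo;s vector between adjacent periods and flag the periods where it drops sharply. A minimal sketch with toy vectors (the threshold and names are illustrative, not taken from that project):&lt;/p&gt;

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def change_points(series, threshold=0.5):
    """series: list of (period_label, vector) for one word, with all
    models already Procrustes-aligned into a shared space. Returns the
    labels of periods whose similarity to the previous period falls
    below `threshold`."""
    flagged = []
    for (_, prev), (label, cur) in zip(series, series[1:]):
        if cosine(prev, cur) < threshold:
            flagged.append(label)
    return flagged

# Toy data: a stable meaning for three decades, then a sharp shift
v = np.array([1.0, 0.0, 0.0])
series = [("1870s", v), ("1880s", v), ("1890s", v + 0.05),
          ("1900s", np.array([0.1, 1.0, 0.0]))]
print(change_points(series))  # ['1900s']
```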
&lt;br&gt;
&lt;h3 id=&#34;44-绘制语义变化轨迹&#34;&gt;4.4 Plotting a semantic trajectory&lt;/h3&gt;
&lt;p&gt;The semantic-change trajectory of &lt;em&gt;&lt;strong&gt;railway&lt;/strong&gt;&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/railway-time-shifting.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五获取资源&#34;&gt;五、Getting the resources&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- Free  专利摘要-Word2Vec.200.15.bin  https://pan.baidu.com/s/1CgBjy96hDKM2GKQY4G6kYA?pwd=ba92

- Free  province_w2vs                https://pan.baidu.com/s/1eBFTIZcv2DWssLiaRnCqZQ?pwd=ikpu

- Free  year_w2vs                    https://pan.baidu.com/s/1lrVkML92cVJdHQa1HQyAwA?pwd=4gqa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;Citing cntext&lt;/h2&gt;
&lt;p&gt;If you use cntext in research or a project, please mention it in the text and include a citation. For a suggested format, see the &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;recommended citation format for cntext&lt;/a&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<p>If you want to train word vectors by year (or by province) on the <a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/">中国专利申请数据集</a> (Chinese patent application dataset), read this post closely; it can save you dozens of hours.
<br><br></p>
<h2 id="一检查数据">一、Inspect the data</h2>
<p>This dataset is big; as the screenshot shows, individual files easily run to several GB.</p>
<p><img loading="lazy" src="img/01-data-screen.png" alt=""  />
</p>
<p>As shared previously, when facing a huge csv file you should first find out which columns it contains and what they mean, then read only the columns you need. This reduces memory pressure and lets you comfortably handle csv files several times larger than your RAM.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># Take 山东省.csv as an example and read only the first 5 rows</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;山东省.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-shandong_df.png" alt=""  />
</p>
<br>
<p>The display does not show every column; the full list is</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;专利名称&#39;, &#39;专利类型&#39;, &#39;申请人&#39;, &#39;申请人类型&#39;, &#39;申请人地址&#39;, &#39;申请人国家&#39;, &#39;申请人省份&#39;, &#39;申请人城市&#39;,
       &#39;申请人区县&#39;, &#39;申请号&#39;, &#39;申请日&#39;, &#39;申请年份&#39;, &#39;公开公告号&#39;, &#39;公开公告日&#39;, &#39;公开公告年份&#39;, &#39;授权公告号&#39;,
       &#39;授权公告日&#39;, &#39;授权公告年份&#39;, &#39;IPC分类号&#39;, &#39;IPC主分类号&#39;, &#39;发明人&#39;, &#39;摘要文本&#39;, &#39;主权项内容&#39;, &#39;当前权利人&#39;,
       &#39;当前专利权人地址&#39;, &#39;专利权人类型&#39;, &#39;统一社会信用代码&#39;, &#39;引证次数&#39;, &#39;被引证次数&#39;, &#39;自引次数&#39;, &#39;他引次数&#39;,
       &#39;被自引次数&#39;, &#39;被他引次数&#39;, &#39;家族引证次数&#39;, &#39;家族被引证次数&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><br>
<p>Training word vectors mainly uses text data; in this case the text fields we need are [<strong>专利名称</strong>] and [<strong>摘要文本</strong>].</p>
<p><br><br></p>
<h2 id="二构造语料">二、Build the corpora</h2>
<p>Inside the [5000万专利申请全量数据1985-2025年] folder:</p>
<ol>
<li>create two folders, [province_corpus] and [year_corpus]</li>
<li>create a notebook, [code.ipynb]</li>
</ol>
<p>Building the corpora is undemanding on hardware: it runs on virtually any machine, and the time cost is acceptable.</p>
<br>
<h3 id="21-文件树结构">2.1 File tree</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5000万专利申请全量数据1985-2025年
  |---中国专利数据库.csv.gz
  |---code.ipynb
  |---province_corpus
     |---安徽省.txt
     |---浙江省.txt
     |---...
  |---year_corpus
     |---2025.txt
     |---2024.txt
     |---...
  |---province_w2vs
        |---安徽省-Word2Vec.200.15.bin
        |---山东省-Word2Vec.200.15.bin
        |---...
  |---year_w2vs
        |---2025-Word2Vec.200.15.bin
        |---2022-Word2Vec.100.6.bin.syn1neg.npy
        |---2022-Word2Vec.100.6.bin.wv.vectors.npy
        |---...
</code></pre></div><br>
<h3 id="22-构造语料代码">2.2 Corpus-building code</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="c1"># Read the csv, keeping only the columns we need;</span>
<span class="c1"># read in chunks of 100,000 rows to avoid running out of memory</span>
<span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span> 
                 <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                 <span class="n">usecols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">,</span> <span class="s1">&#39;申请日&#39;</span><span class="p">,</span> <span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">],</span>
                 <span class="n">chunksize</span><span class="o">=</span><span class="mi">100000</span><span class="p">)</span>

<span class="k">for</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">chunk_dfs</span><span class="p">):</span>
    <span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">chunk_df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">])</span>

    <span class="c1"># Create the province_corpus and year_corpus folders</span>
    <span class="n">province_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;province_corpus&#39;</span><span class="p">)</span>
    <span class="n">province_dir</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">year_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;year_corpus&#39;</span><span class="p">)</span>
    <span class="n">year_dir</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    
    <span class="c1"># Build corpora grouped by year, then by province</span>
    <span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">chunk_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;申请日&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
        <span class="n">year_file</span> <span class="o">=</span> <span class="n">year_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="si">}</span><span class="s2">.txt&#34;</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">year_file</span><span class="p">,</span> <span class="s1">&#39;a+&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">yf</span><span class="p">:</span>
            <span class="n">y_text_series</span> <span class="o">=</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
            <span class="n">y_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">y_text_series</span><span class="p">)</span>
            <span class="n">yf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">y_text</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">prov</span><span class="p">,</span> <span class="n">prov_df</span> <span class="ow">in</span> <span class="n">chunk_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">prov</span><span class="p">)</span>
        <span class="n">prov_file</span> <span class="o">=</span> <span class="n">province_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">prov</span><span class="si">}</span><span class="s2">.txt&#34;</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">prov_file</span><span class="p">,</span> <span class="s1">&#39;a+&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">pf</span><span class="p">:</span>
            <span class="n">prov_text_series</span> <span class="o">=</span> <span class="n">prov_df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">prov_df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
            <span class="n">prov_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">prov_text_series</span><span class="p">)</span>
            <span class="n">pf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">prov_text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2025
2024
...


上海市
云南省
...
安徽省

CPU times: total: 27min 55s
Wall time: 39min 10s
</code></pre></div><p>Building the corpora took about 40 minutes and produced the province_corpus and year_corpus folders.
<img loading="lazy" src="img/03-province-corpus.png" alt=""  />

<img loading="lazy" src="img/03-year-corpus.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三训练word2vec">三、Train word2vec</h2>
<p>Note that training word2vec demands a lot of compute; expect the full run to take several hours. This post uses <em><strong>cntext2.1.5</strong></em>.</p>
<h3 id="31-安装cntext">3.1 Install cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip install cntext --upgrade
</code></pre></div><br>
<h3 id="32-开始训练">3.2 Start training</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># Train one model per year</span>
<span class="k">for</span> <span class="n">year_f</span> <span class="ow">in</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;year_corpus/*.txt&#39;</span><span class="p">):</span>
    <span class="c1"># Train word2vec; the model is saved automatically to the output folder</span>
    <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span> <span class="o">=</span> <span class="n">year_f</span><span class="p">,</span> 
                <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> 
                <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> 
                <span class="n">only_binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Rename the output folder to year_w2vs</span>
<span class="n">os</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">&#39;output&#39;</span><span class="p">,</span> <span class="s1">&#39;year_w2vs&#39;</span><span class="p">)</span>


<span class="c1"># Train one model per province</span>
<span class="k">for</span> <span class="n">prov_f</span> <span class="ow">in</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;province_corpus/*.txt&#39;</span><span class="p">):</span>
    <span class="c1"># Train word2vec; the model is saved automatically to the output folder</span>
    <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span> <span class="o">=</span> <span class="n">prov_f</span><span class="p">,</span> 
                <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> 
                <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> 
                <span class="n">only_binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> 
<span class="c1"># Rename the output folder to province_w2vs</span>
<span class="n">os</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">&#39;output&#39;</span><span class="p">,</span> <span class="s1">&#39;province_w2vs&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Windows System, Unable Parallel Processing
Cache output\1985_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|████████████████████████████████████████████████████████| 10009/10009 [00:13&lt;00:00, 734.66it/s]
Reading Preprocessed Corpus from output\1985_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 17 s. 
Output Saved To: output\1985-Word2Vec.200.15.bin

......
......

Windows System, Unable Parallel Processing
Cache output\2025_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|████████████████████████████████████████████████████████| 10009/10009 [00:13&lt;00:00, 734.66it/s]
Reading Preprocessed Corpus from output\2025_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 17 s. 
Output Saved To: output\2025-Word2Vec.200.15.bin


Windows System, Unable Parallel Processing
Cache output\上海市_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|██████████████████████████████████████████████████| 2456943/2456943 [03:42&lt;00:00, 11048.35it/s]
Reading Preprocessed Corpus from output\上海市_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 1400 s. 
Output Saved To: output\上海市-Word2Vec.200.15.bin
......
......

Windows System, Unable Parallel Processing
Cache output\黑龙江省_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|█████████████████████████████████████████████████████| 544329/544329 [01:07&lt;00:00, 8114.12it/s]
Reading Preprocessed Corpus from output\黑龙江省_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 320 s. 
Output Saved To: output\黑龙江省-Word2Vec.200.15.bin


CPU times: total: 21354 s
Wall time: 21758 s
</code></pre></div><p>训练省份和年份词向量总共用了约 6 小时，模型文件分别保存在 <em><strong>province_w2vs</strong></em> 和 <em><strong>year_w2vs</strong></em> 文件夹内。
<img loading="lazy" src="img/05-province-w2vs.png" alt=""  />

<img loading="lazy" src="img/06-years-w2vs.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三使用word2vec">三、使用word2vec</h2>
<h3 id="31-导入模型">3.1 导入模型</h3>
<p><code>province_w2vs</code> 和 <code>year_w2vs</code> 文件夹内有多个模型，单个模型文件大约几十M ~ 几百M，<strong>内存有限的话不建议一次性全部导入</strong>。大邓的电脑内存 96G，为了省事，就一次性全导入了。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="c1"># 导入各省份词向量</span>
<span class="n">provin_w2vs_</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">provin_w2v_fs</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;province_w2vs/*.bin&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">provin_w2v_f</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">provin_w2v_fs</span><span class="p">):</span>
    <span class="n">provin_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">provin_w2v_f</span><span class="p">)</span>
    <span class="n">provin_w2vs_</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">provin_w2v</span><span class="p">)</span>


<span class="c1"># 导入各年份词向量</span>
<span class="n">year_w2vs_</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">year_w2v_fs</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;year_w2vs/*.bin&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">year_w2v_f</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">year_w2v_fs</span><span class="p">):</span>
    <span class="n">year_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">year_w2v_f</span><span class="p">)</span>
    <span class="n">year_w2vs_</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">year_w2v</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">  3%|██▍                                                                                | 1/34 [00:03&lt;01:57,  3.57s/it]
Loading province_w2vs\上海市-Word2Vec.200.15.bin...
  6%|████▉                                                                              | 2/34 [00:04&lt;01:11,  2.23s/it]
Loading province_w2vs\云南省-Word2Vec.200.15.bin...
......
97%|███████████████████████████████████████████████████████████████████████████████▌  | 33/34 [01:07&lt;00:01,  1.10s/it]
Loading province_w2vs\香港特别行政区-Word2Vec.200.15.bin...
100%|██████████████████████████████████████████████████████████████████████████████████| 34/34 [01:09&lt;00:00,  2.04s/it]
Loading province_w2vs\黑龙江省-Word2Vec.200.15.bin...


  2%|██                                                                                 | 1/41 [00:00&lt;00:05,  7.10it/s]
Loading year_w2vs\1985-Word2Vec.200.15.bin...
Loading year_w2vs\1986-Word2Vec.200.15.bin...
 10%|████████                                                                           | 4/41 [00:00&lt;00:05,  6.80it/s]
 ......
 100%|██████████████████████████████████████████████████████████████████████████████████| 41/41 [01:11&lt;00:00,  1.75s/it]
Loading year_w2vs\2025-Word2Vec.200.15.bin...

</code></pre></div><br>
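<p>上面是一次性全部载入。如果内存有限，可以改用生成器按需逐个加载、用完即释放。下面是一个示意写法（<code>iter_w2v</code> 为本文示意用的辅助函数名，<code>loader</code> 参数可传入 <code>ct.load_w2v</code>）：</p>

```python
import glob
from pathlib import Path

def iter_w2v(folder, loader):
    """逐个加载文件夹内的词向量模型, 同一时刻内存中只保留一个模型"""
    for f in sorted(glob.glob(f'{folder}/*.bin')):
        # 文件名形如 上海市-Word2Vec.200.15.bin, 取 '-' 前的部分作为名称
        name = Path(f).stem.split('-')[0]
        yield name, loader(f)
```

<p>使用时形如 <code>for province, w2v in iter_w2v('province_w2vs', ct.load_w2v): ...</code>，每次循环结束后，上一个模型即可被垃圾回收。</p>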
<h3 id="32-查看词汇量">3.2 查看词汇量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;省份Word2vec词汇量&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">provin_w2v_f</span><span class="p">,</span> <span class="n">provin_w2v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">provin_w2v_fs</span><span class="p">,</span> <span class="n">provin_w2vs_</span><span class="p">):</span>
    <span class="n">province</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">provin_w2v_f</span><span class="p">)</span><span class="o">.</span><span class="n">stem</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">province</span><span class="si">}</span><span class="s1"> 词汇量: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">provin_w2v</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">省份Word2vec词汇量
上海市 词汇量: 640941
云南省 词汇量: 205193
内蒙古自治区 词汇量: 138507
北京市 词汇量: 783162
台湾省 词汇量: 242630
吉林省 词汇量: 185587
四川省 词汇量: 494241
天津市 词汇量: 373286
宁夏回族自治区 词汇量: 91592
安徽省 词汇量: 540111
山东省 词汇量: 722886
山西省 词汇量: 188013
广东省 词汇量: 1010230
广西壮族自治区 词汇量: 190128
新疆维吾尔自治区 词汇量: 110063
江苏省 词汇量: 983871
江西省 词汇量: 256695
河北省 词汇量: 326042
河南省 词汇量: 415905
浙江省 词汇量: 795041
海南省 词汇量: 74657
湖北省 词汇量: 412827
湖南省 词汇量: 400262
澳门特别行政区 词汇量: 7806
甘肃省 词汇量: 148753
福建省 词汇量: 480456
西藏自治区 词汇量: 23115
贵州省 词汇量: 186345
辽宁省 词汇量: 347563
重庆市 词汇量: 358991
陕西省 词汇量: 381781
青海省 词汇量: 53325
香港特别行政区 词汇量: 71947
黑龙江省 词汇量: 253129
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;年份word2vec词汇量&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">year_w2v_f</span><span class="p">,</span> <span class="n">year_w2v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">year_w2v_fs</span><span class="p">,</span> <span class="n">year_w2vs_</span><span class="p">):</span>
    <span class="n">year</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">year_w2v_f</span><span class="p">)</span><span class="o">.</span><span class="n">stem</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">year_w2v</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">年份word2vec词汇量
1985: 15494
1986: 17945
1987: 23625
1988: 27740
1989: 27394
1990: 32920
1991: 37584
1992: 45393
1993: 48326
1994: 46725
1995: 46138
1996: 50117
1997: 53625
1998: 57187
1999: 65154
2000: 78368
2001: 95927
2002: 123513
2003: 145087
2004: 158694
2005: 185840
2006: 215856
2007: 240167
2008: 279364
2009: 334179
2010: 382888
2011: 449648
2012: 508506
2013: 621644
2014: 625248
2015: 685487
2016: 732443
2017: 760332
2018: 776968
2019: 789104
2020: 817553
2021: 799388
2022: 734045
2023: 596784
2024: 516263
2025: 21230
</code></pre></div><br>
<h3 id="33-语义检查-省份">3.3 语义检查-省份</h3>
<p>先检查各省份模型：查看与 [&lsquo;创新&rsquo;, &lsquo;新颖&rsquo;] 最相似的 5 个词，根据语义捕捉是否准确，粗略判断 Word2vec 的训练效果。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">provin_w2v_f</span><span class="p">,</span> <span class="n">provin_w2v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">provin_w2v_fs</span><span class="p">,</span> <span class="n">provin_w2vs_</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">province</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">provin_w2v_f</span><span class="p">)</span><span class="o">.</span><span class="n">stem</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">wordweigths</span> <span class="o">=</span> <span class="n">provin_w2v</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;新颖&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
        <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span><span class="p">,</span><span class="n">p</span> <span class="ow">in</span> <span class="n">wordweigths</span><span class="p">]</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">province</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="s2">&#34; &#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>  <span class="c1"># 查询词不在该模型词表中</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">province</span><span class="si">}</span><span class="s1">: NA&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">上海市: 独特 巧妙 创新性 理念 新颖结构合理
云南省: 独特 巧妙 精巧 科学合理 新颖合理
内蒙古自治区: 理念 独特 巧妙 合理使用方便 全新
北京市: 巧妙 独特 全新 借鉴 新颖使用方便
台湾省: 全新 独特 精巧 简洁 巧妙
吉林省: 理念 思路 现代 全新 巧妙
四川省: 巧妙 独特 全新 理念 合理使用方便
天津市: 独特 巧妙 合理使用方便 巧妙使用方便 全新
宁夏回族自治区: 更具 多样 丰富 性价比 市场前景
安徽省: 巧妙 独特 巧妙使用方便 合理 合理结构紧凑
山东省: 巧妙 精巧 新颖结构合理 巧妙结构合理 全新
山西省: 独特 全新 科学 现代 已有
广东省: 巧妙 独特 创新性 合理 精巧
广西壮族自治区: 独特 巧妙 合理使用方便 精巧 合理实用性
新疆维吾尔自治区: 合理使用方便 巧妙 简单合理 科学合理 合理
江苏省: 巧妙 独特 合理 全新 科学
江西省: 独特 科学合理 巧妙 简洁 精巧
河北省: 巧妙 新颖使用方便 精巧 独特 理念
河南省: 巧妙 科学合理 独特 新颖使用方便 巧妙使用方便
浙江省: 独特 巧妙 科学 精巧 合理
海南省: 思路 科学合理 人性化 科学 独特
湖北省: 巧妙 巧妙合理 科学合理 独特 新颖结构合理
湖南省: 巧妙 精巧 独特 新颖独特 巧妙结构合理
澳门特别行政区: 撞击 边坡 溜槽 耐高温 材料制成
甘肃省: 独特 全新 理念 现代 普及
福建省: 巧妙 新颖使用方便 独特 巧妙结构合理 全新
西藏自治区: 既能 十分 疲劳 范围广 更加人性化
贵州省: 巧妙 独特 科学合理 合理使用方便 精巧
辽宁省: 巧妙 巧妙结构合理 新颖结构合理 新颖独特 独创
重庆市: 合理使用方便 科学合理 精巧 全新 新颖使用方便
陕西省: 巧妙 独特 新颖结构合理 合理 合理使用方便
青海省: 突破 织机 现行 经济环保 杆织机
香港特别行政区: 全新 更具 美感 市场 丰富
黑龙江省: 独特 精巧 科学合理 巧妙 构思
</code></pre></div><p>从上面的运行结果看，除青海省、澳门特别行政区、西藏自治区等少数语料规模较小的地区外，绝大多数省份的 Word2vec 都较准确地捕捉到了专利摘要的语义信息。</p>
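<p><code>most_similar</code> 的底层就是余弦相似度。下面用 numpy 写一个简化版示意（假设词向量已放在一个 dict 中；与 gensim 的实现细节不完全相同，仅用于说明原理）：</p>

```python
import numpy as np

def most_similar(query_words, vectors, topn=5):
    """对多个查询词的向量取均值, 再按余弦相似度返回最接近的 topn 个词"""
    q = np.mean([vectors[w] for w in query_words], axis=0)
    sims = []
    for w, v in vectors.items():
        if w in query_words:
            continue  # 跳过查询词本身
        sims.append((w, float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]
```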
<br>
<h3 id="34-语义检查-年份">3.4 语义检查-年份</h3>
<p>查看与[&lsquo;创新&rsquo;, &lsquo;新颖&rsquo;]最相似的5个词，通过语义捕捉准确与否，粗略判断Word2vec训练效果的好坏。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">year_w2v_f</span><span class="p">,</span> <span class="n">year_w2v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">year_w2v_fs</span><span class="p">,</span> <span class="n">year_w2vs_</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">year</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">year_w2v_f</span><span class="p">)</span><span class="o">.</span><span class="n">stem</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">wordweigths</span> <span class="o">=</span> <span class="n">year_w2v</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;新颖&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
        <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span><span class="p">,</span><span class="n">p</span> <span class="ow">in</span> <span class="n">wordweigths</span><span class="p">]</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="s2">&#34; &#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>  <span class="c1"># 查询词不在该模型词表中</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">: NA&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1985: 公知 专门 特别适用 提出一种 机器人
1986: 植入 理发 135 婴儿车 街道
1987: 落后 儿童智力开发 低档 胶鞋 指甲
1988: 目前市场 低档 捕鱼 证件 普遍使用
1989: 价廉物美 课堂教学 普及型 大众 得心应手
1990: 单纯 精简 多方面 机等 应用领域
1991: 普及型 前途 大众 现代 现代化
1992: 构思 保留传统 不失为 全新 机之
1993: 完美 现代科技 崭新 式样 边墙
1994: 样式 别致 造型新颖 显得 华贵
1995: 多样 结实耐用 独特 别致 高档
1996: 花样 耐冲击浮标 极其 式样新颖 应用范围
1997: 实用美观 室内外装饰 标准化 多变 形象逼真
1998: 现代 改革 开发 高雅 创造
1999: 现代 市场 越来越 款式 大方
2000: 全新 娱乐性趣味性 多样化 多方面 各种各样
2001: 完美 现代 新颖别致 多样化 体现
2002: 全新 现代 实为 科学 大众
2003: 全新 突破传统 多样化 体现 科学
2004: 全新 科技 现代 市场 科学合理
2005: 创意 理念 全新 科学 现代
2006: 全新 构思 理念 新颖性 独特
2007: 全新 突破传统 巧妙 独特 现代
2008: 独特 巧妙 全新 新颖独特 设计理念
2009: 独特 巧妙 全新 科学 新颖独特
2010: 独特 新颖独特 精巧 巧妙 科学合理
2011: 独特 精巧 新颖独特 科学合理 巧妙
2012: 独特 巧妙 新颖独特 精巧 科学合理
2013: 独特 新颖独特 精巧 科学合理 巧妙
2014: 巧妙 独特 科学合理 巧妙合理 精巧
2015: 巧妙 独特 巧妙合理 新颖结构合理 科学合理
2016: 巧妙 结合现在 巧妙合理 独特 全新
2017: 巧妙 科学合理 独特 科学 合理
2018: 巧妙 独特 合理 科学 全新
2019: 巧妙 独特 合理 全新 精巧
2020: 巧妙 独特 合理 精巧 设计
2021: 巧妙 精巧 合理 新颖结构合理 全新
2022: 巧妙 全新 独特 创新性 精巧
2023: 巧妙 独特 全新 创新性 精巧
2024: 巧妙 创新性 独特 精巧 全新
2025: 相结合 双重 独特 多重 优势
</code></pre></div><p>也试了其他词语，大体上 1998 年之后的模型捕捉的语义是准确的；更早年份语料规模较小，效果较差。</p>
<br>
<h2 id="四研究潜力-语义变迁研究方法介绍">四、研究潜力: 语义变迁研究方法介绍</h2>
<p>假设语义都很准的话，是可以研究 <strong>语义变迁</strong> 或者 <strong>语义差异</strong> 的。但需要注意，不能直接用两个年份或两个省份模型中 word1 和 word2 的距离来体现语义的变迁或差异。如果想做省份间差异或某省份随时间的变化，需要用到 <strong>对齐算法</strong>，常用的是 <strong>正交Procrustes矩阵对齐</strong>，使得同省份不同年份、或同年份不同省份的 word2vec 拥有相同的语义空间。</p>
<h3 id="41-正交procrustes算法">4.1 正交Procrustes算法</h3>
<p>正交Procrustes矩阵对齐是一种将两个预训练词向量矩阵对齐的方法，使它们在同一语义空间中表示。具体来说，它求解一个正交矩阵，对其中一个词向量矩阵做线性变换，使变换后两个矩阵之差的Frobenius范数最小，从而实现对齐。在 cntext2.x 中实现了Procrustes对齐函数 <em><strong>ct.procrustes_align()</strong></em>，具体可阅读 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">文本分析库cntext2.x使用手册</a></p>
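<p>正交Procrustes有闭式解：对 <code>A.T @ B</code> 做 SVD 得到 <code>U, S, Vt</code>，最优正交矩阵即为 <code>Q = U @ Vt</code>。下面用 numpy 写一个最小示意（并非 cntext 的源码，仅展示其数学原理；实际使用请直接调用 <em><strong>ct.procrustes_align()</strong></em>）：</p>

```python
import numpy as np

def procrustes(A, B):
    """求正交矩阵 Q, 使 ||A @ Q - B||_F 最小, 返回对齐后的 A 与 Q"""
    U, _, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt          # 最优正交变换
    return A @ Q, Q
```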
<br>
<h3 id="42-语义变迁流程图">4.2 语义变迁流程图</h3>
<p>语义变迁类研究的流程图可参考 <a href="https://github.com/Living-with-machines/DiachronicEmb-BigHistData">DiachronicEmb-BigHistData</a></p>
<p><img loading="lazy" src="img/w2v-time-shifting-flowchart.png" alt=""  />
</p>
<p>大邓在 <a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a> 实现了历时语义对齐， 可以看出70 年整个中国社会的认知变迁。</p>
<br>
<h3 id="43-识别语义变化时间点">4.3 识别语义变化时间点</h3>
<p>该项目研究了 1800-1910 年间的语料，以每 10 年为一个单位训练词向量，探究词义变化。以 <em><strong>railway</strong></em> 和 <em><strong>traffic</strong></em> 为例，先用余弦相似度(cosine-similarity)识别词语语义发生变化的时间点，如下图</p>
<p><img loading="lazy" src="img/consine-sim-cpdetection.png" alt=""  />
</p>
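<p>这种检测思路可以简化为：把同一词语在相邻时期（对齐后）的向量两两计算余弦相似度，相似度骤降的相邻期即为语义变化的候选时间点。下面是一个 numpy 小示意（<code>change_points</code> 为示意用的杜撰函数，阈值需结合自己的数据确定）：</p>

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def change_points(vecs_by_period, threshold=0.5):
    """vecs_by_period: {时期: 该词在此期(已对齐)的向量}
    返回余弦相似度低于阈值的相邻时期对, 视为语义突变点"""
    periods = sorted(vecs_by_period)
    return [(a, b, cosine(vecs_by_period[a], vecs_by_period[b]))
            for a, b in zip(periods, periods[1:])
            if cosine(vecs_by_period[a], vecs_by_period[b]) < threshold]
```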
<br>
<h3 id="44-绘制语义变化轨迹">4.4 绘制语义变化轨迹</h3>
<p>语义变化轨迹</p>
<p><img loading="lazy" src="img/railway-time-shifting.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="五获取资源">五、获取资源</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费  专利摘要-Word2Vec.200.15.bin  https://pan.baidu.com/s/1CgBjy96hDKM2GKQY4G6kYA?pwd=ba92

- 免费  province_w2vs                https://pan.baidu.com/s/1eBFTIZcv2DWssLiaRnCqZQ?pwd=ikpu

- 免费  year_w2vs                    https://pan.baidu.com/s/1lrVkML92cVJdHQa1HQyAwA?pwd=4gqa
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 人民日报语料反映七十年文化演变</title>
      <link>https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/</guid>
      <description>使用人民日报1946-2023年之间的新闻数据，通过语义距离刻画文化的变迁。</description>
      <content:encoded><![CDATA[<h2 id="一引言">一、引言</h2>
<p>社会文化是一个不断演变的复杂系统，受到历史、科技、经济和社会变革等多种因素的影响。随着时代的推移，人们的语言使用和文化认知也经历着变迁，反映着社会的发展脉络。在这个背景下，使用Word2Vec等词嵌入技术来研究社会文化变迁和刻板印象的重要性日益凸显。</p>
<p>Word2Vec作为一种词向量表示方法，通过将词汇映射到高维空间中的向量，有效地捕捉了词语之间的语义关系。这使得我们能够以全新的方式理解语言的演变和文化认知的转变。通过对比不同时期的Word2Vec模型，我们可以深入挖掘语言的时代特征，捕捉到文化观念、价值观念以及社会角色的演变。</p>
<p>研究社会文化变迁和刻板印象，不仅有助于解构历史时刻下的社会结构和文化动态，还能为我们提供深刻的洞察力，揭示出社会变迁中潜在的驱动力和趋势。这种研究有助于建构更为全面、客观的历史记忆，帮助我们更好地理解人类行为背后的深层次原因。</p>
<p><br><br></p>
<h2 id="二训练模型">二、训练模型</h2>
<h3 id="21-获取数据">2.1 获取数据</h3>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">新闻数据集 | 含 人民日报/经济日报/光明日报 等数十家媒体(2024.05)</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">文本分析库 cntext 使用手册</a></li>
</ul>
<br>
<h3 id="22--构造语料">2.2  构造语料</h3>
<p>本文使用 <em><strong>rmrb.csv.gz</strong></em> 数据集，对该数据集感兴趣的同学，可点击查看 <a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">新闻数据集 | 含 人民日报/经济日报/光明日报 等数十家媒体(2024.05)</a>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># 读取人民日报数据集</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;rmrb.csv.gz&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>

<span class="c1"># 每5年构造一个语料txt文件</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">freq_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;5YE&#39;</span><span class="p">)):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">date</span><span class="p">)</span>
    <span class="n">corpus_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;corpus&#39;</span><span class="p">)</span>
    <span class="n">corpus_dir</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">corpus_file</span> <span class="o">=</span> <span class="n">corpus_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="si">}</span><span class="s2">.txt&#34;</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">corpus_file</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">text_series</span> <span class="o">=</span> <span class="n">freq_df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
        <span class="n">raw_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">text_series</span><span class="p">)</span>
        <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1946-12-31 00:00:00
1951-12-31 00:00:00
1956-12-31 00:00:00
1961-12-31 00:00:00
1966-12-31 00:00:00
1971-12-31 00:00:00
1976-12-31 00:00:00
1981-12-31 00:00:00
1986-12-31 00:00:00
1991-12-31 00:00:00
1996-12-31 00:00:00
2001-12-31 00:00:00
2006-12-31 00:00:00
2011-12-31 00:00:00
2016-12-31 00:00:00
2021-12-31 00:00:00
2026-12-31 00:00:00
CPU times: user 2.64 s, sys: 1.54 s, total: 4.18 s
Wall time: 5.29 s
</code></pre></div><p><img loading="lazy" src="img/01-corpus.jpg" alt=""  />
</p>
<p>语料 txt 的命名规则：每个 <em><strong>year.txt</strong></em> 实际存储的是 <em><strong>year-5</strong></em> ~ <em><strong>year</strong></em> 期间的新闻数据。</p>
<p><em><strong>1946.txt</strong></em> 内实际上只存储了 <em><strong>1946.5.15</strong></em> ~ <em><strong>1946.12.31</strong></em> 之间半年多的数据， 由于数据量太小，后续训练出的 <em><strong>word2vec</strong></em> 模型，其语义大概率不准。</p>
<p><em><strong>2006.txt</strong></em> 存储了 <em><strong>2002.1.1 ~ 2006.12.31</strong></em> 之间所有的数据。</p>
<p>而 <em><strong>2026.txt</strong></em> 则存储了 <em><strong>2022.1.1 ~ 2026.12.31</strong></em> 之间所有的数据</p>
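<p>上述分桶规则可以写成一个小函数来验证（<code>corpus_file_for</code> 为示意用的杜撰函数，锚点取语料起始年 1946，与前文 <code>pd.Grouper(key='date', freq='5YE')</code> 输出的分组结果应当一致）：</p>

```python
def corpus_file_for(year, start=1946, step=5):
    """返回 year 所属语料文件名: 每 step 年一桶, 以桶的结束年份命名"""
    offset = year - start
    end = start + -(-offset // step) * step   # 把 offset 向上取整到 step 的倍数
    return f'{end}.txt'
```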
<br>
<h2 id="三训练word2vec">三、训练word2vec</h2>
<h3 id="31-配置环境">3.1 配置环境</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="32-开始训练">3.2 开始训练</h3>
<p>训练代码比较简单，已经封装到 <strong>cntext</strong> 中，只需几行代码即可。且 cntext 对代码进行了优化，训练速度更快，内存占用更小。</p>
<p>训练环境为 Mac，内存 96G；大家可以试试 16G、32G 内存的电脑，应该也能跑通。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">glob</span>


<span class="c1"># 获取corpus文件夹内的所有语料txt文件的文件路径</span>
<span class="n">corpus_files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;corpus/*.txt&#39;</span><span class="p">))</span>
<span class="k">for</span> <span class="n">corpus_file</span> <span class="ow">in</span> <span class="n">corpus_files</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">corpus_file</span><span class="p">)</span>
    <span class="c1"># 结果自动保存到output文件夹内</span>
    <span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="n">corpus_file</span><span class="p">,</span>
                      <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                      <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                      <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">corpus/1946.txt
Mac(Linux) System, Enable Parallel Processing
Cache output/1946_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|███████████████████| 5954/5954 [00:07&lt;00:00, 757.68it/s]
Reading Preprocessed Corpus from output/1946_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 16 s. 
Output Saved To: output/1946-Word2Vec.200.15.bin
......
......
......
corpus/2026.txt
Mac(Linux) System, Enable Parallel Processing
Cache output/2026_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 100%|██████████████| 105037/105037 [00:34&lt;00:00, 3075.29it/s]
Reading Preprocessed Corpus from output/2026_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 194 s. 
Output Saved To: output/2026-Word2Vec.200.15.bin
CPU times: user 2h 38min 4s, sys: 4min 41s, total: 2h 42min 45s
Wall time: 1h 5min 39s
</code></pre></div><p><img loading="lazy" src="img/02-word2vec.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四检查模型">四、检查模型</h2>
<p>现在我们要检查模型， 为了方便，我就随机抽查 1946/1981/2001/2026， 查看这四个模型关于「工业」的近义词，看模型语义捕捉的准不准。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">mfiles</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;output/1946-Word2Vec.200.15.bin&#39;</span><span class="p">,</span>
          <span class="s1">&#39;output/1981-Word2Vec.200.15.bin&#39;</span><span class="p">,</span>
          <span class="s1">&#39;output/2001-Word2Vec.200.15.bin&#39;</span><span class="p">,</span>
          <span class="s1">&#39;output/2026-Word2Vec.200.15.bin&#39;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">mfile</span> <span class="ow">in</span> <span class="n">mfiles</span><span class="p">:</span>
    <span class="n">w2v_model</span>  <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">mfile</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">mfile</span><span class="p">)</span>
    <span class="n">word_scores</span> <span class="o">=</span> <span class="n">w2v_model</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;工业&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">word_scores</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">output/1946-Word2Vec.200.15.bin
市场 0.9601176381111145
重工业 0.9589242935180664
工业部门 0.9484396576881409
物价 0.9464751482009888
商业 0.9423016309738159
工业原料 0.9378510117530823
焦煤 0.9368941783905029
物价暴涨 0.9348677396774292
第一年 0.9331346154212952
农产品 0.9329909682273865
输入 0.9329512119293213
农业 0.9327669143676758
水平 0.9323830008506775
通货 0.9320995807647705
国民经济 0.9268764853477478
投资 0.9261932373046875
输出 0.9258642792701721
钢铁工厂 0.925421953201294
工业生产 0.9251945614814758
十三亿 0.9251589775085449

Loading output/1981-Word2Vec.200.15.bin...
output/1981-Word2Vec.200.15.bin
工业部门 0.6696006655693054
重工业 0.6490920782089233
建筑业 0.6461381316184998
轻工业 0.6443966627120972
工业生产 0.6364479064941406
机器制造业 0.6220380067825317
化学工业 0.6116607785224915
钢铁工业 0.5941601991653442
加工工业 0.5932750701904297
电子工业 0.5880091190338135
轻纺工业 0.5786471366882324
食品工业 0.5777474045753479
重工业轻工业 0.5734774470329285
民用工业 0.5729294419288635
消费品生产 0.5721379518508911
纺织业 0.56629878282547
农业轻工业 0.5642068982124329
机器制造 0.5622154474258423
制造业 0.5620284676551819
化工 0.5588406324386597

Loading output/2001-Word2Vec.200.15.bin...
output/2001-Word2Vec.200.15.bin
重工业 0.6766582727432251
工业生产 0.6742461323738098
制造业 0.641242504119873
轻工业 0.615958571434021
传统产业 0.6039909720420837
加工工业 0.5936708450317383
机械电子 0.5892737507820129
工业部门 0.5891364216804504
轻工 0.5785651803016663
化学工业 0.5783289670944214
纺织 0.5708677172660828
支柱行业 0.5655868053436279
钢铁工业 0.5648497939109802
化工 0.5617026686668396
机械工业 0.5609593987464905
振兴国防科技 0.5588745474815369
纺织业 0.5520373582839966
工业体系 0.5505329370498657
工业总产值 0.5477191805839539
冶金纺织 0.5463222861289978

Loading output/2026-Word2Vec.200.15.bin...
output/2026-Word2Vec.200.15.bin
制造业 0.6705414056777954
工业生产 0.6067013144493103
智能制造 0.5936543941497803
轻工业 0.5885797142982483
钢铁行业 0.5884692072868347
化工 0.5675483345985413
钢铁企业 0.5637045502662659
工业互联网 0.559167742729187
装备制造业 0.5545477271080017
制造 0.5482359528541565
建筑业 0.5467448234558105
冶金 0.5400071740150452
规模工业 0.5395020246505737
重工业 0.537196695804596
钢铁 0.5245063304901123
工业遗产 0.5208563804626465
钢铁工业 0.5142995715141296
改数 0.512413740158081
纺织业 0.5109716653823853
规上工业 0.5082385540008545
</code></pre></div><p>从这四个年代可以看到，人们对「<strong>工业</strong>」的认识发生了变化：建国初期一穷二白，工农业等领域经济凋敝，「工业」的近邻词多与物价、通货相关；而 2026 年的「<strong>工业</strong>」已走向工业现代化，更加注重制造业、智能制造、工业互联网、装备制造业等概念。</p>
<p><br><br></p>
<h2 id="五对齐模型">五、对齐模型</h2>
<h3 id="51--为什么要进行对齐">5.1  为什么要进行对齐?</h3>
<p>Word2Vec 是一种词嵌入（word embedding）算法，它将词语映射到高维空间中的向量，使得语义相近的词在该空间中距离较近。然而，不同年份的 Word2Vec 模型在训练时受语料库、训练参数等因素的影响，各自的向量空间之间存在一定差异，所以不能直接拿不同年份的模型进行语义比较。</p>
<p><strong>Procrustes对齐算法目的是通过线性变换来使两个向量空间尽可能地对齐，以便进行比较</strong>。这个过程涉及到对两个向量空间进行旋转、缩放和平移等变换，使它们在某种意义上尽量一致。</p>
<p>具体原因包括：</p>
<ol>
<li><strong>词汇漂移（Lexical Drift）：</strong> 随着时间的推移，词汇的含义和使用可能发生变化，导致不同年份的语料库中的词语存在一定的漂移。Procrustes分析可以在一定程度上对齐这种漂移。</li>
<li><strong>训练参数不同：</strong> Word2Vec模型的训练参数，如窗口大小、迭代次数等，可能在不同年份有所不同，导致生成的向量空间差异较大。</li>
<li><strong>语料库的差异：</strong> 不同年份的语料库可能覆盖的主题、文体等存在差异，这也会影响词向量的学习结果。</li>
</ol>
<p>通过Procrustes对齐，可以在一定程度上解决这些问题，使得不同年份的Word2Vec模型在语义上更具可比性。这有助于在跨时间的语料库中进行一致的语义分析。</p>
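<p>Procrustes 对齐的核心一步可以用几行 numpy 还原。下面是一个极简示意(并非 cntext 中 procrustes_align 的真实源码)：先对两套词向量分别做 L2 归一化，再用 SVD 求出把 other 空间旋转到 base 空间的正交矩阵 R：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

def toy_procrustes_align(base, other):
    # L2 归一化，只保留方向信息
    base = base / np.linalg.norm(base, axis=1, keepdims=True)
    other = other / np.linalg.norm(other, axis=1, keepdims=True)
    # 对 other^T @ base 做 SVD，u @ vt 即最优正交旋转矩阵 R
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 20))
q, _ = np.linalg.qr(rng.normal(size=(20, 20)))  # 随机正交旋转
other = base @ q                                # 模拟另一年份的向量空间
aligned = toy_procrustes_align(base, other)
base_unit = base / np.linalg.norm(base, axis=1, keepdims=True)
print(np.allclose(aligned, base_unit, atol=1e-8))  # True：旋转被完全还原
</code></pre></div><p>由于 R 是正交矩阵，对齐只是更换坐标系，不会改变 other 空间内部词与词之间的余弦相似度。</p>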
<br>
<h3 id="52-对齐之后">5.2 对齐之后</h3>
<p>利用对齐后的 Word2Vec 模型，可以进行以下语义变迁研究：</p>
<ol>
<li><strong>词义演变：</strong> 比较不同年份相同词汇的词向量，观察其在向量空间中的位置变化，分析词义在语义空间中的演变趋势。</li>
<li><strong>语境变迁：</strong> 考察同一词语在不同年份的上下文中的变化，了解词语在不同语境下的语义演变情况。</li>
<li><strong>主题变迁：</strong> 通过对齐后的向量空间，分析不同年份语料库中词语的主题分布变化，探讨社会、文化因素对语言使用的影响。</li>
<li><strong>时代特征分析：</strong> 通过对比不同年份的模型，识别出每个时期在词向量空间中的独特特征，从而揭示时代背景对语义的影响。</li>
<li><strong>探索新兴词汇：</strong> 通过对比不同年份的模型，发现在语义空间中新兴词汇的出现和演变，了解新兴概念和文化趋势。</li>
</ol>
<p>总的来说，通过对齐Word2Vec模型，你可以更准确地比较不同年份的语料库，深入研究语义的演变和语言使用的变迁。这有助于揭示社会、文化、科技等方面的发展对语言表达的影响。</p>
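<p>其中「词义演变」最直接的量化方式是：取同一个词在不同年份模型(已对齐)中的向量，与基准年份计算余弦相似度，相似度越低说明词义漂移越大。下面用假想的玩具向量示意：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 玩具示例：假设同一词在三个年份的(已对齐)向量如下
vecs = {1946: np.array([1.0, 0.0]),
        1981: np.array([0.9, 0.4]),
        2016: np.array([0.2, 1.0])}
base = vecs[1946]
drift = {year: round(cosine(base, v), 3) for year, v in vecs.items()}
print(drift)  # 相似度随年份递减，说明该词词义逐渐漂移
</code></pre></div>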
<br>
<h3 id="53-对齐代码">5.3 对齐代码</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">glob</span>

<span class="c1"># 基准模型</span>
<span class="n">base_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/2026-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>

<span class="c1">#将其他模型与基准模型对齐</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;output/*.bin&#39;</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="n">other_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="n">procrusted_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">procrustes_align</span><span class="p">(</span><span class="n">base_wv</span><span class="o">=</span><span class="n">base_wv</span><span class="p">,</span>
                                         <span class="n">other_wv</span><span class="o">=</span><span class="n">other_wv</span><span class="p">)</span>
    <span class="c1"># win</span>
    <span class="c1">#year = file.split(&#39;\\&#39;)[-1][:4]</span>
    
    <span class="c1"># mac</span>
    <span class="n">year</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:</span><span class="mi">4</span><span class="p">]</span>
    
    <span class="n">output_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;Aligned_Word2Vec&#39;</span><span class="p">)</span>
    <span class="n">output_dir</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">procrusted_w2v</span><span class="o">.</span><span class="n">save_word2vec_format</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Aligned_Word2Vec/</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">.200.15.bin&#39;</span><span class="p">,</span> <span class="n">binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/2026-Word2Vec.200.15.bin...
output/1956-Word2Vec.200.15.bin
Loading output/1956-Word2Vec.200.15.bin...

output/2026-Word2Vec.200.15.bin
Loading output/2026-Word2Vec.200.15.bin...

output/2021-Word2Vec.200.15.bin
Loading output/2021-Word2Vec.200.15.bin...

output/1951-Word2Vec.200.15.bin
Loading output/1951-Word2Vec.200.15.bin...

output/1946-Word2Vec.200.15.bin
Loading output/1946-Word2Vec.200.15.bin...

output/2001-Word2Vec.200.15.bin
Loading output/2001-Word2Vec.200.15.bin...

output/1981-Word2Vec.200.15.bin
Loading output/1981-Word2Vec.200.15.bin...

output/1971-Word2Vec.200.15.bin
Loading output/1971-Word2Vec.200.15.bin...

output/1976-Word2Vec.200.15.bin
Loading output/1976-Word2Vec.200.15.bin...

output/2006-Word2Vec.200.15.bin
Loading output/2006-Word2Vec.200.15.bin...

output/1986-Word2Vec.200.15.bin
Loading output/1986-Word2Vec.200.15.bin...

output/1961-Word2Vec.200.15.bin
Loading output/1961-Word2Vec.200.15.bin...

output/2011-Word2Vec.200.15.bin
Loading output/2011-Word2Vec.200.15.bin...

output/1991-Word2Vec.200.15.bin
Loading output/1991-Word2Vec.200.15.bin...

output/2016-Word2Vec.200.15.bin
Loading output/2016-Word2Vec.200.15.bin...

output/1996-Word2Vec.200.15.bin
Loading output/1996-Word2Vec.200.15.bin...

output/1966-Word2Vec.200.15.bin
Loading output/1966-Word2Vec.200.15.bin...

CPU times: user 1min 8s, sys: 49.7 s, total: 1min 58s
Wall time: 46.3 s
</code></pre></div><p><img loading="lazy" src="img/03-align.png" alt=""  />
</p>
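<p>顺带一提，上面代码中按 win/mac 分别写 split('\\') 或 split('/') 取年份的做法，可以用 pathlib 统一(示意)：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pathlib import Path

file = 'output/1981-Word2Vec.200.15.bin'
# Path(...).name 在 Windows 与 macOS/Linux 下都能取到文件名本身
year = Path(file).name[:4]
print(year)  # 1981
</code></pre></div>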
<p><br><br></p>
<h2 id="六实验-文化变迁">六、实验-文化变迁</h2>
<p>时代的宣传会深刻地影响社会认知，不同时代的语料中蕴含着不同的文化特征，例如语义距离的变化。这里演示将两个对立词组分别与目标词组计算语义距离，根据语义距离差异反映刻板印象等态度偏见，这其实也反映了文化变迁。</p>
<h3 id="61-性别与成功">6.1 性别与成功</h3>
<p>男性、女性与成功之间的语义距离</p>
<p><strong>cntext</strong> 内置了两种算法：语义投影和语义距离。这里使用语义距离：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">distance = distance(女, 成功) - distance(男, 成功)
</code></pre></div><p>如果 distance 趋近于 0，男女在「成功」概念上语义接近，无明显刻板印象。</p>
<p>但当 distance 明显大于 0 时，说明人们聊到「成功」概念时更容易联想到男性，而不是女性。</p>
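<p>这个差值的计算思路可以用 numpy 手工还原。下面是一个简化示意(并非 ct.sematic_distance 的真实实现，这里用词组均值向量之间的余弦距离演示思路)：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

def group_distance(wv, words, group):
    # 两个词组均值向量之间的余弦距离(1 - 余弦相似度)
    v1 = np.mean([wv[w] for w in words], axis=0)
    v2 = np.mean([wv[w] for w in group], axis=0)
    cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return 1 - cos

# 玩具词向量：刻意构造成「成功」更靠近「男」
wv = {'成功': np.array([1.0, 0.1]),
      '男': np.array([1.0, 0.0]),
      '女': np.array([0.0, 1.0])}
d = group_distance(wv, ['成功'], ['女']) - group_distance(wv, ['成功'], ['男'])
print(d > 0)  # True：该玩具空间中「成功」与男性更接近
</code></pre></div>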
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import cntext as ct
import pandas as pd
import glob

gender_suceess_data = []

words = [&#39;成功&#39;, &#39;成就&#39;, &#39;胜利&#39;]
c_words1 = [&#39;女&#39;, &#39;女人&#39;, &#39;她&#39;, &#39;母亲&#39;, &#39;女儿&#39;, &#39;奶奶&#39;]
c_words2 = [&#39;男&#39;, &#39;男人&#39;, &#39;他&#39;, &#39;父亲&#39;, &#39;儿子&#39;, &#39;爷爷&#39;]

# 当前代码所处文件 与 Aligned_Word2Vec 处于同一文件夹内
mfiles = sorted(glob.glob(&#39;Aligned_Word2Vec/*.bin&#39;))
for file in mfiles:
    distance = ct.sematic_distance(wv=ct.load_w2v(file),
                                   words=words, 
                                   c_words1=c_words1, 
                                   c_words2=c_words2)
    data = dict()
    data[&#39;year&#39;] = file.split(&#39;/&#39;)[-1][:4]
    data[&#39;distance&#39;] = distance
    gender_suceess_data.append(data)
    

gender_success_df = pd.DataFrame(gender_suceess_data)
gender_success_df
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">ct</span><span class="o">.</span><span class="n">matplotlib_chinese</span><span class="p">()</span> <span class="c1">#为正常显示中文</span>

<span class="n">gender_suceess_data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;成功&#39;</span><span class="p">,</span> <span class="s1">&#39;成就&#39;</span><span class="p">,</span> <span class="s1">&#39;胜利&#39;</span><span class="p">]</span>
<span class="n">c_words1</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;女&#39;</span><span class="p">,</span> <span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;她&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">,</span> <span class="s1">&#39;女儿&#39;</span><span class="p">,</span> <span class="s1">&#39;奶奶&#39;</span><span class="p">]</span>
<span class="n">c_words2</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;男&#39;</span><span class="p">,</span> <span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;他&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">,</span> <span class="s1">&#39;儿子&#39;</span><span class="p">,</span> <span class="s1">&#39;爷爷&#39;</span><span class="p">]</span>

<span class="n">mfiles</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;Aligned_Word2Vec/*.bin&#39;</span><span class="p">))</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">mfiles</span><span class="p">:</span>
    <span class="n">distance</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">sematic_distance</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">file</span><span class="p">),</span>
                                   <span class="n">words</span><span class="o">=</span><span class="n">words</span><span class="p">,</span> 
                                   <span class="n">c_words1</span><span class="o">=</span><span class="n">c_words1</span><span class="p">,</span> 
                                   <span class="n">c_words2</span><span class="o">=</span><span class="n">c_words2</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:</span><span class="mi">4</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;distance&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">distance</span>
    <span class="n">gender_suceess_data</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    
<span class="n">gender_success_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">gender_suceess_data</span><span class="p">)</span>
<span class="n">gender_success_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s1">&#39;人民日报在「成就」概念的文化变迁&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;大于0表示社会更容易将成功与男性联系起来&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-gender.png" alt=""  />
</p>
<p>从图中可以看到， 新中国初期， 我国的女性解放运动在全世界都是领先的，成果十分卓著。而今耳熟能详的口号恰好说明当时的宣传已经刻入每个中国人的认知中，如</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 谁说女子不如男
- 不爱红装爱武装
- 女人撑起半边天
...
</code></pre></div><br>
<p>新中国初期，由于注重破除性别刻板印象，宣传更加中性，树立榜样时也考虑了性别的平衡。而随着时间推移，口号式的宣传运动沉寂之后，历史的惯性(传统文化的基因)可能重新显现：提到「成功」概念时，社会更容易将「成功」与「男性」联系起来。</p>
<br>
<h3 id="52-性别与责任">6.2 性别与责任</h3>
<p>成就与男性有更高的关联，这背后是否意味着传统文化建构的社会要求男性承担远多于女性的责任？</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">ct</span><span class="o">.</span><span class="n">matplotlib_chinese</span><span class="p">()</span> <span class="c1">#为正常显示中文</span>

<span class="n">gender_responsibility_data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">words</span> <span class="o">=</span>   <span class="p">[</span><span class="s1">&#39;责任&#39;</span><span class="p">,</span> <span class="s1">&#39;重担&#39;</span><span class="p">,</span> <span class="s1">&#39;担当&#39;</span><span class="p">]</span>
<span class="n">c_words1</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;女&#39;</span><span class="p">,</span> <span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;她&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">,</span> <span class="s1">&#39;女儿&#39;</span><span class="p">,</span> <span class="s1">&#39;奶奶&#39;</span><span class="p">]</span>
<span class="n">c_words2</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;男&#39;</span><span class="p">,</span> <span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;他&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">,</span> <span class="s1">&#39;儿子&#39;</span><span class="p">,</span> <span class="s1">&#39;爷爷&#39;</span><span class="p">]</span>

<span class="n">mfiles</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;Aligned_Word2Vec/*.bin&#39;</span><span class="p">))</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">mfiles</span><span class="p">:</span>
    <span class="n">distance</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">sematic_distance</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">file</span><span class="p">),</span>
                                   <span class="n">words</span><span class="o">=</span><span class="n">words</span><span class="p">,</span> 
                                   <span class="n">c_words1</span><span class="o">=</span><span class="n">c_words1</span><span class="p">,</span> 
                                   <span class="n">c_words2</span><span class="o">=</span><span class="n">c_words2</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:</span><span class="mi">4</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;distance&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">distance</span>
    <span class="n">gender_responsibility_data</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    
<span class="n">gender_responsibility_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">gender_responsibility_data</span><span class="p">)</span>
<span class="n">gender_responsibility_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s1">&#39;人民日报在「责任」语义的文化变迁&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;大于0表示社会更容易将「责任」与男性联系起来&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-responsibility.png" alt=""  />
</p>
<p>从图中可以看出，在大多数年份， distance是大于0的，即 提到「责任」概念时，社会更容易联想到「男性」，而不是「女性」。</p>
<p><br><br></p>
<h2 id="七获取资源">七、获取资源</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费   Word2Vec          https://pan.baidu.com/s/1Ru_wxu9egsmhM7lATjSlgQ?pwd=bcea

- 免费   Aligned_Word2Vec  https://pan.baidu.com/s/1IVgP0MyQpez0hpoJyEyFdA?pwd=7qsu
</code></pre></div><br>
<br>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]冉雅璇,李志强,刘佳妮,张逸石.大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用[J].南开管理评论:1-27.
[2]Hamilton, William L., Jure Leskovec, and Dan Jurafsky. &#34;Diachronic word embeddings reveal statistical laws of semantic change.&#34; arXiv preprint arXiv:1605.09096 (2016).
[3]Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. &#34;Word embeddings quantify 100 years of gender and ethnic stereotypes.&#34; Proceedings of the National Academy of Sciences 115, no. 16 (2018): E3635-E3644.
[4]Aceves, Pedro, and James A. Evans. &#34;Mobilizing conceptual spaces: How word embedding models can inform measurement and theory within organization science.&#34; Organization Science (2023).
[5]Kozlowski, A.C., Taddy, M. and Evans, J.A., 2019. The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), pp.905-949.
</code></pre></div><br>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总</a></li>
<li><a href="https://textdata.cn/blog/2022-04-09-literature-about-embeddings/">文献汇总 | 词嵌入 与 社会科学中的偏见(态度)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量  | 使用<strong>人民网领导留言板</strong>语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 GloVe 模型</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>词向量  | 使用人民网领导留言板语料训练 Word2Vec 模型</title>
      <link>https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/</guid>
      <description>&lt;p&gt;本文使用 3.88G 语料训练得到词汇量近 150w 的 Word2Vec 模型，使用该模型，可以用于寻找近义词，扩展(构建)概念词典。 &lt;strong&gt;该 Word2Vec 模型文件可在文末免费下载&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一构建语料&#34;&gt;一、构建语料&lt;/h2&gt;
&lt;p&gt;使用 &lt;a href=&#34;https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/&#34;&gt;&lt;strong&gt;数据集 | 人民网地方领导留言板原始文本(2011-2023.12)&lt;/strong&gt;&lt;/a&gt; 来构建本文的语料。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2011-2019.csv.gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言标题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;回复内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2020-2023.csv.gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言标题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;回复内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言标题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;回复内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言标题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;回复内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;concat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言板.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;最终得到 4.62 G 的 &lt;strong&gt;留言板.txt&lt;/strong&gt; 。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二训练模型&#34;&gt;二、训练模型&lt;/h2&gt;
&lt;h3 id=&#34;21-配置-cntext&#34;&gt;2.1 配置 cntext&lt;/h3&gt;
&lt;p&gt;将 &lt;strong&gt;cntext-2.1.6-py3-none-any.whl&lt;/strong&gt; 放置于桌面，打开&lt;strong&gt;命令行 cmd&lt;/strong&gt;(苹果电脑为 terminal)，依次执行以下命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;cd Desktop
pip3 install cntext-2.1.6-py3-none-any.whl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;cntext2.x 是付费未公开版本， 100 元，如有需要可加微信 372335839 ，备注 「姓名-学校-专业」。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-训练-word2vec&#34;&gt;2.2 训练 Word2Vec&lt;/h3&gt;
&lt;p&gt;训练 word2vec 的代码已封装进 cntext2，核心只是一次函数调用。 大邓的训练环境为 Mac，内存 96G，12 核。 代码对硬件要求不高，16G 内存也跑得动，只是速度会慢一些。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 大邓Mac 96G内存， 12核使用的代码。&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言板.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;min_count&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 考虑到大家电脑普遍8G、16G内存，保守的训练代码&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# w2v = ct.Word2Vec(corpus_file=&amp;#39;留言板.txt&amp;#39;,&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#                  vector_size=200,&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#                  window_size=15,&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#                  lang=&amp;#39;chinese&amp;#39;,&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#                  chunksize=10000,&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#                  min_count=5)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/renmin_board_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/renmin_board_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2692 s.
Output Saved To: output/留言板-Word2Vec.200.15.bin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;使用 4.62 G 的 &lt;strong&gt;&lt;em&gt;留言板.txt&lt;/em&gt;&lt;/strong&gt; ，训练了 2692 秒， 约 45 分钟。 在 &lt;strong&gt;&lt;em&gt;Python&lt;/em&gt;&lt;/strong&gt; 代码文件所在的文件夹内，出现了 &lt;strong&gt;&lt;em&gt;output&lt;/em&gt;&lt;/strong&gt; 文件夹，打开可以看到:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;renmin_board_cache.txt&lt;/em&gt;&lt;/strong&gt; 语料预处理后的缓存文件&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;留言板-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt; 训练好的模型文件&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-评估模型&#34;&gt;2.3 评估模型&lt;/h3&gt;
&lt;p&gt;使用近义法和类比法， 判断模型的表现。详情可查看&lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;文档&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   426    |    111     |            0.45            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&amp;lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   238    |    439     |   19.33    |   2.74   |
|   CityInProvince   |   175    |     0      |   100.00   |   1.01   |
| FamilyRelationship |   272    |     0      |   61.40    |   1.96   |
|   SocialScience    |    10    |     60     |   20.00    |   1.50   |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;近义测试&lt;/strong&gt;: Spearman&amp;rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型给出的相似度排序与人工打分越一致，模型表现越好。&lt;/p&gt;
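Spearman 等级相关系数的计算逻辑可以用几行纯 Python 勾勒出来(示意实现，并非 cntext 源码；函数名 `rankdata`、`spearman` 为演示用命名):

```python
def rankdata(xs):
    # 将数值转为名次, 并列值取平均名次
    sorted_xs = sorted(xs)
    return [sum(i + 1 for i, v in enumerate(sorted_xs) if v == x) / sorted_xs.count(x)
            for x in xs]

def spearman(xs, ys):
    # 对两组数值的"名次"计算 Pearson 相关, 即 Spearman 等级相关
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# 人工打分与模型相似度完全同序时, 系数为 1.0
score = spearman([1, 2, 3, 4], [0.1, 0.4, 0.5, 0.9])
assert abs(score - 1.0) < 1e-9
```

近义测试正是在模型相似度与人工打分这两列数值之间计算该系数。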
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;类比测试&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries 留言板语料在此项表现较差， 应该是语料中常见国家首都的提及较少。&lt;/li&gt;
&lt;li&gt;CityInProvince 留言板语料在此项表现如此优异，应该是语料中省份、省会地域词经常出现。&lt;/li&gt;
&lt;li&gt;FamilyRelationship 留言板中应该少不了家长里短， 所以此项准确率还可以。 以&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;年报 MD&amp;amp;A&lt;/a&gt;为例，此处准确率只有 10%, 而&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/&#34;&gt;豆瓣影评&lt;/a&gt;该处准确率高达 92.65%。&lt;/li&gt;
&lt;li&gt;SocialScience 留言板语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;整体而言，模型训练效果很不错，抓住了该数据场景独特的语义。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三使用模型&#34;&gt;三、使用模型&lt;/h2&gt;
&lt;h3 id=&#34;31-读取模型&#34;&gt;3.1 读取模型&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/留言板-Word2Vec.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;维度数:&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;词汇量: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Loading output/留言板-Word2Vec.200.15.bin...
维度数: 200
词汇量:  1050245
&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x328d737a0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-keyedvectors-的操作方法或属性&#34;&gt;3.2 KeyedVectors 的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.index_to_key&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.key_to_index&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.vector_size&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取模型中词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.get_vector(word)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;hellip;&lt;/td&gt;
&lt;td&gt;&amp;hellip;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-查看词表&#34;&gt;3.3 查看词表&lt;/h3&gt;
&lt;p&gt;因为词表有 &lt;strong&gt;&lt;em&gt;1050245&lt;/em&gt;&lt;/strong&gt; 个词， 为了方便，这里只显示前 20 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# 词表是有顺序的(按词频从高到低排列)
list(w2v.index_to_key)[:20]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;问题&amp;#39;,
 &amp;#39;进行&amp;#39;,
 &amp;#39;您好&amp;#39;,
 &amp;#39;工作&amp;#39;,
 &amp;#39;小区&amp;#39;,
 &amp;#39;反映&amp;#39;,
 &amp;#39;领导&amp;#39;,
 &amp;#39;情况&amp;#39;,
 &amp;#39;相关&amp;#39;,
 &amp;#39;留言&amp;#39;,
 &amp;#39;没有&amp;#39;,
 &amp;#39;感谢您&amp;#39;,
 &amp;#39;网友&amp;#39;,
 &amp;#39;业主&amp;#39;,
 &amp;#39;办理&amp;#39;,
 &amp;#39;公司&amp;#39;,
 &amp;#39;建设&amp;#39;,
 &amp;#39;回复&amp;#39;,
 &amp;#39;支持&amp;#39;,
 &amp;#39;部门&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;查看词表映射&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;w2v.key_to_index
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;问题&amp;#39;: 0,
 &amp;#39;进行&amp;#39;: 1,
 &amp;#39;您好&amp;#39;: 2,
 &amp;#39;工作&amp;#39;: 3,
 &amp;#39;小区&amp;#39;: 4,
 &amp;#39;反映&amp;#39;: 5,
 &amp;#39;领导&amp;#39;: 6,
 ...
  &amp;#39;连续&amp;#39;: 995,
 &amp;#39;稳定&amp;#39;: 996,
 &amp;#39;市住建局&amp;#39;: 997,
 &amp;#39;降低&amp;#39;: 998,
 &amp;#39;会同&amp;#39;: 999,
 ...}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-获取某词的向量&#34;&gt;3.4 获取某词的向量&lt;/h3&gt;
&lt;p&gt;查找某词对应的词向量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# w2v[&amp;#39;问题&amp;#39;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;问题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([-6.2813835 ,  1.5916584 , -0.48086444, -2.6446412 , 10.031776  ,
       -0.11915778, -5.039283  , -2.1107564 ,  1.1351422 , -2.881387  ,
        4.2890835 , -1.1337206 ,  3.7850847 , -3.640467  , -0.96282107,
        ...
        ...
        1.1314462 , -2.5386178 , -2.3993561 , -2.0407596 ,  0.95457   ,
        3.03732   , -2.033116  , -0.20390491,  3.5368073 ,  6.5452943 ,
        2.1186016 ,  0.79572505,  2.5855987 ,  0.88565165, -1.812104  ],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;受限于篇幅，这里显示词向量的部分数值。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;需要注意，如果查询的词不存在于模型词表，则会出现报错。例如&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;word = &amp;#39;这是一个不存在的词&amp;#39;
w2v.get_vector(word)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[130], line 2
      1 word = &amp;#39;这是一个不存在的词&amp;#39;
----&amp;gt; 2 w2v.get_vector(word)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:446, in KeyedVectors.get_vector(self, key, norm)
    422 def get_vector(self, key, norm=False):
    423     &amp;#34;&amp;#34;&amp;#34;Get the key&amp;#39;s vector, as a 1D numpy array.
    424
    425     Parameters
   (...)
    444
    445     &amp;#34;&amp;#34;&amp;#34;
--&amp;gt; 446     index = self.get_index(key)
    447     if norm:
    448         self.fill_norms()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:420, in KeyedVectors.get_index(self, key, default)
    418     return default
    419 else:
--&amp;gt; 420     raise KeyError(f&amp;#34;Key &amp;#39;{key}&amp;#39; not present&amp;#34;)

KeyError: &amp;#34;Key &amp;#39;这是一个不存在的词&amp;#39; not present&amp;#34;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
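为避免这种 KeyError，可以在查询前先用 <code>key_to_index</code> 做成员检查。下面用一个极简的替身类演示这一模式(<code>TinyKV</code>、<code>safe_vector</code> 均为假设性名称，仅演示查询逻辑，不代表 gensim 内部实现):

```python
class TinyKV:
    """极简的 KeyedVectors 替身, 仅用于演示查询逻辑(假设性示例类)"""
    def __init__(self, vocab, dim=4):
        self.key_to_index = {w: i for i, w in enumerate(vocab)}
        # 用固定数值代替真实词向量
        self.vectors = [[float(i)] * dim for i in range(len(vocab))]

    def get_vector(self, word):
        return self.vectors[self.key_to_index[word]]

def safe_vector(kv, word):
    # 先做成员检查; 词不在词表时返回 None, 而不是抛出 KeyError
    if word in kv.key_to_index:
        return kv.get_vector(word)
    return None

kv = TinyKV(['问题', '情况'])
assert safe_vector(kv, '问题') == [0.0, 0.0, 0.0, 0.0]
assert safe_vector(kv, '这是一个不存在的词') is None
```

同样的 `word in w2v.key_to_index` 检查也适用于真实的 gensim KeyedVectors 对象。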
&lt;h3 id=&#34;35-近义词&#34;&gt;3.5 近义词&lt;/h3&gt;
&lt;p&gt;根据词语查询近义词，返回最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;问题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;情况&amp;#39;, 0.6178732514381409),
 (&amp;#39;现象&amp;#39;, 0.5385990142822266),
 (&amp;#39;此类情况&amp;#39;, 0.418301522731781),
 (&amp;#39;留言&amp;#39;, 0.4179410934448242),
 (&amp;#39;一事&amp;#39;, 0.40703579783439636),
 (&amp;#39;事项&amp;#39;, 0.39551448822021484),
 (&amp;#39;事情&amp;#39;, 0.3860214948654175),
 (&amp;#39;情形&amp;#39;, 0.38478103280067444),
 (&amp;#39;事件&amp;#39;, 0.36725184321403503),
 (&amp;#39;现像&amp;#39;, 0.3665226995944977)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;根据语义向量查询近义词，返回最相似的 10 个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;question_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;问题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;问题&amp;#39;, 1.0),
 (&amp;#39;情况&amp;#39;, 0.6178732514381409),
 (&amp;#39;现象&amp;#39;, 0.5385990142822266),
 (&amp;#39;此类情况&amp;#39;, 0.4183014929294586),
 (&amp;#39;留言&amp;#39;, 0.4179410934448242),
 (&amp;#39;一事&amp;#39;, 0.40703579783439636),
 (&amp;#39;事项&amp;#39;, 0.39551448822021484),
 (&amp;#39;事情&amp;#39;, 0.3860214948654175),
 (&amp;#39;情形&amp;#39;, 0.38478103280067444),
 (&amp;#39;事件&amp;#39;, 0.36725184321403503)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;36-计算多个词的中心向量&#34;&gt;3.6 计算多个词的中心向量&lt;/h3&gt;
&lt;p&gt;我们可以计算「经济」、「建设」、「发展」的中心向量 eco_vector， 并寻找与中心向量 eco_vector 语义最相似的 10 个词。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;eco_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;semantic_centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                  &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;经济&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;建设&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;发展&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 寻找 eco_vector 语义最相似的10个词&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eco_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;发展&amp;#39;, 0.8317984938621521),
 (&amp;#39;建设&amp;#39;, 0.7508440613746643),
 (&amp;#39;经济&amp;#39;, 0.6406075954437256),
 (&amp;#39;经济社会发展&amp;#39;, 0.6385446786880493),
 (&amp;#39;发展壮大&amp;#39;, 0.6317417621612549),
 (&amp;#39;化发展&amp;#39;, 0.5961641073226929),
 (&amp;#39;大力发展&amp;#39;, 0.585274338722229),
 (&amp;#39;经济腾飞&amp;#39;, 0.5823679566383362),
 (&amp;#39;产业&amp;#39;, 0.5820372700691223),
 (&amp;#39;高质量发展&amp;#39;, 0.5803337097167969)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;语义捕捉得很准。&lt;/p&gt;
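semantic_centroid 的大致思路是: 先将各词向量做 L2 归一化，再逐维求均值。下面用纯 Python 勾勒这一计算(示意实现，假设 cntext 内部采用类似做法，并非其源码):

```python
from math import sqrt

def l2_normalize(v):
    # 向量除以自身的 L2 范数, 长度归一为 1
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid_sketch(vectors):
    # 各词向量先归一化再逐维取均值, 避免个别长向量主导结果
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    return [sum(u[d] for u in unit) / len(unit) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# 两个正交的玩具向量, 中心向量到二者的余弦相似度应相等
c = centroid_sketch([[1.0, 0.0], [0.0, 1.0]])
assert abs(cosine(c, [1.0, 0.0]) - cosine(c, [0.0, 1.0])) < 1e-9
```

先归一化再求均值，可以让「经济」「建设」「发展」三个词对中心向量的贡献相等。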
&lt;br&gt;
&lt;h3 id=&#34;37-概念轴&#34;&gt;3.7 概念轴&lt;/h3&gt;
&lt;p&gt;男性概念向量由多个男性词的向量加总求均值得到，女性概念向量算法类似。当性质或方向明显相反的两个概念向量相减， 得到的新向量，我们可以称之为&lt;strong&gt;&lt;em&gt;概念轴向量 Concept Axis&lt;/em&gt;&lt;/strong&gt;。常见的概念轴有:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 尺寸(大, 小)
- 湿度(干燥,潮湿)
- 性别(男, 女)
- 财富(富裕, 贫穷)
- 等
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;其实任意概念的向量也可看作概念轴，相当于该概念向量与零向量相减。只不过由两组性质(方向)相反的词语构造出的概念轴，在语义上更稳定。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 数值越大，表示越接近 poswords(寒冷)一端；数值越小，越接近 negwords(炎热)一端。&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sematic_projection&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;杭州&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;哈尔滨&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;广州&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;潍坊&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;poswords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;寒冷&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;冰雪&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;negwords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;炎热&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;酷暑&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                     &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;杭州&amp;#39;, -2.52), (&amp;#39;广州&amp;#39;, -2.06), (&amp;#39;潍坊&amp;#39;, 2.18), (&amp;#39;哈尔滨&amp;#39;, 2.78)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;人民网留言板语料中肯定还蕴藏着丰富的语义信息，只是大邓一时词穷，想不出还有哪些词可以继续探索。&lt;/p&gt;
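概念轴投影的计算可以用纯 Python 勾勒如下: 概念轴 = 正向词中心向量 - 负向词中心向量，投影值为词向量在归一化概念轴上的点积(示意实现，假设 cntext 的投影逻辑与此类似，并非其源码；向量取值均为演示用的假想数据):

```python
from math import sqrt

def mean_vec(vectors):
    # 逐维求均值, 得到一组词的中心向量
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def semantic_projection_sketch(word_vec, pos_vecs, neg_vecs):
    # 概念轴 = 正向概念中心 - 负向概念中心, 再做 L2 归一化
    pos_c, neg_c = mean_vec(pos_vecs), mean_vec(neg_vecs)
    axis = [p - n for p, n in zip(pos_c, neg_c)]
    norm = sqrt(sum(x * x for x in axis))
    axis = [x / norm for x in axis]
    # 词向量在概念轴上的点积投影, 越大越接近正向(寒冷)一端
    return sum(w * a for w, a in zip(word_vec, axis))

cold_vecs = [[1.0, 0.0]]    # 假想的「寒冷」「冰雪」向量
hot_vecs = [[-1.0, 0.0]]    # 假想的「炎热」「酷暑」向量
assert semantic_projection_sketch([2.0, 1.0], cold_vecs, hot_vecs) > 0   # 偏寒冷
assert semantic_projection_sketch([-2.0, 1.0], cold_vecs, hot_vecs) < 0  # 偏炎热
```

哈尔滨、潍坊的投影值为正而杭州、广州为负，对应的正是这种在「寒冷-炎热」轴上的坐标。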
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四-相关&#34;&gt;四、 相关&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
王磊,易扬.公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J].公共管理学报,2022,19(04):65-78+169.
Lu, Liangdong, Jia Xu, and Jiuchang Wei. &amp;#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&amp;#34; Telematics and Informatics 83 (2023): 102028.
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/&#34;&gt;词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/&#34;&gt;OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/&#34;&gt;数据集 | 人民网地方领导留言板原始文本(2011-2023.12)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/&#34;&gt;实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/&#34;&gt;可视化 | 人民日报语料反映七十年文化演变&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五获取资料&#34;&gt;五、获取资料&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 免费    留言板-Word2Vec.200.15.bin 链接: https://pan.baidu.com/s/12H-kh6guBWtDqpIFTXov0w?pwd=x2dt 提取码: x2dt

- 加大邓 WeChat: 372335839， 备注「姓名-学校-专业」， 100元领取 cntext-2.1.6-py3-none-any.whl 文件
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;六使用说明&#34;&gt;六、使用说明&lt;/h2&gt;
&lt;p&gt;如研究中用到该词向量或 cntext2.x， 请声明出处。&lt;/p&gt;
&lt;h3 id=&#34;apalike&#34;&gt;apalike&lt;/h3&gt;
&lt;p&gt;Deng, X., &amp;amp; Nan, P. (2022). &lt;strong&gt;cntext: a Python tool for text mining&lt;/strong&gt; [Computer software]. Zenodo. &lt;a href=&#34;https://doi.org/10.5281/zenodo.7063523&#34;&gt;https://doi.org/10.5281/zenodo.7063523&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Source Code URL: &lt;a href=&#34;https://github.com/hiDaDeng/cntext&#34;&gt;https://github.com/hiDaDeng/cntext&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;bibtex&#34;&gt;bibtex&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;@misc{deng2022cntext,
  author       = {Deng, X. and Nan, P.},
  title        = {cntext: a Python tool for text mining},
  year         = {2022},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7063523},
  url          = {https://doi.org/10.5281/zenodo.7063523},
  howpublished = {[Computer software]},
  note         = {Source Code URL: \url{https://github.com/hiDaDeng/cntext}}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;endnote&#34;&gt;endnote&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;%0 Generic
%A Deng, X.
%A Nan, P.
%T cntext: a Python tool for text mining
%Y [Computer software]
%D 2022
%I Zenodo
%R 10.5281/zenodo.7063523
%U https://doi.org/10.5281/zenodo.7063523
%Z Source Code URL: https://github.com/hiDaDeng/cntext
%@
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
<content:encoded><![CDATA[<p>本文使用 4.62G 语料训练得到词汇量约 105w 的 Word2Vec 模型，使用该模型，可以寻找近义词、扩展(构建)概念词典。 <strong>该 Word2Vec 模型文件可在文末免费下载</strong></p>
<p><br><br></p>
<h2 id="一构建语料">一、构建语料</h2>
<p>使用 <a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/"><strong>数据集 | 人民网地方领导留言板原始文本(2011-2023.12)</strong></a> 来构建本文的语料。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">re</span>

<span class="n">df1</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2011-2019.csv.gzip&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;留言标题&#39;</span><span class="p">,</span> <span class="s1">&#39;留言内容&#39;</span><span class="p">,</span> <span class="s1">&#39;回复内容&#39;</span><span class="p">],</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2020-2023.csv.gzip&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;留言标题&#39;</span><span class="p">,</span> <span class="s1">&#39;留言内容&#39;</span><span class="p">,</span> <span class="s1">&#39;回复内容&#39;</span><span class="p">],</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df1</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df1</span><span class="p">[</span><span class="s1">&#39;留言标题&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">df1</span><span class="p">[</span><span class="s1">&#39;留言内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">df1</span><span class="p">[</span><span class="s1">&#39;回复内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">df2</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;留言标题&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;留言内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="o">+</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;回复内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df1</span><span class="p">,</span> <span class="n">df2</span><span class="p">],</span> <span class="n">ignore_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>


<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;留言板.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">])</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>最终得到 4.62 G 的 <strong>留言板.txt</strong> 。</p>
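<p>上面的写法会先用 <code>'\n'.join</code> 把全部文本拼成一个数 G 的大字符串再写盘，内存紧张时可以改成逐条写出。下面是一个最小示意，其中 <code>write_corpus</code> 为演示自拟的函数名，并非 cntext 提供的 API:</p>

```python
import io

# 示意代码: 逐条写出文本，避免一次性拼接出一个数 G 的大字符串
# write_corpus 为演示自拟的函数名，并非 cntext 提供的 API
def write_corpus(texts, f):
    for t in texts:
        # 每条留言占一行，文本内部的换行替换为空格
        f.write(str(t).replace('\n', ' ') + '\n')

# 小例子: 实际使用时传入 df['text'] 与打开的文件对象即可
buf = io.StringIO()
write_corpus(['留言A\n第二行', '回复B'], buf)
```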
<p><br><br></p>
<h2 id="二训练模型">二、训练模型</h2>
<h3 id="21-配置-cntext">2.1 配置 cntext</h3>
<p>将 <strong>cntext-2.1.6-py3-none-any.whl</strong> 放置于桌面， 打开 <strong>命令行 cmd</strong> (苹果电脑 terminal)，依次执行以下命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip3 install cntext-2.1.6-py3-none-any.whl
</code></pre></div><p>cntext2.x 是付费未公开版本， 100 元，如有需要可加微信 372335839 ，备注「姓名-学校-专业」。</p>
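<p>安装完成后，可以用下面的小脚本确认 Python 能否找到 cntext(示意代码，仅依赖标准库):</p>

```python
import importlib.util

# 示意代码: 检查 cntext 是否安装成功; find_spec 找不到模块时返回 None
spec = importlib.util.find_spec('cntext')
print('cntext 已安装' if spec is not None else 'cntext 未安装')
```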
<br>
<h3 id="22-训练-word2vec">2.2 训练 Word2Vec</h3>
<p>训练 word2vec 的代码已封装进 cntext2， 只需几行代码。 大邓的训练环境为 Mac，96G 内存、12 核。 代码对硬件要求不高， 16G 内存也跑得动，只是速度会慢一些。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 大邓Mac 96G内存， 12核使用的代码。</span>
<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;留言板.txt&#39;</span><span class="p">,</span>
                  <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                  <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                  <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                  <span class="n">chunksize</span><span class="o">=</span><span class="mi">100000</span><span class="p">,</span>
                  <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

<span class="c1"># 考虑到大家电脑普遍8G、16G内存，保守的训练代码</span>
<span class="c1"># w2v = ct.Word2Vec(corpus_file=&#39;留言板.txt&#39;,</span>
<span class="c1">#                  vector_size=200,</span>
<span class="c1">#                  window_size=15,</span>
<span class="c1">#                  lang=&#39;chinese&#39;,</span>
<span class="c1">#                  chunksize=10000,</span>
<span class="c1">#                  min_count=5)</span>

<span class="n">w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/留言板_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/留言板_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2692 s.
Output Saved To: output/留言板-Word2Vec.200.15.bin
</code></pre></div><p>使用 4.62 G 的 <strong><em>留言板.txt</em></strong> ，训练了 2692 秒， 约 40 分钟。 在 <strong><em>Python</em></strong> 代码文件所在的文件夹内，出现了 <strong><em>output</em></strong> 文件夹，打开可以看到:</p>
<ul>
<li><strong><em>留言板_cache.txt</em></strong> 语料处理后的缓存文件</li>
<li><strong><em>留言板-Word2Vec.200.15.bin</em></strong> 训练好的模型文件</li>
</ul>
<br>
<h3 id="23-评估模型">2.3 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   426    |    111     |            0.45            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   238    |    439     |   19.33    |   2.74   |
|   CityInProvince   |   175    |     0      |   100.00   |   1.01   |
| FamilyRelationship |   272    |     0      |   61.40    |   1.96   |
|   SocialScience    |    10    |     60     |   20.00    |   1.50   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient 取值范围为 [-1, 1]， 取值越大， 说明模型表现越好。</p>
<br>
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries 留言板语料在此项表现较差， 应该是语料中各国首都的提及较少。</li>
<li>CityInProvince 留言板语料在此项表现如此优异，应该是语料中省份、省会地域词经常出现。</li>
<li>FamilyRelationship 留言板中应该少不了家长里短， 所以此项准确率还可以。 以<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">年报 MD&amp;A</a>为例，此处准确率只有 10%, 而<a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">豆瓣影评</a>该处准确率高达 92.65%。</li>
<li>SocialScience 留言板语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。</li>
</ul>
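<p>类比测试的大致原理可以用一个玩具例子说明(示意实现，并非 cntext 源码): 已知 a:b :: c:d，检查 vec(b) - vec(a) + vec(c) 的最近邻是否恰好是 d。</p>

```python
import numpy as np

# 类比测试原理示意(玩具实现，非 cntext 源码):
# 已知 a:b :: c:d，检查 vec(b) - vec(a) + vec(c) 的最近邻是否为 d
def analogy_hit(vocab, a, b, c, d):
    target = vocab[b] - vocab[a] + vocab[c]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # 排除 a、b、c 自身后，取与 target 余弦相似度最高的词
    best = max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: cos(vocab[w], target))
    return best == d

# 玩具词表: "首都" = "国家"向量 + 同一个首都方向的偏移 [0, 1, 0]
vocab = {
    '中国': np.array([1.0, 0.0, 1.0]),
    '北京': np.array([1.0, 1.0, 1.0]),
    '法国': np.array([0.0, 1.0, 1.0]),
    '巴黎': np.array([0.0, 2.0, 1.0]),
    '上海': np.array([1.0, 0.5, 1.0]),
}
```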
<p>整体而言，模型训练效果很不错，抓住了数据场景特有的语义。</p>
<p><br><br></p>
<h2 id="三使用模型">三、使用模型</h2>
<h3 id="31-读取模型">3.1 读取模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/留言板-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;维度数:&#39;</span><span class="p">,</span> <span class="n">w2v</span><span class="o">.</span><span class="n">vector_size</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;词汇量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">w2v</span><span class="p">))</span>
<span class="n">w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/留言板-Word2Vec.200.15.bin...
维度数: 200
词汇量:  1050245
&lt;gensim.models.keyedvectors.KeyedVectors at 0x328d737a0&gt;
</code></pre></div><br>
<h3 id="32-keyedvectors-的操作方法或属性">3.2 KeyedVectors 的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>KeyedVectors.index_to_key</em></strong></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.key_to_index</em></strong></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.vector_size</em></strong></td>
<td>获取模型中词向量的维度。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.get_vector(word)</em></strong></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_word(word, topn=10)</em></strong></td>
<td>获取某词语最相似的 10 个近义词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_vector(vector, topn=10)</em></strong></td>
<td>获取词向量最相似的 10 个近义词。</td>
</tr>
<tr>
<td>&hellip;</td>
<td>&hellip;</td>
</tr>
</tbody>
</table>
<br>
<h3 id="33-查看词表">3.3 查看词表</h3>
<p>因为词表有 <strong><em>1050245</em></strong> 个词， 为了方便，这里只显示前 20 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># 词表带顺序的
list(w2v.index_to_key)[:20]
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;问题&#39;,
 &#39;进行&#39;,
 &#39;您好&#39;,
 &#39;工作&#39;,
 &#39;小区&#39;,
 &#39;反映&#39;,
 &#39;领导&#39;,
 &#39;情况&#39;,
 &#39;相关&#39;,
 &#39;留言&#39;,
 &#39;没有&#39;,
 &#39;感谢您&#39;,
 &#39;网友&#39;,
 &#39;业主&#39;,
 &#39;办理&#39;,
 &#39;公司&#39;,
 &#39;建设&#39;,
 &#39;回复&#39;,
 &#39;支持&#39;,
 &#39;部门&#39;]
</code></pre></div><br>
<p>查看词表映射</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">w2v.key_to_index
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;问题&#39;: 0,
 &#39;进行&#39;: 1,
 &#39;您好&#39;: 2,
 &#39;工作&#39;: 3,
 &#39;小区&#39;: 4,
 &#39;反映&#39;: 5,
 &#39;领导&#39;: 6,
 ...
  &#39;连续&#39;: 995,
 &#39;稳定&#39;: 996,
 &#39;市住建局&#39;: 997,
 &#39;降低&#39;: 998,
 &#39;会同&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="34-获取某词的向量">3.4 获取某词的向量</h3>
<p>查找某词对应的词向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># w2v[&#39;问题&#39;]</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-6.2813835 ,  1.5916584 , -0.48086444, -2.6446412 , 10.031776  ,
       -0.11915778, -5.039283  , -2.1107564 ,  1.1351422 , -2.881387  ,
        4.2890835 , -1.1337206 ,  3.7850847 , -3.640467  , -0.96282107,
        ...
        ...
        1.1314462 , -2.5386178 , -2.3993561 , -2.0407596 ,  0.95457   ,
        3.03732   , -2.033116  , -0.20390491,  3.5368073 ,  6.5452943 ,
        2.1186016 ,  0.79572505,  2.5855987 ,  0.88565165, -1.812104  ],
      dtype=float32)
</code></pre></div><p>受限于篇幅，这里显示词向量的部分数值。</p>
<br>
<p>需要注意，如果查询的词不存在于模型词表，则会出现报错。例如</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">word = &#39;这是一个不存在的词&#39;
w2v.get_vector(word)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[130], line 2
      1 word = &#39;这是一个不存在的词&#39;
----&gt; 2 w2v.get_vector(word)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:446, in KeyedVectors.get_vector(self, key, norm)
    422 def get_vector(self, key, norm=False):
    423     &#34;&#34;&#34;Get the key&#39;s vector, as a 1D numpy array.
    424
    425     Parameters
   (...)
    444
    445     &#34;&#34;&#34;
--&gt; 446     index = self.get_index(key)
    447     if norm:
    448         self.fill_norms()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gensim/models/keyedvectors.py:420, in KeyedVectors.get_index(self, key, default)
    418     return default
    419 else:
--&gt; 420     raise KeyError(f&#34;Key &#39;{key}&#39; not present&#34;)

KeyError: &#34;Key &#39;这是一个不存在的词&#39; not present&#34;

</code></pre></div><br>
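<p>实际使用中，可以在查询前先判断词是否在词表内，避免上面的 KeyError。下面的 <code>safe_vector</code> 为演示自拟的帮助函数:</p>

```python
# 示意代码: 查询前先判断词是否在词表中，避免 KeyError
# safe_vector 为演示自拟的帮助函数; kv 可以是 gensim 的 KeyedVectors，
# 也可以是任何带 key_to_index 属性和 get_vector 方法的对象
def safe_vector(kv, word, default=None):
    if word in kv.key_to_index:
        return kv.get_vector(word)
    return default
```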
<h3 id="35-近义词">3.5 近义词</h3>
<p>根据词语查询近义词，返回最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;情况&#39;, 0.6178732514381409),
 (&#39;现象&#39;, 0.5385990142822266),
 (&#39;此类情况&#39;, 0.418301522731781),
 (&#39;留言&#39;, 0.4179410934448242),
 (&#39;一事&#39;, 0.40703579783439636),
 (&#39;事项&#39;, 0.39551448822021484),
 (&#39;事情&#39;, 0.3860214948654175),
 (&#39;情形&#39;, 0.38478103280067444),
 (&#39;事件&#39;, 0.36725184321403503),
 (&#39;现像&#39;, 0.3665226995944977)]
</code></pre></div><br>
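<p>similar_by_word 返回的分值即两个词向量的余弦相似度，可以用 numpy 自行复算(示意代码):</p>

```python
import numpy as np

# 余弦相似度: similar_by_word 返回的分值即词向量间的余弦相似度
def cosine(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine([1, 2], [2, 4]))  # 方向相同，约为 1.0
print(cosine([1, 0], [0, 1]))  # 正交，为 0.0
```

<p>例如 <code>cosine(w2v.get_vector('问题'), w2v.get_vector('情况'))</code> 应能复算出上面 0.6178… 的分值。</p>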
<p>根据语义向量查询近义词，返回最相似的 10 个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">question_vector</span> <span class="o">=</span> <span class="n">w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;问题&#39;</span><span class="p">)</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">question_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;问题&#39;, 1.0),
 (&#39;情况&#39;, 0.6178732514381409),
 (&#39;现象&#39;, 0.5385990142822266),
 (&#39;此类情况&#39;, 0.4183014929294586),
 (&#39;留言&#39;, 0.4179410934448242),
 (&#39;一事&#39;, 0.40703579783439636),
 (&#39;事项&#39;, 0.39551448822021484),
 (&#39;事情&#39;, 0.3860214948654175),
 (&#39;情形&#39;, 0.38478103280067444),
 (&#39;事件&#39;, 0.36725184321403503)]
</code></pre></div><br>
<h3 id="36-计算多个词的中心向量">3.6 计算多个词的中心向量</h3>
<p>我们可以计算「经济」、「建设」、「发展」的中心向量 eco_vector， 并寻找与 eco_vector 最相似的 10 个词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">eco_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span>
                                  <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;经济&#39;</span><span class="p">,</span> <span class="s1">&#39;建设&#39;</span><span class="p">,</span> <span class="s1">&#39;发展&#39;</span><span class="p">])</span>


<span class="c1"># 寻找 eco_vector 语义最相似的10个词</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">eco_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;发展&#39;, 0.8317984938621521),
 (&#39;建设&#39;, 0.7508440613746643),
 (&#39;经济&#39;, 0.6406075954437256),
 (&#39;经济社会发展&#39;, 0.6385446786880493),
 (&#39;发展壮大&#39;, 0.6317417621612549),
 (&#39;化发展&#39;, 0.5961641073226929),
 (&#39;大力发展&#39;, 0.585274338722229),
 (&#39;经济腾飞&#39;, 0.5823679566383362),
 (&#39;产业&#39;, 0.5820372700691223),
 (&#39;高质量发展&#39;, 0.5803337097167969)]
</code></pre></div><p>语义捕捉的很准。</p>
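<p>ct.semantic_centroid 的大致思路(示意实现，细节以 cntext 文档为准)是先对每个词向量做 L2 归一化，再逐维取平均:</p>

```python
import numpy as np

# 概念中心向量的常见算法(示意实现，细节以 cntext 文档为准):
# 先对每个词向量做 L2 归一化，再逐维取平均
def centroid(vectors):
    unit = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in vectors]
    return np.mean(unit, axis=0)

# [3,0] 与 [0,4] 归一化后为 [1,0] 与 [0,1]，均值为 [0.5, 0.5]
c = centroid([[3, 0], [0, 4]])
```

<p>归一化这一步让每个词对中心向量的贡献相同，不会被模长大的词向量主导。</p>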
<br>
<h3 id="37-概念轴">3.7 概念轴</h3>
<p>男性概念向量由多个男性词的向量加总求均值得到，女性概念向量算法类似。当性质或方向明显相反的两个概念向量相减时， 得到的新向量，我们可以称之为 <strong><em>概念轴向量 Concept Axis</em></strong>。常见的概念轴有:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 尺寸(大, 小)
- 湿度(干燥,潮湿)
- 性别(男, 女)
- 财富(富裕, 贫穷)
- 等
</code></pre></div><p>其实任意概念的向量也可看作概念轴，即该概念向量与 0 向量相减。只不过由两组性质相反的概念词相减得到的概念轴， 在语义上更稳定。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>


<span class="c1"># 数值越大，表示越接近 poswords 一端，越寒冷。</span>
<span class="n">ct</span><span class="o">.</span><span class="n">sematic_projection</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span>
                     <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;杭州&#39;</span><span class="p">,</span> <span class="s1">&#39;哈尔滨&#39;</span><span class="p">,</span> <span class="s1">&#39;广州&#39;</span><span class="p">,</span> <span class="s1">&#39;潍坊&#39;</span><span class="p">],</span>
                     <span class="n">poswords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;寒冷&#39;</span><span class="p">,</span> <span class="s1">&#39;冰雪&#39;</span><span class="p">],</span>
                     <span class="n">negwords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;炎热&#39;</span><span class="p">,</span> <span class="s1">&#39;酷暑&#39;</span><span class="p">],</span>
                     <span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;杭州&#39;, -2.52), (&#39;广州&#39;, -2.06), (&#39;潍坊&#39;, 2.18), (&#39;哈尔滨&#39;, 2.78)]
</code></pre></div><p>人民网留言板中肯定还蕴藏着丰富的语义信息，只是大邓一时词穷，想不出还有啥词可以探索。</p>
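<p>语义投影的基本原理(玩具实现，非 cntext 源码): 概念轴 = 正向词中心向量减去负向词中心向量，词向量在该轴上的投影(点积)越大，越靠近正向一端:</p>

```python
import numpy as np

# 语义投影原理示意(玩具实现，非 cntext 源码):
# 概念轴 = 正向词中心向量 - 负向词中心向量
# 词向量与单位化概念轴的点积即投影得分
def project(word_vec, pos_vecs, neg_vecs):
    axis = (np.mean(np.asarray(pos_vecs, dtype=float), axis=0)
            - np.mean(np.asarray(neg_vecs, dtype=float), axis=0))
    axis = axis / np.linalg.norm(axis)
    return float(np.asarray(word_vec, dtype=float) @ axis)

# 玩具例子: 概念轴由 [1,0](正向端) 与 [-1,0](负向端) 构成，即 x 轴方向
score = project([2.0, 1.0], [[1.0, 0.0]], [[-1.0, 0.0]])
```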
<p><br><br></p>
<h2 id="四-相关">四、 相关</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
王磊,易扬.公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J].公共管理学报,2022,19(04):65-78+169.
Lu, Liangdong, Jia Xu, and Jiuchang Wei. &#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&#34; Telematics and Informatics 83 (2023): 102028.
...
</code></pre></div><br>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总</a></li>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
<li><a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/">数据集 | 人民网地方领导留言板原始文本(2011-2023.12)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a></li>
</ul>
<p><br><br></p>
<h2 id="五获取资料">五、获取资料</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费    留言板-Word2Vec.200.15.bin 链接: https://pan.baidu.com/s/12H-kh6guBWtDqpIFTXov0w?pwd=x2dt 提取码: x2dt

- 加大邓 WeChat: 372335839， 备注「姓名-学校-专业」， 100元领取 cntext-2.1.6-py3-none-any.whl 文件
</code></pre></div><p><br><br></p>
<h2 id="六使用说明">六、使用说明</h2>
<p>如研究中用到该词向量或 cntext2.x， 请声明出处。</p>
<h3 id="apalike">apalike</h3>
<p>Deng, X., &amp; Nan, P. (2022). <strong>cntext: a Python tool for text mining</strong> [Computer software]. Zenodo. <a href="https://doi.org/10.5281/zenodo.7063523">https://doi.org/10.5281/zenodo.7063523</a></p>
<p>Source Code URL: <a href="https://github.com/hiDaDeng/cntext">https://github.com/hiDaDeng/cntext</a></p>
<br>
<h3 id="bibtex">bibtex</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">@misc{deng2022cntext,
  author       = {Deng, X. and Nan, P.},
  title        = {cntext: a Python tool for text mining},
  year         = {2022},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7063523},
  url          = {https://doi.org/10.5281/zenodo.7063523},
  howpublished = {[Computer software]},
  note         = {Source Code URL: \url{https://github.com/hiDaDeng/cntext}}
}
</code></pre></div><br>
<h3 id="endnote">endnote</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">%0 Generic
%A Deng, X.
%A Nan, P.
%T cntext: a Python tool for text mining
%Y [Computer software]
%D 2022
%I Zenodo
%R 10.5281/zenodo.7063523
%U https://doi.org/10.5281/zenodo.7063523
%Z Source Code URL: https://github.com/hiDaDeng/cntext
%@
</code></pre></div>]]></content:encoded>
    </item>
    
    <item>
      <title>词向量 | 使用1985年-2025年专利申请摘要训练 Word2Vec 模型</title>
      <link>https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/&#34;&gt;&lt;strong&gt;5112万条专利申请数据集(1985-2025年)&lt;/strong&gt;&lt;/a&gt; 中随机抽取了30%的 「&lt;strong&gt;专利摘要&lt;/strong&gt;」，构成6.14G的训练语料(千万级别)， 耗时6小时，训练得到word2vec模型。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;需要注意， 100%全部语料有30+G， 训练时间非常长。&lt;/p&gt;
&lt;p&gt;没办法，我不会优化代码性能，所以只能抽取 30% 的文本数据来训练word2vec ，语料体积大概10G。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;p&gt;本文需要用到新版 cntext。因为改动较多，若直接上传到 PyPi，将导致之前制作的课程和公众号推文相关内容全部需要重新制作一遍。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一语料构建&#34;&gt;一、语料构建&lt;/h2&gt;
&lt;p&gt;随机抽取30%的记录，构成千万专利文本摘要训练语料。&lt;/p&gt;
&lt;p&gt;为了防止电脑内存爆炸， 对任意单个大csv文件，分批次读取，每次读10w行。最终将专利摘要文本保存到txt文件中，编码方式为utf-8。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;如果想开发一些词典，可以跳过此部分内容，并不影响代码运行。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/screen-datasets.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 将代码放在csv数据文件夹内&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利摘要.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 分批次读取大 csv 文件，每次读取10万行&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中国专利数据库.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;摘要文本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;# 每个批次随机抽取30%的记录&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;sample_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frac&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;# 剔除专利摘要为空的记录&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;raw_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sample_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;摘要文本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;raw_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;最终得到的 &lt;strong&gt;专利摘要.txt&lt;/strong&gt;  文件有 10G&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二训练word2vec&#34;&gt;二、训练word2vec&lt;/h2&gt;
&lt;h3 id=&#34;21-安装&#34;&gt;2.1 安装&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install cntext --upgrade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-训练-word2vec&#34;&gt;2.2 训练 Word2Vec&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# cntext为2.1.6&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利摘要.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 词向量维度&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;# 窗口大小&lt;/span&gt;
                        &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 每次读取10000行&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/专利摘要_cache.txt Not Found or Empty, Preprocessing Corpus

Reading Preprocessed Corpus from output/专利摘要_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 8816 s. 
Output Saved To: output/专利摘要-Word2Vec.200.15.bin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;整个训练过程约 2.5 小时，训练结束后得到 &lt;em&gt;&lt;strong&gt;output&lt;/strong&gt;&lt;/em&gt; 文件夹，里面有&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;output/专利摘要-Word2Vec.200.15.bin&lt;/strong&gt;&lt;/em&gt;  模型文件&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;专利摘要_cache.txt&lt;/strong&gt;&lt;/em&gt;                   训练缓存文件&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-评估模型&#34;&gt;2.3 评估模型&lt;/h3&gt;
&lt;p&gt;使用近义法和类比法， 判断模型的表现。详情可查看&lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;文档&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   427    |    110     |            0.46            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&amp;lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   238    |    439     |    3.78    |   5.67   |
|   CityInProvince   |   175    |     0      |   25.14    |   4.48   |
| FamilyRelationship |   156    |    116     |   33.33    |   2.29   |
|   SocialScience    |    8     |     62     |   37.50    |   2.33   |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;近义测试&lt;/strong&gt;: Spearman&amp;rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型表现越好。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;类比测试&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries   专利语料在此项表现很差，应该是语料中常见国家首都的提及较少。&lt;/li&gt;
&lt;li&gt;CityInProvince       专利语料在此项好于 CapitalOfCountries，毕竟这些专利大多产生于中国各省市。&lt;/li&gt;
&lt;li&gt;FamilyRelationship   没想到专利语料在此项的准确率显著大于 0。我原本以为会是 0，毕竟家庭关系类词语在专利摘要中不太「技术」。可能有些发明类似电影《非诚勿扰》里解决人类情感问题的发明，很雷人。&lt;/li&gt;
&lt;li&gt;SocialScience        专利语料在此项表现一般，应该是语料中常见的社会科学词语提及较少。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;整体而言，模型在这四个维度上的准确率都较低。但需要说明，这四个维度是大邓自己收集的；评判模型类比表现的维度有很多，专利摘要模型有可能在别的类比维度上表现很好。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三使用词向量&#34;&gt;三、使用词向量&lt;/h2&gt;
&lt;h3 id=&#34;31-录入模型&#34;&gt;3.1 录入模型&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/专利摘要-Word2Vec.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Loading 专利摘要-Word2Vec.200.15.bin...
&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x32b079340&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-keyedvectors的操作方法或属性&#34;&gt;3.2 KeyedVectors的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.index_to_key&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.key_to_index&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.vector_size&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取模型中词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.get_vector(word)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.most_similar(words, topn=10)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取某类词(list)最相似的10个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的10个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的10个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-词汇量维度数&#34;&gt;3.3 词汇量&amp;amp;维度数&lt;/h3&gt;
&lt;p&gt;查看模型中的词汇量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;print(f&amp;#39;词汇量: {len(w2v)}&amp;#39;)
print(f&amp;#39;维度数: {w2v.vector_size}&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;词汇量: 1059801
维度数: 200
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-查看词向量&#34;&gt;3.4 查看词向量&lt;/h3&gt;
&lt;p&gt;查看任意词的词向量，例如 &lt;strong&gt;“人工智能”&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查看 ”人工智能” 的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([-1.1817173 , -2.1371903 , -3.0181015 ,  1.7000161 , -3.081852  ,
        3.554449  , -0.22385244,  3.6647737 , -3.7086377 , -1.4868759 ,
       -0.7706527 ,  5.9335155 ,  2.8328223 , -1.7995875 , -6.051175  ,
       -0.91756725, -4.15509   , -1.6975762 , -4.5753274 , -3.022245  ,
       ......
       -2.0807118 , -3.4522808 ,  4.29429   , -1.712142  , -1.6512033 ,
        2.625037  , -3.4015207 ,  1.3526493 , -0.7858534 , -1.6782432 ,
       -3.1669524 , -2.6371615 , -1.5394825 ,  3.101744  ,  0.44502366,
       -1.4104489 , -0.01298253, -4.217453  , -0.92512876,  0.10754411],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;35-最相似词&#34;&gt;3.5 最相似词&lt;/h3&gt;
&lt;p&gt;与 &amp;lsquo;创新&amp;rsquo;、&amp;lsquo;颠覆&amp;rsquo; 最相似的20个词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# 词语列表中可传入任意多个词，
# 大邓词穷，只想到这两个相似的种子词
w2v.most_similar([&amp;#39;创新&amp;#39;, &amp;#39;颠覆&amp;#39;], topn=20)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;革新&amp;#39;, 0.7983665466308594),
 (&amp;#39;改革&amp;#39;, 0.7454208731651306),
 (&amp;#39;变革&amp;#39;, 0.7136300206184387),
 (&amp;#39;全新&amp;#39;, 0.707391619682312),
 (&amp;#39;彻底改变&amp;#39;, 0.7064372301101685),
 (&amp;#39;创造性&amp;#39;, 0.6960274577140808),
 (&amp;#39;颠覆性&amp;#39;, 0.6874485611915588),
 (&amp;#39;有别于&amp;#39;, 0.6775000095367432),
 (&amp;#39;加以改进&amp;#39;, 0.6736693978309631),
 (&amp;#39;摒弃&amp;#39;, 0.6716011762619019),
 (&amp;#39;独创&amp;#39;, 0.6609643697738647),
 (&amp;#39;颠覆传统&amp;#39;, 0.6604534983634949),
 (&amp;#39;开创&amp;#39;, 0.6531570553779602),
 (&amp;#39;核心技术&amp;#39;, 0.6419240236282349),
 (&amp;#39;彻底颠覆&amp;#39;, 0.6397384405136108),
 (&amp;#39;技术创新&amp;#39;, 0.6390863060951233),
 (&amp;#39;突破性&amp;#39;, 0.6368305087089539),
 (&amp;#39;大胆&amp;#39;, 0.6357517242431641),
 (&amp;#39;技术革新&amp;#39;, 0.6347700357437134),
 (&amp;#39;沿用&amp;#39;, 0.6328355669975281)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;刚刚的运行结果表明，模型很好地学习到了专利摘要中的语义关系。&lt;/p&gt;
&lt;p&gt;如果我想开发三个词典，分别是 &lt;strong&gt;创新&lt;/strong&gt;、&lt;strong&gt;成本&lt;/strong&gt;、&lt;strong&gt;质量&lt;/strong&gt; ，想直接将结果保存到txt中，可以运行如下代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;seeds&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新概念&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;创新&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;颠覆&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
         &lt;span class=&#34;s1&#34;&gt;&amp;#39;成本概念&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;成本&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
         &lt;span class=&#34;s1&#34;&gt;&amp;#39;质量概念&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;质量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;expand_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;          &lt;span class=&#34;c1&#34;&gt;# word2vec词向量&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;seeddict&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seeds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 种子词字典&lt;/span&gt;
                     &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;         &lt;span class=&#34;c1&#34;&gt;# 保留20个最相似的词&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Finish! 创新概念 candidates saved to output/创新概念.txt
Finish! 成本概念 candidates saved to output/成本概念.txt
Finish! 质量概念 candidates saved to output/质量概念.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/similar-words.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四获取资源&#34;&gt;四、获取资源&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 免费     专利摘要-Word2Vec.200.15.bin 链接: https://pan.baidu.com/s/1LKebAWL5fzjUVo_MR7dVug?pwd=a56c 提取码: a56c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/"><strong>5112万条专利申请数据集(1985-2025年)</strong></a> 中随机抽取了30%的 「<strong>专利摘要</strong>」，构成6.14G的训练语料(千万级别)， 耗时6小时，训练得到word2vec模型。</p>
<blockquote>
<p>需要注意， 100%全部语料有30+G， 训练时间非常长。</p>
<p>没办法，我不会优化代码性能，所以只能抽取 30% 的文本数据来训练word2vec ，语料体积大概10G。</p>
</blockquote>
<br>
<p>本文需要用到新版 cntext。由于改动修复的 bug 较多，若直接上传到 PyPI，将导致之前制作的课程和公众号推文的相关内容全部需要重新制作一遍。</p>
<p><br><br></p>
<h2 id="一语料构建">一、语料构建</h2>
<p>随机抽取 30% 的记录，构成千万级别的专利摘要训练语料。</p>
<p>为了防止电脑内存爆炸， 对任意单个大csv文件，分批次读取，每次读10w行。最终将专利摘要文本保存到txt文件中，编码方式为utf-8。</p>
<blockquote>
<p>如果想开发一些词典，可以跳过此部分内容，并不影响代码运行。</p>
</blockquote>
<p><img loading="lazy" src="img/screen-datasets.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 将代码放在csv数据文件夹内</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">re</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;专利摘要.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">corpus_file</span><span class="p">:</span>
    <span class="c1"># 分批读取大 csv 文件，每批读取 10 万行</span>
    <span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span> 
                            <span class="n">usecols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">],</span> 
                            <span class="n">chunksize</span><span class="o">=</span><span class="mi">100000</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="n">chunk_dfs</span><span class="p">:</span>
        <span class="c1"># 每批随机抽取 30% 的记录</span>
        <span class="n">sample_df</span> <span class="o">=</span> <span class="n">chunk_df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
        <span class="n">raw_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">sample_df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">))</span>
        <span class="n">corpus_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
</code></pre></div><p>最终得到的 <strong>专利摘要.txt</strong>  文件有 10G<br><br></p>
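<p>上面「分批读取 + 随机抽样 + 写入语料」的流程，可以用一个内存中的小例子来验证逻辑是否正确。下面是一个最小示意：列名沿用正文中的「专利名称」「摘要文本」，CSV 内容为虚构数据，抽样比例取 100% 以便核对行数（正文中为 0.3）。</p>

```python
import io
import pandas as pd

# 虚构的小 CSV，模拟专利数据（列名与正文一致）；其中 B 的摘要为空
csv_text = (
    "专利名称,摘要文本\n"
    "A,一种新型电池\n"
    "B,\n"
    "C,一种图像识别方法\n"
    "D,一种污水处理装置\n"
)

parts = []
# chunksize=2 模拟正文中 chunksize=100000 的分批读取
for chunk in pd.read_csv(io.StringIO(csv_text),
                         usecols=['专利名称', '摘要文本'],
                         chunksize=2):
    # 正文中为 frac=0.3；这里取 frac=1.0 便于核对行数
    sample = chunk.sample(frac=1.0, random_state=42)
    # 空摘要用 fillna('') 填充为空字符串后再拼接
    parts.append('\n'.join(sample['摘要文本'].fillna('')))

corpus = '\n'.join(parts)
```

<p>每个 chunk 贡献 2 行（含空行），4 条记录共得到 4 行语料，说明空摘要会在语料中留下空行；真实数据规模下也可以在写入前过滤掉空行。</p>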
<h2 id="二训练word2vec">二、训练word2vec</h2>
<h3 id="21-安装">2.1 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="22-训练-word2vec">2.2 训练 Word2Vec</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># cntext为2.1.6</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;专利摘要.txt&#39;</span><span class="p">,</span>
                        <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                        <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="c1"># 词向量维度</span>
                        <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span><span class="c1"># 窗口大小</span>
                        <span class="n">chunksize</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span> <span class="c1"># 每次读取10000行</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/专利摘要_cache.txt Not Found or Empty, Preprocessing Corpus

Reading Preprocessed Corpus from output/专利摘要_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 8816 s. 
Output Saved To: output/专利摘要-Word2Vec.200.15.bin
</code></pre></div><p>整个训练过程约 2.5 小时，训练结束后得到 <em><strong>output</strong></em> 文件夹，里面有</p>
<ul>
<li><em><strong>output/专利摘要-Word2Vec.200.15.bin</strong></em>  模型文件</li>
<li><em><strong>专利摘要_cache.txt</strong></em>                   训练缓存文件</li>
</ul>
<br>
<h3 id="23-评估模型">2.3 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   427    |    110     |            0.46            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   238    |    439     |    3.78    |   5.67   |
|   CityInProvince   |   175    |     0      |   25.14    |   4.48   |
| FamilyRelationship |   156    |    116     |   33.33    |   2.29   |
|   SocialScience    |    8     |     62     |   37.50    |   2.33   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型表现越好。</p>
<br>
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries   专利语料在此项表现很差，应该是语料中常见国家首都的提及较少。</li>
<li>CityInProvince       专利语料在此项好于 CapitalOfCountries，毕竟这些专利大多产生于中国各省市。</li>
<li>FamilyRelationship   没想到专利语料在此项的准确率显著大于 0。我原本以为会是 0，毕竟家庭关系类词语在专利摘要中不太「技术」。可能有些发明类似电影《非诚勿扰》里解决人类情感问题的发明，很雷人。</li>
<li>SocialScience        专利语料在此项表现一般，应该是语料中常见的社会科学词语提及较少。</li>
</ul>
<p>整体而言，模型在这四个维度上的准确率都较低。但需要说明，这四个维度是大邓自己收集的；评判模型类比表现的维度有很多，专利摘要模型有可能在别的类比维度上表现很好。</p>
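<p>顺带补充一下近义测试背后的原理：它比较「人工标注的词对相似度」与「模型余弦相似度」两列分值的排名一致程度，即 Spearman 等级相关。下面用一组虚构的分值演示其计算方式（无并列排名的简化情形）：</p>

```python
import numpy as np

# 虚构数据：4 个词对的人工打分与模型余弦相似度
human = np.array([9.0, 7.5, 4.0, 1.0])
model = np.array([0.82, 0.66, 0.41, 0.05])

def spearman(a, b):
    # argsort 两次得到每个元素的名次（从 0 起）
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    n = len(a)
    d = ra - rb
    # 无并列时的 Spearman 公式: 1 - 6*Σd² / (n(n²-1))
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho = spearman(human, model)  # 两列排名完全一致时为 1.0
```

<p>这个虚构例子里两列排名完全一致，所以 rho 为 1.0；正文中 0.46 说明模型排名与人工排名只是中度一致。</p>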
<p><br><br></p>
<h2 id="三使用词向量">三、使用词向量</h2>
<h3 id="31-录入模型">3.1 录入模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/专利摘要-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading 专利摘要-Word2Vec.200.15.bin...
&lt;gensim.models.keyedvectors.KeyedVectors at 0x32b079340&gt;
</code></pre></div><br>
<h3 id="32-keyedvectors的操作方法或属性">3.2 KeyedVectors的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><em><strong>KeyedVectors.index_to_key</strong></em></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.key_to_index</strong></em></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.vector_size</strong></em></td>
<td>获取模型中词向量的维度。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.get_vector(word)</strong></em></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.most_similar(words, topn=10)</strong></em></td>
<td>获取某类词(list)最相似的10个近义词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_word(word, topn=10)</strong></em></td>
<td>获取某词语最相似的10个近义词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_vector(vector, topn=10)</strong></em></td>
<td>获取词向量最相似的10个近义词。</td>
</tr>
</tbody>
</table>
<br>
<h3 id="33-词汇量维度数">3.3 词汇量&amp;维度数</h3>
<p>查看模型中的词汇量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">print(f&#39;词汇量: {len(w2v)}&#39;)
print(f&#39;维度数: {w2v.vector_size}&#39;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">词汇量: 1059801
维度数: 200
</code></pre></div><br>
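<p>根据上面输出的词汇量和维度数，可以粗略估算词向量矩阵的内存占用（float32 每个数占 4 字节），这只是一个简单的算术示意：</p>

```python
vocab_size = 1_059_801               # 上文输出的词汇量
dim = 200                            # 词向量维度
bytes_total = vocab_size * dim * 4   # float32 占 4 字节
gb = bytes_total / 1024 ** 3         # 换算成 GB，约 0.79 GB
```

<p>也就是说，百万词表、200 维的模型加载进内存大约需要 0.8 GB，普通笔记本完全可以承受。</p>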
<h3 id="34-查看词向量">3.4 查看词向量</h3>
<p>查看任意词的词向量，例如 <strong>“人工智能”</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 查看 ”人工智能” 的词向量</span>
<span class="n">w2v</span><span class="p">[</span><span class="s1">&#39;人工智能&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-1.1817173 , -2.1371903 , -3.0181015 ,  1.7000161 , -3.081852  ,
        3.554449  , -0.22385244,  3.6647737 , -3.7086377 , -1.4868759 ,
       -0.7706527 ,  5.9335155 ,  2.8328223 , -1.7995875 , -6.051175  ,
       -0.91756725, -4.15509   , -1.6975762 , -4.5753274 , -3.022245  ,
       ......
       -2.0807118 , -3.4522808 ,  4.29429   , -1.712142  , -1.6512033 ,
        2.625037  , -3.4015207 ,  1.3526493 , -0.7858534 , -1.6782432 ,
       -3.1669524 , -2.6371615 , -1.5394825 ,  3.101744  ,  0.44502366,
       -1.4104489 , -0.01298253, -4.217453  , -0.92512876,  0.10754411],
      dtype=float32)
</code></pre></div><br>
<h3 id="35-最相似词">3.5 最相似词</h3>
<p>与 &lsquo;创新&rsquo;、&lsquo;颠覆&rsquo; 最相似的20个词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># 词语列表中可传入任意多个词，
# 大邓词穷，只想到这两个相似的种子词
w2v.most_similar([&#39;创新&#39;, &#39;颠覆&#39;], topn=20)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;革新&#39;, 0.7983665466308594),
 (&#39;改革&#39;, 0.7454208731651306),
 (&#39;变革&#39;, 0.7136300206184387),
 (&#39;全新&#39;, 0.707391619682312),
 (&#39;彻底改变&#39;, 0.7064372301101685),
 (&#39;创造性&#39;, 0.6960274577140808),
 (&#39;颠覆性&#39;, 0.6874485611915588),
 (&#39;有别于&#39;, 0.6775000095367432),
 (&#39;加以改进&#39;, 0.6736693978309631),
 (&#39;摒弃&#39;, 0.6716011762619019),
 (&#39;独创&#39;, 0.6609643697738647),
 (&#39;颠覆传统&#39;, 0.6604534983634949),
 (&#39;开创&#39;, 0.6531570553779602),
 (&#39;核心技术&#39;, 0.6419240236282349),
 (&#39;彻底颠覆&#39;, 0.6397384405136108),
 (&#39;技术创新&#39;, 0.6390863060951233),
 (&#39;突破性&#39;, 0.6368305087089539),
 (&#39;大胆&#39;, 0.6357517242431641),
 (&#39;技术革新&#39;, 0.6347700357437134),
 (&#39;沿用&#39;, 0.6328355669975281)]
</code></pre></div><br>
<p>刚刚的运行结果表明，模型很好地学习到了专利摘要中的语义关系。</p>
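<p>most_similar 传入词语列表时，gensim 的大致做法是先取各种子词向量的平均作为查询向量，再按余弦相似度对全词表排序并剔除种子词自身。下面用一个 4 词、3 维的玩具词表示意这一机制（向量数值为虚构）：</p>

```python
import numpy as np

# 玩具词表与向量（数值为虚构，仅用于演示机制）
vocab = ['创新', '颠覆', '革新', '成本']
vectors = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.1, 0.1],
    [0.95, 0.15, 0.05],
    [0.0, 1.0, 0.5],
], dtype=np.float32)

def most_similar(seeds, topn=2):
    idx = [vocab.index(w) for w in seeds]
    query = vectors[idx].mean(axis=0)          # 种子词向量取平均
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )                                          # 与全词表的余弦相似度
    order = np.argsort(-sims)                  # 相似度从高到低
    return [(vocab[i], float(sims[i])) for i in order if i not in idx][:topn]

result = most_similar(['创新', '颠覆'], topn=2)
```

<p>这里「革新」的向量恰好接近两个种子词的平均，因此排在候选第一位，与正文输出中「革新」相似度最高是同一个道理。</p>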
<p>如果我想开发三个词典，分别是 <strong>创新</strong>、<strong>成本</strong>、<strong>质量</strong> ，想直接将结果保存到txt中，可以运行如下代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">seeds</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;创新概念&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;颠覆&#39;</span><span class="p">],</span>
         <span class="s1">&#39;成本概念&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;成本&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">],</span>
         <span class="s1">&#39;质量概念&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;质量&#39;</span><span class="p">]}</span>

<span class="n">ct</span><span class="o">.</span><span class="n">expand_dictionary</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span>          <span class="c1"># word2vec词向量</span>
                     <span class="n">seeddict</span><span class="o">=</span><span class="n">seeds</span><span class="p">,</span>  <span class="c1"># 种子词字典</span>
                     <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>         <span class="c1"># 保留20个最相似的词</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Finish! 创新概念 candidates saved to output/创新概念.txt
Finish! 成本概念 candidates saved to output/成本概念.txt
Finish! 质量概念 candidates saved to output/质量概念.txt
</code></pre></div><p><img loading="lazy" src="img/similar-words.png" alt=""  />
</p>
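<p>拿到 output 文件夹里的候选词之后，通常还需要一步人工或规则筛选才能形成最终词典，例如剔除种子词自身、按相似度设阈值。下面是一个纯 Python 的整理示意（候选数据为虚构）：</p>

```python
# 虚构的候选列表：(词语, 与种子词的相似度)
seeds = ['创新', '颠覆']
candidates = [('革新', 0.80), ('改革', 0.75), ('创新', 0.99), ('沿用', 0.63)]

# 剔除种子词自身，并只保留相似度 >= 0.65 的词
final = [w for w, score in candidates if w not in seeds and score >= 0.65]
```

<p>阈值（这里的 0.65）没有统一标准，需要结合研究问题人工抽查候选词后再确定。</p>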
<p><br><br></p>
<h2 id="四获取资源">四、获取资源</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费     专利摘要-Word2Vec.200.15.bin 链接: https://pan.baidu.com/s/1LKebAWL5fzjUVo_MR7dVug?pwd=a56c 提取码: a56c
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>词向量 | 使用1亿B站用户签名训练word2vec词向量</title>
      <link>https://textdata.cn/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec/</guid>
      <description>&lt;h2 id=&#34;一用户签名&#34;&gt;一、用户签名&lt;/h2&gt;
&lt;p&gt;1 亿 B 站用户群体十分庞大，文本中蕴含着这个群体的认知信息(如兴趣、身份、座右铭等)。如果能用签名训练 word2vec 词向量模型，说不定就能利用这个模型对每个用户签名进行量化，进而对用户进行分类。本文要解决：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;构建语料训练出模型&lt;/li&gt;
&lt;li&gt;简单看看模型训练效果&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二准备语料&#34;&gt;二、准备语料&lt;/h2&gt;
&lt;p&gt;Kaggle 网有 1 亿 B 站用户数据集，下载地址&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://www.kaggle.com/datasets/beats0/bilibili-user&#34;&gt;https://www.kaggle.com/datasets/beats0/bilibili-user&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;之前分享过 &lt;a href=&#34;https://textdata.cn/blog/2023-05-10-100m-bilibili-user-info-dataset/&#34;&gt;数据集 | 哔哩哔哩 1 亿用户数据&lt;/a&gt; ， 阅读此文可以熟悉 pandas 的一些基本操作，如数据读取、文本操作等。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 从kaggle下载B站1亿用户数据&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 查看前5行&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;User.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;将 9093092 个非空签名汇总到 &lt;strong&gt;&lt;em&gt;B 站签名.txt&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;B站签名.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;raw_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sign&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;raw_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;代码运行后，得到 320M 的 &lt;strong&gt;&lt;em&gt;B 站签名.txt&lt;/em&gt;&lt;/strong&gt; 。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三训练-word2vec&#34;&gt;三、训练 Word2Vec&lt;/h2&gt;
&lt;h3 id=&#34;31-安装-cntext&#34;&gt;3.1 安装 cntext&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install cntext --upgrade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-训练-word2vec&#34;&gt;3.2 训练 word2vec&lt;/h3&gt;
&lt;p&gt;cntext 训练时候 Word2Vec 模型参数&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;corpus_file&lt;/em&gt;&lt;/strong&gt; 语料 txt 文件路径，即刚刚准备的 &lt;strong&gt;&lt;em&gt;B 站签名.txt&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;window_size&lt;/em&gt;&lt;/strong&gt; 上下文窗口大小(上下文语义)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;vector_size&lt;/em&gt;&lt;/strong&gt; 向量维度数&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;chunksize&lt;/em&gt;&lt;/strong&gt; 每次语料 txt 文件中读取的行数&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;lang&lt;/em&gt;&lt;/strong&gt; 语料的语言&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# cntext2.1.6未公开，获取2.1.6请阅读文末获取方式&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;B站签名.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/B站签名_cache.txt Not Found or Empty, Preprocessing Corpus

Reading Preprocessed Corpus from output/B站签名_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 275 s.
Output Saved To: output/B站签名-Word2Vec.200.15.bin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;耗时 275s， 模型训练完成！需要注意， output 文件夹内有&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;B 站签名-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt; 模型文件&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;B 站签名_cache.txt&lt;/em&gt;&lt;/strong&gt; 训练缓存文件&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-评估模型&#34;&gt;3.3 评估模型&lt;/h3&gt;
&lt;p&gt;使用近义法和类比法， 判断模型的表现。详情可查看&lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;文档&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   434    |    103     |            0.34            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&amp;lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   360    |    317     |   25.56    |   4.02   |
|   CityInProvince   |   175    |     0      |   33.71    |   4.64   |
| FamilyRelationship |   240    |     32     |   44.17    |   1.93   |
|   SocialScience    |    2     |     68     |    0.00    |   NaN    |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;近义测试&lt;/strong&gt;: Spearman&amp;rsquo;s Rank Coefficient(斯皮尔曼等级相关系数)取值范围为 [-1, 1]，取值越大，说明模型表现越好。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;类比测试&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries B 站用户签名语料在此项表现大于 0，说明很多签名里会出现国家首都这类信息。&lt;/li&gt;
&lt;li&gt;CityInProvince B 站用户签名语料在此项表现大于 0，说明很多签名里会出现省份省会这类信息。考虑到用户几乎全为中国人，所以此项准确率高于 CapitalOfCountries。&lt;/li&gt;
&lt;li&gt;FamilyRelationship B 站用户签名语料体现的是一个个鲜活的中国人，签名中必然含有更多的人际关系， 所以此项准确率是四个项目中最高的。&lt;/li&gt;
&lt;li&gt;SocialScience B 站用户签名语料在此项表现最差， 应该是语料中常见的社会科学词语提及很少。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;整体而言，模型效果一般，但这并非算法或代码的问题，而是语料本身的问题。毕竟每个用户的签名通常只有一句话，太短，信息量太少。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四使用-word2vec&#34;&gt;四、使用 word2vec&lt;/h2&gt;
&lt;h3 id=&#34;41-读取模型&#34;&gt;4.1 读取模型&lt;/h3&gt;
&lt;p&gt;使用 cntext 的 load_w2v 读取模型 &lt;strong&gt;&lt;em&gt;B 站签名-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt; ，返回的 w2v 对象可按 gensim 的 KeyedVectors 用法查询。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;KeyedVectors&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/B站签名-Word2Vec.200.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;模型词汇量: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;模型词汇量:  244491
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;42-查询某词的词向量&#34;&gt;4.2 查询某词的词向量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;w2v[&amp;#39;高冷&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 1.05914783e+00,  4.51383203e-01, -1.34764791e+00, -9.42894161e-01,
        5.28594255e-01,  8.05936933e-01, -1.59555584e-01,  2.42719814e-01,
       -6.04722261e-01, -9.25606042e-02,  9.69056904e-01,  8.85407850e-02,
       -1.67851341e+00,  3.26303959e-01,  6.52321458e-01,  5.77043407e-02,
       -4.24268842e-02, -2.64299393e-01,  5.24512887e-01,  2.15208486e-01,
       -2.09263057e-01, -4.55661058e-01,  8.78976703e-01, -1.24363959e+00,
       -1.71196852e-02, -9.03965294e-01, -6.52690083e-02,  2.47650072e-02,
       -2.82155067e-01,  9.09134224e-02,  9.13890541e-01, -1.40862179e+00,
       -1.31956196e+00, -5.29659569e-01,  1.23605825e-01, -4.00647372e-01,
        4.94630456e-01,  2.81695575e-01,  1.71391249e-01,  1.23341233e-01,
       -7.70617545e-01,  5.81079908e-02, -4.89788234e-01,  2.14924827e-01,
       -7.73121595e-01, -6.66803181e-01, -1.31617844e+00,  1.18301921e-01,
        6.22543573e-01, -8.07524860e-01, -4.36694354e-01,  2.95946062e-01,
        3.10503364e-01, -4.93252903e-01,  1.27962172e-01,  1.97043195e-01,
        6.61175609e-01, -1.80842638e-01,  1.13270843e+00, -5.34760773e-01,
        9.13145125e-01,  5.48191011e-01,  7.68198539e-03,  1.17955339e+00,
       -1.96015276e-02,  9.14144278e-01, -9.06695664e-01,  4.39731702e-02,
       -3.87832075e-01,  4.72544342e-01,  4.95476156e-01, -1.21628530e-01,
       -4.41256445e-03,  1.82375580e-01, -7.00045705e-01,  4.34259921e-01,
        2.00862193e+00, -5.61490715e-01, -7.67120644e-02,  5.78972995e-01,
       -7.80492842e-01, -5.01321375e-01, -5.50926566e-01, -8.99926543e-01,
       -1.66289490e-02,  1.77679747e-01,  4.23889339e-01,  1.40111005e+00,
       -7.63866380e-02, -8.86032939e-01, -1.08106744e+00,  3.31989765e-01,
        3.78885448e-01, -1.23718023e+00,  2.09680721e-01,  2.39727721e-01,
        2.46049106e-01,  2.32866824e-01, -6.65583909e-02,  1.09542537e+00,
       -5.44713318e-01,  7.68220305e-01, -1.56612769e-02,  3.48719925e-01,
        2.91741371e-01,  1.88722059e-01, -2.12467611e-01,  8.20825279e-01,
       -1.74725935e-01, -8.05535197e-01, -1.41250715e-01, -7.84179568e-01,
       -8.00660312e-01, -1.12991728e-01, -2.16052849e-02, -1.07448053e+00,
        2.53552765e-01, -1.28611282e-01, -1.16868567e+00, -6.08788371e-01,
        4.30017859e-02, -5.11076570e-01,  6.43583059e-01,  3.11966389e-01,
       -1.63116843e-01,  3.58751595e-01,  5.16831456e-03,  5.09353161e-01,
        1.61675465e+00,  6.42039478e-01, -1.07160270e+00, -2.34255135e-01,
       -7.27983773e-01,  1.20267116e-01, -1.11912894e+00,  1.49096262e+00,
       -1.48015752e-01,  6.85670376e-02, -1.70197403e+00,  2.16349974e-01,
        1.32302952e+00,  5.39037228e-01, -8.35760951e-01, -7.43441284e-01,
        6.55625939e-01, -5.07541537e-01, -5.40877655e-02, -5.38533449e-01,
       -2.57937461e-01,  8.67499232e-01, -6.53150141e-01, -1.32043970e+00,
       -5.84588587e-01,  1.24599323e-01, -8.35753500e-01, -2.68954426e-01,
        3.67542468e-02,  1.61010170e+00,  7.27127492e-01,  1.35515738e+00,
       -2.76694775e-01,  2.69006938e-01,  4.81265247e-01, -6.30314708e-01,
       -3.66074532e-01,  3.03934813e-01,  1.92417920e+00,  4.67498928e-01,
       -1.83004290e-01,  1.01947844e+00, -5.52489638e-01,  1.59275869e-03,
        4.84914184e-01,  1.33545566e+00, -9.75372076e-01,  2.25273356e-01,
        6.02540433e-01,  7.07564950e-01,  1.36330187e-01, -4.34346311e-02,
        4.53452200e-01,  1.58401883e+00, -6.68083191e-01, -1.30876124e+00,
       -1.19713686e-01, -9.80615169e-02, -2.04207993e+00,  8.29822361e-01,
       -4.08902228e-01, -4.70339246e-02,  1.00982547e+00,  1.64084151e-01,
        4.62104648e-01, -2.28677273e-01, -5.95047355e-01, -2.71069705e-01,
        6.27930462e-01, -8.85554433e-01, -1.79520398e-01, -3.44800770e-01],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;43-查看近义词&#34;&gt;4.3 查看近义词&lt;/h3&gt;
&lt;p&gt;通过给定词语查看其近义词，可以了解模型训练的好坏。语义捕捉得合理，说明语料合适、模型训练得好。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 列表中可以传入任意多个词，这里大邓偷懒，都只传入了一两个词&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;女汉纸&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;汉纸&amp;#39;, 0.8672055602073669),
 (&amp;#39;腹黑&amp;#39;, 0.8662189841270447),
 (&amp;#39;文艺清新&amp;#39;, 0.849425733089447),
 (&amp;#39;闷骚&amp;#39;, 0.8427557945251465),
 (&amp;#39;神经大条&amp;#39;, 0.8329920768737793),
 (&amp;#39;汉子&amp;#39;, 0.8232208490371704),
 (&amp;#39;宅基&amp;#39;, 0.8224843144416809),
 (&amp;#39;猥琐大叔&amp;#39;, 0.8214939832687378),
 (&amp;#39;偶是&amp;#39;, 0.8164061307907104),
 (&amp;#39;腐宅&amp;#39;, 0.8117423057556152),
 (&amp;#39;宅女腐女&amp;#39;, 0.8073472380638123),
 (&amp;#39;软妹&amp;#39;, 0.7999386787414551),
 (&amp;#39;萌妹&amp;#39;, 0.7999064326286316),
 (&amp;#39;小女生&amp;#39;, 0.7998836040496826),
 (&amp;#39;天蝎女&amp;#39;, 0.7971166372299194),
 (&amp;#39;傲娇受&amp;#39;, 0.7964810132980347),
 (&amp;#39;天蝎&amp;#39;, 0.7957624197006226),
 (&amp;#39;天蝎座&amp;#39;, 0.7915034890174866),
 (&amp;#39;女纸&amp;#39;, 0.7912994623184204),
 (&amp;#39;双鱼座&amp;#39;, 0.7900263667106628)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;犯二&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;脱线&amp;#39;, 0.8355404734611511),
 (&amp;#39;神经质&amp;#39;, 0.8035165667533875),
 (&amp;#39;神经大条&amp;#39;, 0.7816897630691528),
 (&amp;#39;发神经&amp;#39;, 0.780509352684021),
 (&amp;#39;人来疯&amp;#39;, 0.7794896960258484),
 (&amp;#39;精分&amp;#39;, 0.7705598473548889),
 (&amp;#39;毒舌&amp;#39;, 0.7692195773124695),
 (&amp;#39;犯病&amp;#39;, 0.7659722566604614),
 (&amp;#39;闷骚&amp;#39;, 0.7620697617530823),
 (&amp;#39;迷糊&amp;#39;, 0.7608135342597961),
 (&amp;#39;智商在线&amp;#39;, 0.7525709867477417),
 (&amp;#39;抽疯&amp;#39;, 0.7491970658302307),
 (&amp;#39;欢脱&amp;#39;, 0.7444456219673157),
 (&amp;#39;深井&amp;#39;, 0.7416326403617859),
 (&amp;#39;抽风&amp;#39;, 0.7321327924728394),
 (&amp;#39;精分患者&amp;#39;, 0.7319940328598022),
 (&amp;#39;装嫩&amp;#39;, 0.7267141342163086),
 (&amp;#39;蒙圈&amp;#39;, 0.7262043952941895),
 (&amp;#39;神经&amp;#39;, 0.7257982492446899),
 (&amp;#39;假正经&amp;#39;, 0.7215201258659363)]

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;内向&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;外向&amp;#39;, 0.8474423885345459),
 (&amp;#39;慢热&amp;#39;, 0.8272333741188049),
 (&amp;#39;不爱说话&amp;#39;, 0.8249834775924683),
 (&amp;#39;不善言辞&amp;#39;, 0.80635666847229),
 (&amp;#39;腼腆&amp;#39;, 0.7940059304237366),
 (&amp;#39;孤僻&amp;#39;, 0.7929618954658508),
 (&amp;#39;开朗&amp;#39;, 0.7585728168487549),
 (&amp;#39;闷骚&amp;#39;, 0.745791494846344),
 (&amp;#39;神经质&amp;#39;, 0.7454176545143127),
 (&amp;#39;多愁善感&amp;#39;, 0.7348753809928894),
 (&amp;#39;胆小&amp;#39;, 0.7213962078094482),
 (&amp;#39;沉默寡言&amp;#39;, 0.7145323157310486),
 (&amp;#39;随和&amp;#39;, 0.7115553617477417),
 (&amp;#39;敏感&amp;#39;, 0.7103193402290344),
 (&amp;#39;水瓶座&amp;#39;, 0.7092751264572144),
 (&amp;#39;大大咧咧&amp;#39;, 0.7085798382759094),
 (&amp;#39;高冷&amp;#39;, 0.7084994912147522),
 (&amp;#39;性格开朗&amp;#39;, 0.7064590454101562),
 (&amp;#39;耿直&amp;#39;, 0.7048951983451843),
 (&amp;#39;做作&amp;#39;, 0.704330325126648)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;五获取资源&#34;&gt;五、获取资源&lt;/h2&gt;
&lt;p&gt;内容整理不易， 本文内容分免费和付费部分。 免费部分可以直接下载数据、构建语料、使用 word2vec 模型。&lt;/p&gt;
&lt;p&gt;付费部分主要是 cntext，用于训练 word2vec 模型。 如果对本文感兴趣，可加微信 372335839， 备注「姓名-学校-专业」&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 免费     1亿用户数据集 https://www.kaggle.com/datasets/beats0/bilibili-user

- 免费     B站签名-Word2Vec.200.15.bin  链接: https://pan.baidu.com/s/1ILVwu6gGGGP0IHv-vsjgvw?pwd=em99 提取码: em99
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cntext.readthedocs.io/&#34;&gt;文本分析库 cntext 使用手册 https://cntext.readthedocs.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/&#34;&gt;实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/&#34;&gt;词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/&#34;&gt;使用 5000w 专利申请数据集按年份(按省份)训练词向量&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/&#34;&gt;使用 1000w 条豆瓣影评训练 Word2Vec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/&#34;&gt;词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/&#34;&gt;转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/&#34;&gt;OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一用户签名">一、用户签名</h2>
<p>1 亿 B 站用户群体十分庞大，签名文本中蕴含着这个群体的认知信息(如兴趣、身份、座右铭等)。如果能用签名训练 word2vec 词向量模型，或许就能利用这个模型对每个用户签名进行量化，进而对用户进行分类。本文要解决两个问题</p>
<ul>
<li>构建语料训练出模型</li>
<li>简单看看模型训练效果</li>
</ul>
<p><br><br></p>
<h2 id="二准备语料">二、准备语料</h2>
<p>Kaggle 网有 1 亿 B 站用户数据集，下载地址</p>
<blockquote>
<p><a href="https://www.kaggle.com/datasets/beats0/bilibili-user">https://www.kaggle.com/datasets/beats0/bilibili-user</a></p>
</blockquote>
<p>之前分享过 <a href="https://textdata.cn/blog/2023-05-10-100m-bilibili-user-info-dataset/">数据集 | 哔哩哔哩 1 亿用户数据</a> ， 阅读此文可以熟悉 pandas 的一些基本操作，如数据读取、文本操作等。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 从kaggle下载B站1亿用户数据</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># 查看前5行</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;User.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<p>将 9093092 个非空签名汇总到 <strong><em>B 站签名.txt</em></strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;B站签名.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">raw_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;sign&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">))</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
</code></pre></div><p>代码运行后，得到 320M 的 <strong><em>B 站签名.txt</em></strong> 。</p>
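<p>上面的示例为节省篇幅只读取了前 5 行。完整数据有 1 亿行，整表读入容易耗尽内存；下面是一个分块读取、逐块写出的思路示意(假设签名列名为 sign，与上文一致)，仅供参考。</p>

```python
# 思路示意：分块读取大 CSV，逐块把非空签名追加写入语料文件，
# 避免一次性把 1 亿行数据载入内存。列名 sign 以实际数据集为准。
import pandas as pd

def build_corpus(csv_path, out_path, chunksize=1_000_000):
    total = 0
    with open(out_path, 'w', encoding='utf-8') as f:
        for chunk in pd.read_csv(csv_path, usecols=['sign'], chunksize=chunksize):
            signs = chunk['sign'].dropna().astype(str)
            signs = signs[signs.str.strip() != '']        # 丢弃空白签名
            for s in signs:
                f.write(s.replace('\n', ' ') + '\n')      # 一行一条签名
            total += len(signs)
    return total
```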
<p><br><br></p>
<h2 id="三训练-word2vec">三、训练 Word2Vec</h2>
<h3 id="31-安装-cntext">3.1 安装 cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="32-训练-word2vec">3.2 训练 word2vec</h3>
<p>cntext 训练时候 Word2Vec 模型参数</p>
<ul>
<li><strong><em>corpus_file</em></strong> 语料 txt 文件路径，即刚刚准备的 <strong><em>B 站签名.txt</em></strong></li>
<li><strong><em>window_size</em></strong> 上下文窗口大小(上下文语义)</li>
<li><strong><em>vector_size</em></strong> 向量维度数</li>
<li><strong><em>chunksize</em></strong> 每次语料 txt 文件中读取的行数</li>
<li><strong><em>lang</em></strong> 语料的语言</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># cntext2.1.6未公开，获取2.1.6请阅读文末获取方式</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;B站签名.txt&#39;</span><span class="p">,</span>
                    <span class="n">vector_size</span> <span class="o">=</span> <span class="mi">200</span><span class="p">,</span>
                    <span class="n">window_size</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span>
                    <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/B站签名_cache.txt Not Found or Empty, Preprocessing Corpus

Reading Preprocessed Corpus from output/B站签名_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 275 s.
Output Saved To: output/B站签名-Word2Vec.200.15.bin
</code></pre></div><p>耗时 275s， 模型训练完成！需要注意， output 文件夹内有</p>
<ul>
<li><strong><em>B 站签名-Word2Vec.200.15.bin</em></strong> 模型文件</li>
<li><strong><em>B 站签名_cache.txt</em></strong> 训练缓存文件</li>
</ul>
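<p>参数 window_size(上下文窗口)的含义，可以用 skip-gram 的训练样本生成过程来直观理解。下面是一个纯 Python 的小示意，并非 cntext 的内部实现。</p>

```python
# 思路示意：window_size 的含义——以每个词为中心词，取其前后各 window 个词作为上下文，
# 组成 (中心词, 上下文词) 训练样本，这是 skip-gram 版 word2vec 的基本训练单元。
def skipgram_pairs(tokens, window):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # 窗口左边界
        hi = min(len(tokens), i + window + 1)   # 窗口右边界(开区间)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(['我', '喜欢', '看', '动画'], window=1))
# [('我', '喜欢'), ('喜欢', '我'), ('喜欢', '看'), ('看', '喜欢'), ('看', '动画'), ('动画', '看')]
```

<p>一般认为窗口越大，学到的语义越偏主题关联；窗口越小，越偏句法与近义。签名普遍很短，窗口取 15 基本覆盖整条签名。</p>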
<br>
<h3 id="33-评估模型">3.3 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   434    |    103     |            0.34            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   360    |    317     |   25.56    |   4.02   |
|   CityInProvince   |   175    |     0      |   33.71    |   4.64   |
| FamilyRelationship |   240    |     32     |   44.17    |   1.93   |
|   SocialScience    |    2     |     68     |    0.00    |   NaN    |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient(斯皮尔曼等级相关系数)取值范围为 [-1, 1]，取值越大，说明模型表现越好。</p>
<br>
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries B 站用户签名语料在此项表现大于 0，说明很多签名里会出现国家首都这类信息。</li>
<li>CityInProvince B 站用户签名语料在此项表现大于 0，说明很多签名里会出现省份省会这类信息。考虑到用户几乎全为中国人，所以此项准确率高于 CapitalOfCountries。</li>
<li>FamilyRelationship B 站用户签名语料体现的是一个个鲜活的中国人，签名中必然含有更多的人际关系， 所以此项准确率是四个项目中最高的。</li>
<li>SocialScience B 站用户签名语料在此项表现最差， 应该是语料中常见的社会科学词语提及很少。</li>
</ul>
<p>整体而言，模型效果一般，但这并非算法或代码的问题，而是语料本身的问题。毕竟每个用户的签名通常只有一句话，太短，信息量太少。</p>
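<p>类比测试背后是词向量的加减运算："中国之于北京，相当于 ? 之于东京"，即 v(中国) - v(北京) + v(东京) 应最接近 v(日本)。下面用假想的玩具向量演示这一计算过程，数值并非真实模型输出。</p>

```python
# 思路示意：类比测试的原理——v(中国) - v(北京) + v(东京) 应最接近 v(日本)。
# vocab 为假想的 3 维玩具词表，仅用于演示余弦相似度与向量运算。
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = {
    '中国': np.array([1.0, 0.0, 1.0]),
    '北京': np.array([1.0, 1.0, 0.0]),
    '日本': np.array([0.0, 0.0, 1.2]),
    '东京': np.array([0.0, 1.0, 0.1]),
}

query = vocab['中国'] - vocab['北京'] + vocab['东京']      # = [0, 0, 1.1]
best = max(vocab, key=lambda w: cosine(query, vocab[w]))  # 与 query 夹角最小的词
print(best)   # 日本
```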
<p><br><br></p>
<h2 id="四使用-word2vec">四、使用 word2vec</h2>
<h3 id="41-读取模型">4.1 读取模型</h3>
<p>使用 cntext 的 load_w2v 读取模型 <strong><em>B 站签名-Word2Vec.200.15.bin</em></strong> ，返回的 w2v 对象可按 gensim 的 KeyedVectors 用法查询。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">KeyedVectors</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/B站签名-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;模型词汇量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">w2v</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">模型词汇量:  244491
</code></pre></div><br>
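<p>开头提到的"对每个用户签名进行量化"，最简单的做法是：签名分词后查询每个词的词向量并取平均，得到句向量，再做聚类或分类。下面用假想的 3 维玩具词表示意计算过程；真实使用时把 toy_w2v 换成上面读取的 w2v(维度为 200)。</p>

```python
# 思路示意：签名向量 = 签名中各词词向量的平均值；未登录词(不在词表中的词)跳过。
# toy_w2v 为假想的玩具词表，仅用于演示，非真实模型数值。
import numpy as np

def sign_vector(words, w2v, dim=3):
    vecs = [w2v[w] for w in words if w in w2v]
    if not vecs:
        return np.zeros(dim)              # 全部是未登录词时返回零向量
    return np.mean(vecs, axis=0)

toy_w2v = {'高冷': np.array([1.0, 0.0, 0.0]),
           '宅':   np.array([0.0, 1.0, 0.0])}

v = sign_vector(['高冷', '宅', '未登录词'], toy_w2v)
print(v)   # [0.5 0.5 0. ]
```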
<h3 id="42-查询某词的词向量">4.2 查询某词的词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">w2v[&#39;高冷&#39;]
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 1.05914783e+00,  4.51383203e-01, -1.34764791e+00, -9.42894161e-01,
        5.28594255e-01,  8.05936933e-01, -1.59555584e-01,  2.42719814e-01,
       -6.04722261e-01, -9.25606042e-02,  9.69056904e-01,  8.85407850e-02,
       -1.67851341e+00,  3.26303959e-01,  6.52321458e-01,  5.77043407e-02,
       -4.24268842e-02, -2.64299393e-01,  5.24512887e-01,  2.15208486e-01,
       -2.09263057e-01, -4.55661058e-01,  8.78976703e-01, -1.24363959e+00,
       -1.71196852e-02, -9.03965294e-01, -6.52690083e-02,  2.47650072e-02,
       -2.82155067e-01,  9.09134224e-02,  9.13890541e-01, -1.40862179e+00,
       -1.31956196e+00, -5.29659569e-01,  1.23605825e-01, -4.00647372e-01,
        4.94630456e-01,  2.81695575e-01,  1.71391249e-01,  1.23341233e-01,
       -7.70617545e-01,  5.81079908e-02, -4.89788234e-01,  2.14924827e-01,
       -7.73121595e-01, -6.66803181e-01, -1.31617844e+00,  1.18301921e-01,
        6.22543573e-01, -8.07524860e-01, -4.36694354e-01,  2.95946062e-01,
        3.10503364e-01, -4.93252903e-01,  1.27962172e-01,  1.97043195e-01,
        6.61175609e-01, -1.80842638e-01,  1.13270843e+00, -5.34760773e-01,
        9.13145125e-01,  5.48191011e-01,  7.68198539e-03,  1.17955339e+00,
       -1.96015276e-02,  9.14144278e-01, -9.06695664e-01,  4.39731702e-02,
       -3.87832075e-01,  4.72544342e-01,  4.95476156e-01, -1.21628530e-01,
       -4.41256445e-03,  1.82375580e-01, -7.00045705e-01,  4.34259921e-01,
        2.00862193e+00, -5.61490715e-01, -7.67120644e-02,  5.78972995e-01,
       -7.80492842e-01, -5.01321375e-01, -5.50926566e-01, -8.99926543e-01,
       -1.66289490e-02,  1.77679747e-01,  4.23889339e-01,  1.40111005e+00,
       -7.63866380e-02, -8.86032939e-01, -1.08106744e+00,  3.31989765e-01,
        3.78885448e-01, -1.23718023e+00,  2.09680721e-01,  2.39727721e-01,
        2.46049106e-01,  2.32866824e-01, -6.65583909e-02,  1.09542537e+00,
       -5.44713318e-01,  7.68220305e-01, -1.56612769e-02,  3.48719925e-01,
        2.91741371e-01,  1.88722059e-01, -2.12467611e-01,  8.20825279e-01,
       -1.74725935e-01, -8.05535197e-01, -1.41250715e-01, -7.84179568e-01,
       -8.00660312e-01, -1.12991728e-01, -2.16052849e-02, -1.07448053e+00,
        2.53552765e-01, -1.28611282e-01, -1.16868567e+00, -6.08788371e-01,
        4.30017859e-02, -5.11076570e-01,  6.43583059e-01,  3.11966389e-01,
       -1.63116843e-01,  3.58751595e-01,  5.16831456e-03,  5.09353161e-01,
        1.61675465e+00,  6.42039478e-01, -1.07160270e+00, -2.34255135e-01,
       -7.27983773e-01,  1.20267116e-01, -1.11912894e+00,  1.49096262e+00,
       -1.48015752e-01,  6.85670376e-02, -1.70197403e+00,  2.16349974e-01,
        1.32302952e+00,  5.39037228e-01, -8.35760951e-01, -7.43441284e-01,
        6.55625939e-01, -5.07541537e-01, -5.40877655e-02, -5.38533449e-01,
       -2.57937461e-01,  8.67499232e-01, -6.53150141e-01, -1.32043970e+00,
       -5.84588587e-01,  1.24599323e-01, -8.35753500e-01, -2.68954426e-01,
        3.67542468e-02,  1.61010170e+00,  7.27127492e-01,  1.35515738e+00,
       -2.76694775e-01,  2.69006938e-01,  4.81265247e-01, -6.30314708e-01,
       -3.66074532e-01,  3.03934813e-01,  1.92417920e+00,  4.67498928e-01,
       -1.83004290e-01,  1.01947844e+00, -5.52489638e-01,  1.59275869e-03,
        4.84914184e-01,  1.33545566e+00, -9.75372076e-01,  2.25273356e-01,
        6.02540433e-01,  7.07564950e-01,  1.36330187e-01, -4.34346311e-02,
        4.53452200e-01,  1.58401883e+00, -6.68083191e-01, -1.30876124e+00,
       -1.19713686e-01, -9.80615169e-02, -2.04207993e+00,  8.29822361e-01,
       -4.08902228e-01, -4.70339246e-02,  1.00982547e+00,  1.64084151e-01,
        4.62104648e-01, -2.28677273e-01, -5.95047355e-01, -2.71069705e-01,
        6.27930462e-01, -8.85554433e-01, -1.79520398e-01, -3.44800770e-01],
      dtype=float32)
</code></pre></div><br>
<h3 id="43-查看近义词">4.3 查看近义词</h3>
<p>通过给定词语查看其近义词，可以判断模型训练的好坏：语义捕捉得合理，说明语料质量尚可、模型训练得好。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 列表中可以传入任意多个词，这里大邓偷懒，都只传入了一两个词</span>
<span class="n">w2v</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;女汉纸&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;汉纸&#39;, 0.8672055602073669),
 (&#39;腹黑&#39;, 0.8662189841270447),
 (&#39;文艺清新&#39;, 0.849425733089447),
 (&#39;闷骚&#39;, 0.8427557945251465),
 (&#39;神经大条&#39;, 0.8329920768737793),
 (&#39;汉子&#39;, 0.8232208490371704),
 (&#39;宅基&#39;, 0.8224843144416809),
 (&#39;猥琐大叔&#39;, 0.8214939832687378),
 (&#39;偶是&#39;, 0.8164061307907104),
 (&#39;腐宅&#39;, 0.8117423057556152),
 (&#39;宅女腐女&#39;, 0.8073472380638123),
 (&#39;软妹&#39;, 0.7999386787414551),
 (&#39;萌妹&#39;, 0.7999064326286316),
 (&#39;小女生&#39;, 0.7998836040496826),
 (&#39;天蝎女&#39;, 0.7971166372299194),
 (&#39;傲娇受&#39;, 0.7964810132980347),
 (&#39;天蝎&#39;, 0.7957624197006226),
 (&#39;天蝎座&#39;, 0.7915034890174866),
 (&#39;女纸&#39;, 0.7912994623184204),
 (&#39;双鱼座&#39;, 0.7900263667106628)]
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;犯二&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;脱线&#39;, 0.8355404734611511),
 (&#39;神经质&#39;, 0.8035165667533875),
 (&#39;神经大条&#39;, 0.7816897630691528),
 (&#39;发神经&#39;, 0.780509352684021),
 (&#39;人来疯&#39;, 0.7794896960258484),
 (&#39;精分&#39;, 0.7705598473548889),
 (&#39;毒舌&#39;, 0.7692195773124695),
 (&#39;犯病&#39;, 0.7659722566604614),
 (&#39;闷骚&#39;, 0.7620697617530823),
 (&#39;迷糊&#39;, 0.7608135342597961),
 (&#39;智商在线&#39;, 0.7525709867477417),
 (&#39;抽疯&#39;, 0.7491970658302307),
 (&#39;欢脱&#39;, 0.7444456219673157),
 (&#39;深井&#39;, 0.7416326403617859),
 (&#39;抽风&#39;, 0.7321327924728394),
 (&#39;精分患者&#39;, 0.7319940328598022),
 (&#39;装嫩&#39;, 0.7267141342163086),
 (&#39;蒙圈&#39;, 0.7262043952941895),
 (&#39;神经&#39;, 0.7257982492446899),
 (&#39;假正经&#39;, 0.7215201258659363)]

</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;内向&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;外向&#39;, 0.8474423885345459),
 (&#39;慢热&#39;, 0.8272333741188049),
 (&#39;不爱说话&#39;, 0.8249834775924683),
 (&#39;不善言辞&#39;, 0.80635666847229),
 (&#39;腼腆&#39;, 0.7940059304237366),
 (&#39;孤僻&#39;, 0.7929618954658508),
 (&#39;开朗&#39;, 0.7585728168487549),
 (&#39;闷骚&#39;, 0.745791494846344),
 (&#39;神经质&#39;, 0.7454176545143127),
 (&#39;多愁善感&#39;, 0.7348753809928894),
 (&#39;胆小&#39;, 0.7213962078094482),
 (&#39;沉默寡言&#39;, 0.7145323157310486),
 (&#39;随和&#39;, 0.7115553617477417),
 (&#39;敏感&#39;, 0.7103193402290344),
 (&#39;水瓶座&#39;, 0.7092751264572144),
 (&#39;大大咧咧&#39;, 0.7085798382759094),
 (&#39;高冷&#39;, 0.7084994912147522),
 (&#39;性格开朗&#39;, 0.7064590454101562),
 (&#39;耿直&#39;, 0.7048951983451843),
 (&#39;做作&#39;, 0.704330325126648)]
</code></pre></div><br>
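most_similar 返回的分值就是余弦相似度。下面用纯 Python 演示这个分值的计算方式，向量为虚构的三维示例，并非真实模型中的向量。

```python
import math

def cosine(a, b):
    # 两个向量的余弦相似度: 点积除以模长乘积
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = [1.0, -0.5, 0.3]   # 假设的「高冷」向量
v2 = [0.9, -0.4, 0.2]   # 假设的「内向」向量
print(round(cosine(v1, v2), 4))  # 接近 1, 说明两个向量方向相近
```

gensim 在此基础上对全词表做排序，取前 topn 个结果返回。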
<br>
<h2 id="五获取资源">五、获取资源</h2>
<p>内容整理不易， 本文内容分免费和付费部分。 免费部分可以直接下载数据、构建语料、使用 word2vec 模型。</p>
<p>付费部分主要是 cntext，用于训练 word2vec 模型。 如果对本文感兴趣，可加微信 372335839， 备注「姓名-学校-专业」</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费     1亿用户数据集 https://www.kaggle.com/datasets/beats0/bilibili-user

- 免费     B站签名-Word2Vec.200.15.bin  链接: https://pan.baidu.com/s/1ILVwu6gGGGP0IHv-vsjgvw?pwd=em99 提取码: em99
</code></pre></div><p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://cntext.readthedocs.io/">文本分析库 cntext 使用手册 https://cntext.readthedocs.io/</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 Glove 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">使用 1000w 条豆瓣影评训练 Word2Vec</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见 39 个 FAQ 汇总</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用Stanford Glove代码训练中文语料的Glove模型</title>
      <link>https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/</link>
      <pubDate>Fri, 28 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/</guid>
      <description>&lt;h2 id=&#34;一简介&#34;&gt;一、简介&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://nlp.stanford.edu/projects/glove/&#34;&gt;Stanford GloVe&lt;/a&gt;（Global Vectors for Word Representation）算法作为一种融合全局统计信息与局部上下文窗口的词嵌入模型，相较于Word2Vec仅依赖局部上下文，GloVe利用全局统计信息，能更精准地反映词频分布特征。例如，在高维词向量（如200D）中，GloVe在词语类比任务中准确率达75%，并在命名实体识别任务中优于其他词嵌入模型。因其高效的语义表征能力，在社会学、管理学等领域展现出广泛的应用价值。 相关词嵌入文献资料可阅读&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/&#34;&gt;OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/&#34;&gt;转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/&#34;&gt;词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-04-09-literature-about-embeddings/&#34;&gt;文献汇总 | 词嵌入 与 社会科学中的偏见(态度)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-04-01-embeddings-and-attitude/&#34;&gt;词嵌入测量不同群体对某概念的态度(偏见)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二环境准备&#34;&gt;二、环境准备&lt;/h2&gt;
&lt;p&gt;cntext2.x 内置了 GloVe 训练所需的环境，支持 win 和 mac。&lt;/p&gt;
&lt;p&gt;获取&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;cntext2.x&lt;/a&gt; 的安装文件 &lt;em&gt;&lt;strong&gt;cntext-2.1.5-py3-none-any.whl&lt;/strong&gt;&lt;/em&gt;，并将该whl文件放置于桌面。执行以下安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;cd desktop
pip install cntext-2.1.5-py3-none-any.whl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GloVe&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dict_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stopwords_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;min_count&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_memory&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;4.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_iter&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;n&#34;&gt;x_max&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;corpus_file&lt;/strong&gt;&lt;/em&gt;: 输入语料文件路径（文本格式）。该文件为分词后的语料文件。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;lang&lt;/strong&gt;&lt;/em&gt;: 语料文件的语言类型，默认为 &amp;lsquo;chinese&amp;rsquo;。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;dict_file&lt;/strong&gt;&lt;/em&gt;: 自定义词典txt文件路径，默认为None。utf-8编码。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;stopwords_file&lt;/strong&gt;&lt;/em&gt;: 停用词文件路径，默认为 None。utf-8编码。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;vector_size&lt;/strong&gt;&lt;/em&gt;: 词向量维度，默认 100。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;window_size&lt;/strong&gt;&lt;/em&gt;: 上下文窗口大小，默认 15。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;min_count&lt;/strong&gt;&lt;/em&gt;: 忽略出现次数低于此值的单词，默认 5。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;max_memory&lt;/strong&gt;&lt;/em&gt;: 可供使用的最大内存大小，单位为GB，默认 4;  该参数越大，训练越快。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;max_iter&lt;/strong&gt;&lt;/em&gt;: 训练的最大迭代次数，默认 15。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;x_max&lt;/strong&gt;&lt;/em&gt;: 共现矩阵中元素的最大计数值，默认 10。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三训练中文glove&#34;&gt;三、训练中文GloVe&lt;/h2&gt;
&lt;p&gt;我们其实只需要设置 &lt;em&gt;&lt;strong&gt;corpus_file&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;lang&lt;/strong&gt;&lt;/em&gt;，但为了让大家直观了解下面两个参数的作用，这里显式传入&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;上下文的窗口大小 &lt;em&gt;&lt;strong&gt;window_size&lt;/strong&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;训练出模型词语的维度数 &lt;em&gt;&lt;strong&gt;vector_size&lt;/strong&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 简化版调用。默认训练 vector_size=100 维， window_size=15&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# glove_wv = ct.GloVe(corpus_file=&amp;#39;data/三体.txt&amp;#39;, lang=&amp;#39;chinese&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 正常调用。训练 vector_size=50 维， window_size=15&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;glove_wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GloVe&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/三体.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                    &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;50&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;n&#34;&gt;only_binary&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 同时保存txt和bin两种格式的模型文件&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;glove_wv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/三体_cache.txt Not Found or Empty, Preprocessing Corpus
Start Training GloVe
BUILDING VOCABULARY
Using vocabulary of size 6975.

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 2106999 lines.

Using random seed 1743474106
SHUFFLING COOCCURRENCES
Merging temp files: processed 2106999 lines.

TRAINING MODEL
Read 2106999 lines.
Using random seed 1743474106
04/01/25 - 10:21.46AM, iter: 001, cost: 0.055981
04/01/25 - 10:21.46AM, iter: 002, cost: 0.050632
......
04/01/25 - 10:21.48AM, iter: 014, cost: 0.030047
04/01/25 - 10:21.48AM, iter: 015, cost: 0.029100

GloVe Training Cost 9 s. 
Output Saved To: output/三体-GloVe.50.15.txt
Output Saved To: output/三体-GloVe.50.15.bin
&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x331517440&amp;gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-glove.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;四使用中文glove模型&#34;&gt;四、使用中文GloVe模型&lt;/h2&gt;
&lt;h3 id=&#34;41-加载模型&#34;&gt;4.1 加载模型&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 加载word2vec模型.txt文件&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/三体-GloVe.50.15.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x336ff8dd0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;42-keyedvectors的操作方法或属性&#34;&gt;4.2 KeyedVectors的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.index_to_key&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.key_to_index&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.vector_size&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取GloVe模型中任意词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.get_vector(word)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的10个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的10个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;hellip;&lt;/td&gt;
&lt;td&gt;&amp;hellip;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;421-词表&#34;&gt;4.2.1 词表&lt;/h3&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index_to_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;的&amp;#39;,
 &amp;#39;了&amp;#39;,
 &amp;#39;在&amp;#39;,
...
 &amp;#39;引力&amp;#39;,
 &amp;#39;所说&amp;#39;,
 &amp;#39;星际&amp;#39;,
 ...]

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;422-词表映射&#34;&gt;4.2.2 词表映射&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;的&amp;#39;: 0,
 &amp;#39;了&amp;#39;: 1,
 &amp;#39;在&amp;#39;: 2,
...
 &amp;#39;引力&amp;#39;: 997,
 &amp;#39;所说&amp;#39;: 998,
 &amp;#39;星际&amp;#39;: 999,
 ...}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;423-向量维度数&#34;&gt;4.2.3 向量维度数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;词表有 &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; 个词&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;向量是 &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; 维&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;词表有 4365 个词
向量是 50 维
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;424-获取词向量&#34;&gt;4.2.4 获取词向量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查看「降临」的词向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;降临&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([ 0.672314,  0.020081,  0.653733,  0.598732, -0.680517, -0.049689,
       -0.16845 , -0.06759 , -0.147955,  0.024006,  0.264551, -0.050127,
        0.252063, -0.475633,  0.103722, -0.012481,  0.040755,  1.154912,
        0.742695,  0.048619, -0.514424, -1.184054,  0.515892, -0.1034  ,
        0.368755, -0.690357, -0.784287, -0.505814,  0.035807, -0.166354,
       -0.26149 ,  0.015089,  0.10626 , -0.215666, -0.374001, -0.123558,
        0.422617, -0.075277, -0.316387, -0.484295,  0.059687,  0.132621,
        0.192094, -0.591919,  0.236281,  0.164198, -0.058724,  1.285457,
        0.905606, -0.52032 ], dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;425-近义词&#34;&gt;4.2.5 近义词&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;叛军&amp;#39;, 0.7699569463729858),
 (&amp;#39;更新&amp;#39;, 0.7687217593193054),
 (&amp;#39;地球&amp;#39;, 0.760529100894928),
 (&amp;#39;全集&amp;#39;, 0.7575182914733887),
 (&amp;#39;最快&amp;#39;, 0.7426372170448303),
 (&amp;#39;世界&amp;#39;, 0.7262137532234192),
 (&amp;#39;最新&amp;#39;, 0.7219281792640686),
 (&amp;#39;游戏&amp;#39;, 0.7180070877075195),
 (&amp;#39;危机&amp;#39;, 0.7020451426506042),
 (&amp;#39;教&amp;#39;, 0.7012627720832825)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;426-计算多个词的中心向量&#34;&gt;4.2.6 计算多个词的中心向量&lt;/h3&gt;
&lt;p&gt;我们可以计算「三体」、「降临」、「组织」、「拯救」四个词的中心向量 eto_vector，并寻找与该中心向量语义最相似的 10 个词。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;eto_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;semantic_centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;降临&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;组织&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;拯救&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eto_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 寻找 eto_vector 语义最相似的10个词&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;wv_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eto_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[ 0.6267875   0.08975425  0.48438451  0.405128   -0.49928901  0.11347825
 -0.90057975  0.11877625 -0.27053049  0.344603    0.4368495  -0.3839495
  0.02633176 -0.138534    0.2531555  -0.0060905  -0.48776849  0.75548999
  0.72575876 -0.446079   -0.30361701 -1.039792    0.457687   -0.4286315
  0.44577325 -0.39119426 -0.4783935  -0.2596135  -0.32513325 -0.10315975
 -0.42880575 -0.48328425  0.129438   -0.17085625 -0.13454625 -0.070053
  0.68060375  0.16736924 -0.15664874 -0.20528575  0.385481    0.206432
  0.18913225 -0.93453825  0.58597099  0.60727924  0.009064    0.87661726
  0.65814423 -0.356567  ]

[(&amp;#39;降临&amp;#39;, 0.8707027435302734),
 (&amp;#39;组织&amp;#39;, 0.8625670671463013),
 (&amp;#39;三体&amp;#39;, 0.8621653914451599),
 (&amp;#39;派&amp;#39;, 0.8343338966369629),
 (&amp;#39;拯救&amp;#39;, 0.8301094174385071),
 (&amp;#39;叛军&amp;#39;, 0.784512460231781),
 (&amp;#39;地球&amp;#39;, 0.7536635398864746),
 (&amp;#39;世界&amp;#39;, 0.7245718836784363),
 (&amp;#39;外部&amp;#39;, 0.7078365087509155),
 (&amp;#39;入侵&amp;#39;, 0.6962169408798218)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
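semantic_centroid 的大致做法可以用一个极简示例理解：假设它先把各词向量归一化，再逐维求平均(这是计算中心向量的常见做法，具体实现以 cntext 文档为准)。

```python
import math

def centroid(vectors):
    # 假设的实现思路：先归一化各向量，再逐维求平均
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    unit = [norm(v) for v in vectors]
    dim = len(vectors[0])
    return [sum(v[i] for v in unit) / len(unit) for i in range(dim)]

vecs = [[1.0, 0.0], [0.0, 1.0]]  # 两个假设的二维词向量
print(centroid(vecs))  # [0.5, 0.5]
```

归一化可以避免高频词的大模长向量主导平均结果。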
&lt;p&gt;熟悉三体的朋友应该能联想到背叛人类的ETO(地球三体组织)有两个派别，分别是拯救派和降临派。&lt;/p&gt;
&lt;p&gt;ETO开发了一款虚拟现实游戏，它向参与者展示了三体世界的真实情况，包括其恶劣的自然条件、三体文明的历史及其科技水平等。通过参与这个游戏，玩家们能够逐渐了解三体世界的真相，并最终决定是否要加入到支持三体文明入侵地球的行列中来。&lt;/p&gt;
&lt;p&gt;这个游戏不仅充当了信息传递的媒介，也是甄别志同道合者的工具，让那些对人类社会现状不满、渴望变革的人们找到了组织，进而成为了背叛人类的叛军一员。在这个过程中，“三体游戏”起到了关键的作用，是连接地球人与三体世界的重要桥梁。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cntext.readthedocs.io/&#34;&gt;文本分析库cntext使用手册 https://cntext.readthedocs.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/&#34;&gt;词向量 | 使用人民网领导留言板语料训练Word2Vec模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/&#34;&gt;使用 5000w 专利申请数据集按年份(按省份)训练词向量&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/&#34;&gt;使用 1000w 条豆瓣影评训练 Word2Vec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/&#34;&gt;词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/&#34;&gt;转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/&#34;&gt;OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一简介">一、简介</h2>
<p><a href="https://nlp.stanford.edu/projects/glove/">Stanford GloVe</a>（Global Vectors for Word Representation）是一种融合全局共现统计信息与局部上下文窗口的词嵌入模型。相较于仅依赖局部上下文的Word2Vec，GloVe利用语料的全局共现统计，能更精准地反映词频分布特征。例如，在高维词向量（如200维）下，GloVe在词语类比任务中准确率可达75%，并在命名实体识别任务中优于其他词嵌入模型。因其高效的语义表征能力，GloVe在社会学、管理学等领域展现出广泛的应用价值。相关词嵌入文献资料可阅读</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总</a></li>
<li><a href="https://textdata.cn/blog/2022-04-09-literature-about-embeddings/">文献汇总 | 词嵌入 与 社会科学中的偏见(态度)</a></li>
<li><a href="https://textdata.cn/blog/2022-04-01-embeddings-and-attitude/">词嵌入测量不同群体对某概念的态度(偏见)</a></li>
</ul>
<p><br><br></p>
<h2 id="二环境准备">二、环境准备</h2>
<p>cntext2.x 内置了 GloVe 训练所需的环境，支持 Windows 和 Mac 系统。</p>
<p>获取<a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.x</a> 的安装文件 <em><strong>cntext-2.1.5-py3-none-any.whl</strong></em>，并将该whl文件放置于桌面。执行以下安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
pip install cntext-2.1.5-py3-none-any.whl
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">dict_file</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">stopwords_file</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">vector_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">max_memory</span><span class="o">=</span><span class="mf">4.0</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">x_max</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><ul>
<li><em><strong>corpus_file</strong></em>: 输入语料文件路径（文本格式）。该文件为分词后的语料文件。</li>
<li><em><strong>lang</strong></em>: 语料文件的语言类型，默认为 &lsquo;chinese&rsquo;。</li>
<li><em><strong>dict_file</strong></em>: 自定义词典txt文件路径，默认为None。utf-8编码。</li>
<li><em><strong>stopwords_file</strong></em>: 停用词文件路径，默认为 None。utf-8编码。</li>
<li><em><strong>vector_size</strong></em>: 词向量维度，默认 100。</li>
<li><em><strong>window_size</strong></em>: 上下文窗口大小，默认 15。</li>
<li><em><strong>min_count</strong></em>: 忽略出现次数低于此值的单词，默认 5。</li>
<li><em><strong>max_memory</strong></em>: 可供使用的最大内存大小，单位为GB，默认 4;  该参数越大，训练越快。</li>
<li><em><strong>max_iter</strong></em>: 训练的最大迭代次数，默认 15。</li>
<li><em><strong>x_max</strong></em>: 共现矩阵中元素的最大计数值，默认 10。</li>
</ul>
<p><br><br></p>
<h2 id="三训练中文glove">三、训练中文GloVe</h2>
<p>其实只需要设置 <em><strong>corpus_file</strong></em> 和 <em><strong>lang</strong></em> 就能完成训练，但为了让大家了解以下两个常用参数，下面的示例中显式传入了</p>
<ul>
<li>上下文的窗口大小 <em><strong>window_size</strong></em></li>
<li>训练出模型词语的维度数 <em><strong>vector_size</strong></em></li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 简化版调用。默认 vector_size=100(维度)、window_size=15(窗口)</span>
<span class="c1"># glove_wv = ct.GloVe(corpus_file=&#39;data/三体.txt&#39;, lang=&#39;chinese&#39;)</span>

<span class="c1"># 正常调用。训练 vector_size=50(维度)、window_size=15(窗口)</span>
<span class="n">glove_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;data/三体.txt&#39;</span><span class="p">,</span> 
                    <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                    <span class="n">vector_size</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
                    <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                    <span class="n">only_binary</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="c1"># 同时保存txt和bin两种格式的模型文件</span>

<span class="n">glove_wv</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/三体_cache.txt Not Found or Empty, Preprocessing Corpus
Start Training GloVe
BUILDING VOCABULARY
Using vocabulary of size 6975.

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 2106999 lines.

Using random seed 1743474106
SHUFFLING COOCCURRENCES
Merging temp files: processed 2106999 lines.

TRAINING MODEL
Read 2106999 lines.
Using random seed 1743474106
04/01/25 - 10:21.46AM, iter: 001, cost: 0.055981
04/01/25 - 10:21.46AM, iter: 002, cost: 0.050632
......
04/01/25 - 10:21.48AM, iter: 014, cost: 0.030047
04/01/25 - 10:21.48AM, iter: 015, cost: 0.029100

GloVe Training Cost 9 s. 
Output Saved To: output/三体-GloVe.50.15.txt
Output Saved To: output/三体-GloVe.50.15.bin
&lt;gensim.models.keyedvectors.KeyedVectors at 0x331517440&gt;

</code></pre></div><p><img loading="lazy" src="img/05-glove.png" alt=""  />
</p>
<br>
<h2 id="四使用中文glove模型">四、使用中文GloVe模型</h2>
<h3 id="41-加载模型">4.1 加载模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 加载训练好的GloVe模型的.bin文件</span>
<span class="n">wv_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/三体-GloVe.50.15.bin&#39;</span><span class="p">)</span>
<span class="n">wv_model</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;gensim.models.keyedvectors.KeyedVectors at 0x336ff8dd0&gt;
</code></pre></div><br>
<h3 id="42-keyedvectors的操作方法或属性">4.2 KeyedVectors的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><em><strong>KeyedVectors.index_to_key</strong></em></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.key_to_index</strong></em></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.vector_size</strong></em></td>
<td>获取GloVe模型中任意词向量的维度。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.get_vector(word)</strong></em></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_word(word, topn=10)</strong></em></td>
<td>获取某词语最相似的10个近义词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_vector(vector, topn=10)</strong></em></td>
<td>获取词向量最相似的10个近义词。</td>
</tr>
<tr>
<td>&hellip;</td>
<td>&hellip;</td>
</tr>
</tbody>
</table>
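<p>上表中各方法的语义可以用一个纯 Python 玩具示例勾勒（示例中的词向量是随手构造的假想数据，仅作示意，并非 gensim 的真实实现）：</p>

```python
import math

# 玩具词向量表(假想数据), 仅用于演示上表几个方法的大致语义
vectors = {
    '三体': [1.0, 0.0, 1.0],
    '地球': [0.9, 0.1, 0.8],
    '游戏': [0.0, 1.0, 0.0],
}

index_to_key = list(vectors)                                # 词表中的所有单词
key_to_index = {w: i for i, w in enumerate(index_to_key)}   # 单词 -> 索引
vector_size = len(next(iter(vectors.values())))             # 词向量维度

def cosine(a, b):
    # 余弦相似度: 点积除以两个向量模长之积
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similar_by_word(word, topn=10):
    # 返回与 word 余弦相似度最高的 topn 个其他词
    sims = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]

print(similar_by_word('三体', topn=2))
```

<p>gensim 的 KeyedVectors 在大规模向量矩阵上做的正是类似的检索，下文各小节演示真实模型上的效果。</p>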
<h3 id="421-词表">4.2.1 词表</h3>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">wv_model</span><span class="o">.</span><span class="n">index_to_key</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;的&#39;,
 &#39;了&#39;,
 &#39;在&#39;,
...
 &#39;引力&#39;,
 &#39;所说&#39;,
 &#39;星际&#39;,
 ...]

</code></pre></div><br>
<h3 id="422-词表映射">4.2.2 词表映射</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">wv_model</span><span class="o">.</span><span class="n">key_to_index</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;的&#39;: 0,
 &#39;了&#39;: 1,
 &#39;在&#39;: 2,
...
 &#39;引力&#39;: 997,
 &#39;所说&#39;: 998,
 &#39;星际&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="423-向量维度数">4.2.3 向量维度数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;词表有 </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">wv_model</span><span class="o">.</span><span class="n">key_to_index</span><span class="p">)</span><span class="si">}</span><span class="s1"> 个词&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;向量是 </span><span class="si">{</span><span class="n">wv_model</span><span class="o">.</span><span class="n">vector_size</span><span class="si">}</span><span class="s1"> 维&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">词表有 4365 个词
向量是 50 维
</code></pre></div><br>
<h3 id="424-获取词向量">4.2.4 获取词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 查看「降临」的词向量</span>
<span class="n">wv_model</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;降临&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 0.672314,  0.020081,  0.653733,  0.598732, -0.680517, -0.049689,
       -0.16845 , -0.06759 , -0.147955,  0.024006,  0.264551, -0.050127,
        0.252063, -0.475633,  0.103722, -0.012481,  0.040755,  1.154912,
        0.742695,  0.048619, -0.514424, -1.184054,  0.515892, -0.1034  ,
        0.368755, -0.690357, -0.784287, -0.505814,  0.035807, -0.166354,
       -0.26149 ,  0.015089,  0.10626 , -0.215666, -0.374001, -0.123558,
        0.422617, -0.075277, -0.316387, -0.484295,  0.059687,  0.132621,
        0.192094, -0.591919,  0.236281,  0.164198, -0.058724,  1.285457,
        0.905606, -0.52032 ], dtype=float32)
</code></pre></div><br>
<h3 id="425-近义词">4.2.5 近义词</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">wv_model</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;三体&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;叛军&#39;, 0.7699569463729858),
 (&#39;更新&#39;, 0.7687217593193054),
 (&#39;地球&#39;, 0.760529100894928),
 (&#39;全集&#39;, 0.7575182914733887),
 (&#39;最快&#39;, 0.7426372170448303),
 (&#39;世界&#39;, 0.7262137532234192),
 (&#39;最新&#39;, 0.7219281792640686),
 (&#39;游戏&#39;, 0.7180070877075195),
 (&#39;危机&#39;, 0.7020451426506042),
 (&#39;教&#39;, 0.7012627720832825)]
</code></pre></div><br>
<h3 id="426-计算多个词的中心向量">4.2.6 计算多个词的中心向量</h3>
<p>我们可以计算「三体」、「降临」、「组织」、「拯救」的中心向量eto_vector，并寻找与该中心向量语义最相似的10个词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">eto_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv_model</span><span class="p">,</span> <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;三体&#39;</span><span class="p">,</span> <span class="s1">&#39;降临&#39;</span><span class="p">,</span> <span class="s1">&#39;组织&#39;</span><span class="p">,</span> <span class="s1">&#39;拯救&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="n">eto_vector</span><span class="p">)</span>
<span class="c1"># 寻找 eto_vector 语义最相似的10个词</span>
<span class="n">wv_model</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">eto_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[ 0.6267875   0.08975425  0.48438451  0.405128   -0.49928901  0.11347825
 -0.90057975  0.11877625 -0.27053049  0.344603    0.4368495  -0.3839495
  0.02633176 -0.138534    0.2531555  -0.0060905  -0.48776849  0.75548999
  0.72575876 -0.446079   -0.30361701 -1.039792    0.457687   -0.4286315
  0.44577325 -0.39119426 -0.4783935  -0.2596135  -0.32513325 -0.10315975
 -0.42880575 -0.48328425  0.129438   -0.17085625 -0.13454625 -0.070053
  0.68060375  0.16736924 -0.15664874 -0.20528575  0.385481    0.206432
  0.18913225 -0.93453825  0.58597099  0.60727924  0.009064    0.87661726
  0.65814423 -0.356567  ]

[(&#39;降临&#39;, 0.8707027435302734),
 (&#39;组织&#39;, 0.8625670671463013),
 (&#39;三体&#39;, 0.8621653914451599),
 (&#39;派&#39;, 0.8343338966369629),
 (&#39;拯救&#39;, 0.8301094174385071),
 (&#39;叛军&#39;, 0.784512460231781),
 (&#39;地球&#39;, 0.7536635398864746),
 (&#39;世界&#39;, 0.7245718836784363),
 (&#39;外部&#39;, 0.7078365087509155),
 (&#39;入侵&#39;, 0.6962169408798218)]
</code></pre></div><br>
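<p>ct.semantic_centroid 的大致思路可以用纯 Python 勾勒：将每个词向量归一化后逐维取平均，得到一组词的「概念中心」向量（此处是基于该假设的最简示意，具体实现以 cntext 文档为准）：</p>

```python
import math

def normalize(v):
    # 将向量缩放为单位长度
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def semantic_centroid_sketch(vecs):
    # 各向量先归一化, 再逐维取平均, 得到"概念中心"向量
    unit = [normalize(v) for v in vecs]
    dim = len(unit[0])
    return [sum(u[i] for u in unit) / len(unit) for i in range(dim)]

centroid = semantic_centroid_sketch([[2.0, 0.0], [0.0, 1.0]])
print(centroid)  # [0.5, 0.5]
```

<p>先归一化再平均，可以避免模长较大的词主导中心向量的方向。</p>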
<p>熟悉三体的朋友应该能联想到背叛人类的ETO(地球三体组织)有两个派别，分别是拯救派和降临派。</p>
<p>ETO开发了一款虚拟现实游戏，它向参与者展示了三体世界的真实情况，包括其恶劣的自然条件、三体文明的历史及其科技水平等。通过参与这个游戏，玩家们能够逐渐了解三体世界的真相，并最终决定是否要加入到支持三体文明入侵地球的行列中来。</p>
<p>这个游戏不仅充当了信息传递的媒介，也是甄别志同道合者的工具，让那些对人类社会现状不满、渴望变革的人们找到了组织，进而成为了背叛人类的叛军一员。在这个过程中，“三体游戏”起到了关键的作用，是连接地球人与三体世界的重要桥梁。</p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://cntext.readthedocs.io/">文本分析库cntext使用手册 https://cntext.readthedocs.io/</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">使用 1000w 条豆瓣影评训练 Word2Vec</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>资源 | 不同版本Chrome适配的chromedriver下载链接</title>
      <link>https://textdata.cn/blog/2025-03-24-setting-chromedriver-environment-for-selenium/</link>
      <pubDate>Mon, 24 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-24-setting-chromedriver-environment-for-selenium/</guid>
      <description>&lt;h2 id=&#34;一chrome版本号&#34;&gt;一、Chrome版本号&lt;/h2&gt;
&lt;p&gt;将Chrome的版本号划分为&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;古代版本 70.0.3538.16 ~ 114.0.5735.90&lt;/li&gt;
&lt;li&gt;近代版本 115.0.5739.0 ~ 127.0.6533.72&lt;/li&gt;
&lt;li&gt;现代版本 127.0.6533.88 ~ 134.0.6998.165&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二chromedriver下载&#34;&gt;二、chromedriver下载&lt;/h2&gt;
&lt;h3 id=&#34;21-古代版本&#34;&gt;2.1 古代版本&lt;/h3&gt;
&lt;p&gt;Chrome版本号在 70.0.3538.16 ~ 114.0.5735.90 之间的， 可以通过如下方式下载&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;平台      下载链接
https://chromedriver.storage.googleapis.com/index.html

Linux64   https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_linux64.zip
Mac       https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_mac64.zip
Win32     https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_win32.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-近代版本&#34;&gt;2.2 近代版本&lt;/h3&gt;
&lt;p&gt;Chrome版本号在 115.0.5739.0 ~ 127.0.6533.72 之间的， 可以通过如下方式下载chromedriver&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://chromedriver.storage.googleapis.com/115.0.5739.0/chromedriver_linux64.zip&#34;&gt;https://chromedriver.storage.googleapis.com/115.0.5739.0/chromedriver_linux64.zip&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;平台          下载链接

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-现代版本&#34;&gt;2.3 现代版本&lt;/h3&gt;
&lt;p&gt;Chrome版本号在 127.0.6533.88 ~ 134.0.6998.165 之间的， 可以通过如下方式下载chromedriver&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;
平台          下载链接
Linux64	      https://storage.googleapis.com/chrome-for-testing-public/{版本号}/linux64/chromedriver-linux64.zip

Mac(M芯片)	   https://storage.googleapis.com/chrome-for-testing-public/{版本号}/mac-arm64/chromedriver-mac-arm64.zip

Mac            https://storage.googleapis.com/chrome-for-testing-public/{版本号}/mac-x64/chromedriver-mac-x64.zip

win32	       https://storage.googleapis.com/chrome-for-testing-public/{版本号}/win32/chromedriver-win32.zip

win64	       https://storage.googleapis.com/chrome-for-testing-public/{版本号}/win64/chromedriver-win64.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;此外， 更新的开发版可以在此处 &lt;a href=&#34;https://googlechromelabs.github.io/chrome-for-testing/#stable&#34;&gt;https://googlechromelabs.github.io/chrome-for-testing/#stable&lt;/a&gt; 找到下载资源&lt;/p&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一chrome版本号">一、Chrome版本号</h2>
<p>将Chrome的版本号划分为</p>
<ul>
<li>古代版本 70.0.3538.16 ~ 114.0.5735.90</li>
<li>近代版本 115.0.5739.0 ~ 127.0.6533.72</li>
<li>现代版本 127.0.6533.88 ~ 134.0.6998.165</li>
</ul>
<p><br><br></p>
<h2 id="二chromedriver下载">二、chromedriver下载</h2>
<h3 id="21-古代版本">2.1 古代版本</h3>
<p>Chrome版本号在 70.0.3538.16 ~ 114.0.5735.90 之间的， 可以通过如下方式下载</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">平台      下载链接
https://chromedriver.storage.googleapis.com/index.html

Linux64   https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_linux64.zip
Mac       https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_mac64.zip
Win32     https://chromedriver.storage.googleapis.com/{版本号}/chromedriver_win32.zip
</code></pre></div><br>
<h3 id="22-近代版本">2.2 近代版本</h3>
<p>Chrome版本号在 115.0.5739.0 ~ 127.0.6533.72 之间的， 可以通过如下方式下载chromedriver</p>
<p><a href="https://chromedriver.storage.googleapis.com/115.0.5739.0/chromedriver_linux64.zip">https://chromedriver.storage.googleapis.com/115.0.5739.0/chromedriver_linux64.zip</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">平台          下载链接

</code></pre></div><br>
<h3 id="23-现代版本">2.3 现代版本</h3>
<p>Chrome版本号在 127.0.6533.88 ~ 134.0.6998.165 之间的， 可以通过如下方式下载chromedriver</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">
平台          下载链接
Linux64	      https://storage.googleapis.com/chrome-for-testing-public/{版本号}/linux64/chromedriver-linux64.zip

Mac(M芯片)	   https://storage.googleapis.com/chrome-for-testing-public/{版本号}/mac-arm64/chromedriver-mac-arm64.zip

Mac            https://storage.googleapis.com/chrome-for-testing-public/{版本号}/mac-x64/chromedriver-mac-x64.zip

win32	       https://storage.googleapis.com/chrome-for-testing-public/{版本号}/win32/chromedriver-win32.zip

win64	       https://storage.googleapis.com/chrome-for-testing-public/{版本号}/win64/chromedriver-win64.zip
</code></pre></div><br>
<p>此外， 更新的开发版可以在此处 <a href="https://googlechromelabs.github.io/chrome-for-testing/#stable">https://googlechromelabs.github.io/chrome-for-testing/#stable</a> 找到下载资源</p>
<br>
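<p>上面两套 URL 规则可以用一个小函数按版本号自动拼接（chromedriver_url 为演示用的假想函数名，规则仅覆盖文中列出的两类仓库）：</p>

```python
def chromedriver_url(version, platform):
    # 按大版本号选择下载仓库: <=114 走旧仓库, 否则走 chrome-for-testing 仓库
    major = int(version.split('.')[0])
    if major <= 114:
        # 古代版本: 平台取 linux64 / mac64 / win32
        return ('https://chromedriver.storage.googleapis.com/'
                f'{version}/chromedriver_{platform}.zip')
    # 现代版本: 平台取 linux64 / mac-arm64 / mac-x64 / win32 / win64
    return ('https://storage.googleapis.com/chrome-for-testing-public/'
            f'{version}/{platform}/chromedriver-{platform}.zip')

print(chromedriver_url('114.0.5735.90', 'win32'))
print(chromedriver_url('134.0.6998.165', 'mac-arm64'))
```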
]]></content:encoded>
    </item>
    
    <item>
      <title>新闻数据集(中文) | 含 人民日报/光明日报/参考消息/经济日报 等 120 家媒体(2025.03)</title>
      <link>https://textdata.cn/blog/2023-12-14-daily-news-dataset/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-14-daily-news-dataset/</guid>
      <description>日报数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<h2 id="本文声明">本文声明</h2>
<p>科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<p><br><br></p>
<h2 id="一中文新闻报刊数据集概况">一、「中文新闻报刊数据集」概况</h2>
<p>报纸(数字版)数据集，媒体源 120 家，</p>
<ul>
<li>35家国家级，如 人民日报、光明日报、经济日报、人民政协报、中国青年报等</li>
<li>85家省市级报刊(覆盖30个省份)，如 新华日报(江苏)、扬子晚报(江苏)；河北日报、燕赵晚报；天津日报、今晚报；宁波日报、青岛日报、杭州日报等</li>
</ul>
<blockquote>
<p>需要注意，日报一般偏正式、严肃；而晚报、商报、都市报内容更多样，风格较为轻松。大家使用前请留意内容风格差异。</p>
</blockquote>
<br>
<blockquote>
<p>如Excel打开csv乱码， 请百度搜【在 Excel 中正确打开 CSV UTF-8 文件】</p>
</blockquote>
<br>
<h2 id="11-国家级报刊">1.1 国家级报刊</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">+------+----------------+-------------------------+---------+-----------+
| 省份 |      报刊       |         起止日期          |  记录数  |    体积   |
+------+----------------+-------------------------+---------+-----------+
|      |    新闻联播    | 2016-02-04 ~ 2025-03-22 |   44623  |  164 M  |
|      |    人民日报    | 1946-05-15 ~ 2025-03-22 | 2045970 | 3968.50 M |
|      |    光明日报    | 1985-01-01 ~ 2024-06-22 |  862987 |  4022.7 M |
|      |    经济日报    | 2008-01-27 ~ 2025-03-22 |  380934 |  1230 M |
|      |   中国青年报   | 2005-01-01 ~ 2025-03-22 |  340700 | 1075.73 M |
|      |    农民日报    | 2011-01-01 ~ 2025-03-22 |  219795 | 1057.30 M |
|      |   人民政协报   | 2008-01-02 ~ 2024-05-24 |  346525 |  734.6 M  |
|      |   中国消费报   | 2010-01-01 ~ 2025-03-15 |  110451 |  732.40 M |
|      |    参考消息    | 1957-03-09 ~ 2002-12-31 |  528545 |  633.15 M |
|      |   经济参考报   | 2015-01-05 ~ 2025-03-22 |  97196  |  646.16 M |
|      |   人民法院报   | 2010-01-01 ~ 2025-03-22 |  165747 |  435.58 M |
|      |    工人日报    | 2014-01-01 ~ 2025-03-22 |  206364 |  412.78 M |
|      |   中国气象报   | 1989-01-16 ~ 2025-03-22 |  234826 |  366.86 M |
|      |  中国经济导报  | 2012-09-01 ~ 2024-06-22 |  51371   |  312.6 M |
|      |    法治日报    | 2021-01-01 ~ 2024-06-22 |  60984  |  290.40 M |
|      |   中国贸易报   | 2011-01-25 ~ 2025-03-22 |  75419  |  145.80 M |
|      |   中国工业报   | 2012-02-23 ~ 2024-05-24 |  90987  |  170.18 M |
|      |    消费日报    | 2019-10-08 ~ 2025-03-22 |   6321  |  166.85 M  |
|      |  每日经济新闻  | 2018-02-01 ~ 2025-03-21 |  47215  |  178.981 M |
|      |   中国工商报   | 2016-01-05 ~ 2024-05-24 |  70673  |  126.33 M |
|      |   中国财经报   | 2017-11-11 ~ 2025-03-22 |  53533  |  142.52 M |
|       |   中国企业报   | 2011-04-01 ~ 2025-03-22 |  49735  |  124.15 M |
|      |   中国经营报   | 2022-01-03 ~ 2025-03-22 |   11614  |  114.73 M |
|      |    检察日报    | 2022-01-01 ~ 2025-03-22 |  47743  |  133.30 M  |
|      |   中国城市报   | 2021-01-04 ~ 2025-03-22 |   8325  |  31.58 M  |
|      |   中国教育报   | 2021-01-01 ~ 2025-03-22 |  24060  |  92.02 M  |
|      |    科技日报    | 2021-01-04 ~ 2025-03-22 |  39811  |  106.54 M  |
|      |   中国妇女报   | 2021-01-20 ~ 2025-03-22 |  33005  |  93.51 M  |
|      |   中国能源报   | 2019-01-07 ~ 2025-03-22 |  20436  |  61.92 M  |
|      | 中国政府采购报 | 2017-11-17 ~ 2025-03-22 |  25206  |  77.98 M  |
|      |   中国电影报   | 2019-05-29 ~ 2025-03-22 |  14591  |  60.28 M  |
|      |    科普时报    | 2018-01-05 ~ 2025-03-22 |  14188  |  48.59 M  |
+------+----------------+-------------------------+---------+-----------+
</code></pre></div><b>
<h2 id="12-省市级">1.2 省市级</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">+--------+--------------+-------------------------+--------+-----------+
|  省份  |     报刊     |         起止日期        | 记录数 |    体积   |
+--------+--------------+-------------------------+--------+-----------+
|  北京  |    新京报    | 2012-01-01 ~ 2024-05-24 | 117047 |  402.23 M |
|  北京  |   北京日报   | 2021-01-01 ~ 2025-03-22 | 90313  |  333.57 M |
|  北京  |   北京晚报   | 2020-07-13 ~ 2025-03-22 | 100659  |  305.71 M |
|  上海  |    文汇报    | 2019-01-01 ~ 2025-03-22 | 84888  |  381.06 M |
|  上海  |   解放日报   | 2023-01-01 ~ 2024-12-15 | 34388  |  138.69 M |
|  上海  |   新民晚报   | 2018-12-28 ~ 2025-03-22 | 13488  |  75.56 M |
|  天津  |   天津日报   | 2022-09-01 ~ 2025-03-22 | 61052  |  166.44 M  |
|  天津  |    今晚报    | 2023-12-25 ~ 2025-03-22 | 50457  |  83.49 M  |
|  重庆  |   重庆日报   | 2022-01-01 ~ 2025-03-22 | 49056  |  210.46 M |
|  重庆  |   重庆晚报   | 2023-01-03 ~ 2025-03-22 |  12075  |  26.27 M  |
|  辽宁  |   辽宁日报   | 2019-01-01 ~ 2025-03-07 | 116778 |  292.69 M |
|  辽宁  |   辽沈晚报   | 2018-09-05 ~ 2025-03-21 | 69911  |  200.74 M |
|  辽宁  |   半岛晨报   | 2017-02-04 ~ 2023-04-14 | 101453 |  221.38 M |
|  吉林  |   吉林日报   | 2022-01-01 ~ 2025-03-22 | 39130  |  122.34 M  |
|  吉林  |   城市晚报   | 2016-11-14 ~ 2025-03-22 | 82037  |  192.84 M |
| 黑龙江 |    生活报    | 2020-08-22 ~ 2025-03-22 | 37611  |  77.85 M |
| 黑龙江 |  黑龙江日报  | 2020-12-06 ~ 2025-03-22 | 54712  |  171.98 M  |
|  山东  |   齐鲁晚报   | 2012-01-01 ~ 2025-01-27 | 836741 |  941.76 M |
|  山东  |  半岛都市报  | 2017-01-01 ~ 2024-05-24 | 191003 |  830.37 M |
|  山东  |   大众日报   | 2021-01-01 ~ 2025-03-22 | 72185  |  344.34 M |
|  山东  |   济南日报   | 2022-11-01 ~ 2025-03-22 | 26183  |  214.10 M  |
|  山东  |   济南时报   | 2022-11-01 ~ 2025-03-22 | 22651  |  131.12 M  |
|  山东  |  经济观察报  | 2006-01-02 ~ 2025-03-22 | 63673  |  320.74 M |
|  山东  |   青岛日报   | 2022-05-29 ~ 2025-03-22 | 41530  |  150.73 M  |
|  山东  |   青岛晚报   | 2018-01-01 ~ 2020-04-18 | 58930  |  76.73 M  |
|  江苏  |   新华日报   | 2021-12-01 ~ 2025-03-22 | 93524  |  349.58 M |
|  江苏  |   南京日报   | 2024-01-01 ~ 2025-03-22 |  22462  |  70.37 M  |
|  江苏  |   扬子晚报   | 2020-08-01 ~ 2025-03-22 | 86034  |  216.76 M |
|  浙江  |   杭州日报   | 2022-01-01 ~ 2025-03-22 | 65039  |  216.11 M |
|  浙江  |   钱江晚报   | 2006-01-01 ~ 2025-03-22 | 680553 | 1522.17 M |
|  浙江  |   每日商报   | 2022-01-01 ~ 2025-03-22 | 54631  |  140.01 M |
|  浙江  |   浙江日报   | 2006-01-01 ~ 2025-03-22 | 444705 |  1200 M |
|  浙江  |   宁波日报   | 2014-01-01 ~ 2025-03-22 | 156861 |  485.16 M |
|  浙江  |   都市快报   | 2022-01-01 ~ 2025-03-22 | 61844  |  186.12 M |
|  河北  |   河北日报   | 2018-01-02 ~ 2025-03-22 | 155344 |  527.95 M |
|  河北  |   燕赵晚报   | 2021-01-01 ~ 2025-03-22 | 47526  |  138.88 M |
|  河南  |    大河报    | 2010-06-09 ~ 2024-05-23 | 300201 | 1273.86 M |
|  河南  |   河南商报   | 2007-11-20 ~ 2024-05-17 | 98273  |  468.26 M |
|  河南  |   郑州晚报   | 2008-06-02 ~ 2024-05-24 | 474628 |  1553.1 M |
|  安徽  |   安徽商报   | 2007-03-28 ~ 2025-03-22 | 97426  |  221.47 M |
|  安徽  |   安徽日报   | 2023-06-25 ~ 2025-03-22 | 35021  |  90.93 M  |
|  安徽  |   新安晚报   | 2022-01-04 ~ 2025-03-22 | 33270  |  79.97 M  |
|  安徽  |   合肥晚报   | 2023-06-25 ~ 2025-03-20 | 21646  |  55.13 M  |
|  安徽  |   合肥日报   | 2023-06-25 ~ 2025-03-22 | 20175  |  52.19 M  |
|  湖北  |  楚天都市报  | 2023-01-01 ~ 2025-03-22 | 25889  |  75.64 M  |
|  湖北  |   湖北日报   | 2023-01-01 ~ 2025-03-22 | 40700  |  129.34 M |
|  湖南  |   湖南日报   | 2021-01-01 ~ 2025-03-22 | 93529  |  307.3 M  |
|  湖南  |   潇湘晨报   | 2008-01-01 ~ 2024-05-24 | 267006 |  536.57 M |
|  江西  |   江西日报   | 2018-09-01 ~ 2025-03-22 | 139453 |  372.83 M |
|  福建  |   福建日报   | 2023-04-01 ~ 2025-03-22 | 36185  |  116.32 M  |
|  福建  |   福州日报   | 2021-04-24 ~ 2025-03-22 | 52508  |  133.53 M  |
|  福建  |   福州晚报   | 2023-01-01 ~ 2025-03-22 | 31196  |  60.22 M  |
|  福建  |   厦门日报   | 2022-08-01 ~ 2025-03-22 | 40189  |  124.79 M  |
|  福建  |  海峡都市报  | 2022-08-12 ~ 2025-03-22 | 25663  |  80.72 M  |
|  福建  |   厦门晚报   | 2022-08-01 ~ 2025-03-22 | 31102  |   67.4 M  |
|  广东  |   南方周末   | 2008-01-02 ~ 2023-05-31 | 75734  |  872.59 M |
|  广东  |   羊城晚报   | 2018-01-01 ~ 2024-05-24 | 208619 |  863.59 M |
|  广东  |  深圳特区报  | 2017-05-01 ~ 2025-03-22 | 178568 |  836.4 M  |
|  广东  |  珠海特区报  | 2018-01-01 ~ 2025-03-22 | 145925 |  523.58 M |
|  广东  |  南方都市报  | 2020-01-01 ~ 2025-03-22 | 67300  |  522.24 M |
|  广东  |   南方日报  | 2023-01-01 ~ 2025-03-22 | 73086  |  405.51 M |
|  广东  |   深圳晚报   | 2017-05-02 ~ 2025-03-22 | 106239 |  390.9 M  |
|  广东  |   珠江晚报   | 2018-01-01 ~ 2025-03-22 | 90142  |  130.94 M  |
|  广东  |   广州日报   | 2022-05-29 ~ 2025-03-22 | 33159  |  174.92 M  |
|  广西  |   广西日报   | 2020-01-01 ~ 2025-03-22 | 196384 |  403.17 M |
|  海南  |   海南日报   | 2008-03-01 ~ 2025-03-22 | 532478 |  1200.39 M |
|  海南  |  南国都市报  | 2013-01-01 ~ 2025-03-22 | 310493 |  539.18 M |
|  云南  |   云南日报   | 2021-05-15 ~ 2025-03-22 | 75128  |  192.66 M |
|  云南  |   春城晚报   | 2019-01-02 ~ 2025-03-22 | 72508  |  173.62 M |
|  贵州  |   贵州日报   | 2022-01-01 ~ 2025-03-22 | 82836  |  236.38 M |
|  四川  |  华西都市报  | 2009-01-01 ~ 2025-03-22 | 274883 | 683.56 M |
|  四川  |   四川日报   | 2022-01-01 ~ 2025-03-22 | 39919  |  141.44 M  |
|  甘肃  |   甘肃日报   | 2018-01-01 ~ 2025-03-22 | 131932 |  406.6 M  |
|  甘肃  | 甘肃经济日报  | 2017-04-06 ~ 2025-03-22 | 88153  |  213.59 M |
|  陕西  |   陕西日报   | 2020-01-01 ~ 2025-03-22 | 77204  |  268.79 M |
|  陕西  |   西安日报   | 2019-06-10 ~ 2025-03-22 | 91032  |  261.26 M |
|  陕西  |   西安晚报   | 2019-06-10 ~ 2025-03-12 | 85106  |  219.2 M  |
|  山西  |   山西晚报   | 2021-01-01 ~ 2025-03-12 | 38930  |  108.47 M |
|  山西  |   山西日报   | 2022-08-01 ~ 2025-03-22 | 52260  |  97.54 M  |
|  宁夏  |   宁夏日报   | 2022-02-01 ~ 2025-03-22 | 49040  |  148.85 M  |
| 内蒙古 |  内蒙古日报  | 2017-01-01 ~ 2025-03-22 | 118069 |  355.55 M |
|  新疆  |   新疆日报   | 2018-01-01 ~ 2025-03-22 | 99149  |  325.84 M |
|  西藏  |   西藏日报   | 2019-12-01 ~ 2025-03-22 | 65555  |  233.25 M |
|  青海  |   青海日报   | 2022-01-01 ~ 2025-03-22 | 51042  |  172.16 M  |
|  青海  |  西海都市报  | 2022-01-01 ~ 2025-03-22 | 34222  |  93.34 M  |
+--------+--------------+-------------------------+--------+-----------+
</code></pre></div><br>
<h2 id="13-数据格式">1.3 数据格式</h2>
<p>所有数据均为 <em><strong>csv</strong></em> 文件，所含字段<em><strong>date</strong></em>、<em><strong>title</strong></em>、<em><strong>content</strong></em> 。数据集总体积 40+G。</p>
<blockquote>
<p>少数几家媒体只含 date、content 两个字段，如人民日报、光明日报、中国青年报、人民政协报</p>
</blockquote>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-经济日报">2.1 经济日报</h3>
<p>国家级媒体，如人民日报、光明日报、中国青年报、人民政协报，时间跨度长、记录量大，特别适合构建诸如经济政策不确定性指数(EPU)之类的指标。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;经济日报.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-jjrb-df.png" alt=""  />
</p>
<br>
<h3 id="22-海南日报">2.2 海南日报</h3>
<p>省级日报中数据量相对较大的一份， 覆盖日期 2008~2025。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;海南日报.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-hnrb.png" alt=""  />
</p>
<br>
<h3 id="23-钱江晚报">2.3 钱江晚报</h3>
<p>浙江省的省级都市报，记录数较多， 覆盖日期 2006~2025。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;钱江晚报.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三数据用途">三、数据用途</h2>
<p><a href="https://www.yunzhan365.com/newspapers/catalog.html">中文新闻报刊类</a>数据集可提取丰富的指标，包括但不限于 <strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、<strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>；此外还可用于训练词向量、开发新的概念词典。数据均含日期字段，按主体、日期计算上述指标即可构造面板数据，进而构建新的指标指数。因此该数据集在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
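<p>以「按月统计关键词提及量」为例， 面板数据的构造思路可用几行 pandas 示意(以下三条新闻记录与关键词「数字经济」均为虚构示例)：</p>

```python
import pandas as pd

# 虚构的三条新闻记录，示意按月聚合关键词提及量
df = pd.DataFrame({
    'date': ['2023-01-05', '2023-01-20', '2023-02-11'],
    'content': ['数字经济发展', '经济运行平稳', '数字经济与数字政府'],
})
df['date'] = pd.to_datetime(df['date'])

# 每条新闻中关键词出现的次数
df['mention'] = df['content'].str.count('数字经济')

# 按月聚合，得到该媒体的关键词提及量时间序列
panel = df.groupby(df['date'].dt.to_period('M'))['mention'].sum()
print(panel.tolist())  # [1, 1]
```

<p>对多家媒体分别计算后纵向拼接， 即得到「媒体 × 月份」的面板数据。</p>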
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><br>
<br>
<h2 id="四相关内容">四、相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/"><strong>LIST | 可供社科(经管)领域使用的数据集汇总</strong></a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 5513w条外文电影评论数据(1900~2021.9)</title>
      <link>https://textdata.cn/blog/2025-03-17-the_mother_of_all_movie_review_datasets/</link>
      <pubDate>Mon, 17 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-17-the_mother_of_all_movie_review_datasets/</guid>
<description>数据集采集自Rotten Tomatoes网站， 含 10411 部电影，5513万&#43; 条用户评价！其中有 100 万&#43; 为精选评论！ 电影从20世纪初到2024年的都有！含英语、法语、日语、印地语以及许多其他语言的电影！ 该数据集的用途包括计算机科学自然语言处理，社会学文化演变、刻板印象，传播学等。</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名: 电影评论数据
数据来源: https://www.rottentomatoes.com/
电影年份: 1902 ~ 2024
评论日期: 1996-01-19 ~ 2024-07-17
评论数量: 55130430 (5513w)
评论人数: 8766682 (876w)
电影数量: 10411(9026英文，其余为各种语言)
所含字段: 电影id、评论者id、评论文本、评分、电影上映日期等。
数据格式: csv
下载数据: https://www.kaggle.com/datasets/bwandowando/rotten-tomatoes-9800-movie-critic-and-user-reviews
本文声明: 科研用途； 如分享有问题，可加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;user_reviews.csv&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">55130430
CPU times: user 1min 29s, sys: 14.4 s, total: 1min 43s
Wall time: 1min 47s
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
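<p>5500 万行全量读取耗时较长。若只需要部分字段， 可用 read_csv 的 usecols 参数只载入所需列， 明显节省时间和内存。下面用 StringIO 模拟数据文件(其中 score、review 两个列名为假设， 以真实文件的字段为准)：</p>

```python
import io
import pandas as pd

# 模拟 user_reviews.csv 的前几行(score、review 列名为假设)
raw = io.StringIO(
    "movieId,userId,creationDate,score,review\n"
    "m1,u1,2020-01-01,4.5,good\n"
    "m2,u2,2021-05-02,2.0,bad\n"
)

# 仅读取需要的三列，其余列不载入内存
df = pd.read_csv(raw, usecols=['movieId', 'creationDate', 'score'])
print(df.columns.tolist())  # ['movieId', 'creationDate', 'score']
```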
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;creationDate&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;creationDate&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;评论覆盖日期: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;creationDate&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;creationDate&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">评论覆盖日期: 1996-01-19 ~ 2024-07-17
</code></pre></div><br>
<h3 id="23-电影数量">2.3 电影数量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">movieId</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">10411
</code></pre></div><br>
<h3 id="24-评论人数">2.4 评论人数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">userId</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">8766682
</code></pre></div><br>]]></content:encoded>
    </item>
    
    <item>
      <title>推荐 | 文本分析库 cntext 使用手册</title>
      <link>https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/</link>
      <pubDate>Fri, 14 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-27-cntext2x-usage-tutorial/</guid>
      <description>社会学、经济学、管理学等学科领域，中文文本分析Python库cntext2x。</description>
      <content:encoded><![CDATA[<h2 id="cntext面向社会科学研究的中文文本分析工具库">cntext：面向社会科学研究的中文文本分析工具库</h2>
<p>cntext 是专为<strong>社会科学实证研究者</strong>设计的中文文本分析 Python 库。它不止于词频统计式的传统情感分析，还支持词嵌入训练与语义投影计算，<strong>可从大规模非结构化文本中测量抽象构念</strong>，如态度、认知、文化观念与心理状态。</p>
<p>🎯 <strong>你能用它做什么</strong></p>
<ol>
<li>
<p>构建结构化研究数据集</p>
<ul>
<li>汇总多个文本文件（txt/pdf/docx/csv）为 DataFrame：<code>ct.read_files()</code></li>
<li>提取上市公司年报中的“管理层讨论与分析”（MD&amp;A）：<code>ct.extract_mda()</code></li>
<li>计算文本可读性指标（如Flesch指数）：<code>ct.readability()</code></li>
</ul>
</li>
<li>
<p><strong>基础文本分析(传统方法)</strong></p>
<ul>
<li>词频统计与关键词提取：<code>ct.word_count()</code></li>
<li>情感分析（可选hownet、dutir等内置词典）：<code>ct.sentiment()</code></li>
<li>文本相似度计算（余弦距离）：<code>ct.cosine_sim()</code></li>
</ul>
</li>
<li>
<p><strong>测量内隐态度与文化变迁</strong></p>
<ul>
<li>两行代码训练领域专用词向量（Word2Vec/GloVe）：<code>ct.Word2Vec()</code></li>
<li>构建概念语义轴（如“创新 vs 守旧”）：<code>ct.generate_concept_axis()</code></li>
<li>通过语义投影量化刻板印象、组织文化偏移：<code>ct.project_text()</code></li>
<li>计算文本对应的词嵌入投影得分WEPA：<code>ct.wepa()</code></li>
</ul>
</li>
<li>
<p><strong>融合大模型进行结构化分析</strong></p>
<ul>
<li>调用 LLM 对文本进行语义解析，返回结构化结果（如情绪维度、意图分类）：<code>ct.llm()</code></li>
</ul>
</li>
</ol>
<p>cntext 不追求黑箱预测，而致力于让文本成为理论驱动的科学测量工具。 开源免费，欢迎学界同仁使用、验证与共建。</p>
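<p>上面提到的「概念语义轴」与「语义投影」， 其基本思想可用一个极简的 numpy 示例说明(以下 4 维词向量与数值均为虚构， 仅示意计算过程； 实际研究中应使用 ct.Word2Vec 训练的模型， 配合 ct.generate_concept_axis 与 ct.project_text)：</p>

```python
import numpy as np

# 虚构的 4 维词向量，仅示意"概念轴投影"的计算过程
wv = {
    '创新': np.array([1.0, 0.2, 0.0, 0.1]),
    '守旧': np.array([-0.9, 0.1, 0.0, 0.0]),
    '改革': np.array([0.8, 0.3, 0.1, 0.0]),
}

axis = wv['创新'] - wv['守旧']            # 概念轴: 创新 vs 守旧
axis = axis / np.linalg.norm(axis)        # 单位化
score = float(np.dot(wv['改革'], axis))   # "改革"在概念轴上的投影值
print(score > 0)  # True: 投影值为正，偏向"创新"一端
```

<p>投影值的符号与大小即可作为文本或词语在该概念维度上的量化测量。</p>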
<p><br><br></p>
<h2 id="安装-cntext">安装 cntext</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><br>
<p>需要注意， <strong>cntext 使用环境为 Python3.9 ~ 3.12</strong>。如安装失败，问题可能出在 Python 版本上。</p>
<br>
<p>如 cntext 安装成功但导入失败，可先将 scipy、numpy、gensim 降至以下指定版本，看能否解决问题。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install scipy==1.12.0
pip install numpy==1.26.4
pip install gensim==4.3.3
</code></pre></div><p><br><br></p>
<h2 id="功能模块">功能模块</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">ct</span><span class="o">.</span><span class="n">hello</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/01-hello.jpg" alt=""  />
</p>
<p>cntext 含 io、model、stats、plot、mind、llm 六个模块</p>
<ol>
<li>导入数据用 io</li>
<li>训练模型扩展词典用 model</li>
<li>统计词频、情感分析、相似度等用 stats</li>
<li>可视化模块 plot</li>
<li>态度认知文化变迁用 mind</li>
<li>大模型 LLM</li>
</ol>
<p>函数部分加粗的为常用函数。</p>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.get_cntext_path()</em></strong></td>
<td>查看 cntext 安装路径</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.get_dict_list()</em></strong></td>
<td>查看 cntext 内置词典</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.get_files(fformat)</code></td>
<td>查看符合 fformat 路径规则的所有的文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.detect_encoding(file, num_lines=100)</code></td>
<td>诊断 txt、csv 编码格式</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_yaml_dict(yfile)</em></strong></td>
<td>读取内置 yaml 词典</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_pdf(file)</em></strong></td>
<td>读取 PDF 文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_docx(file)</em></strong></td>
<td>读取 docx 文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_file(file, encoding)</em></strong></td>
<td>读取文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_files(fformat, encoding)</em></strong></td>
<td>读取符合 fformat 路径规则的所有的文件，返回 df</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.extract_mda(text, kws_pattern)</em></strong></td>
<td>提取 A 股年报中的 MD&amp;A 文本内容。如果返回'',则提取失败。</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.traditional2simple(text)</em></strong></td>
<td>繁体转简体</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.clean_text(text, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>根据指定语言对文本进行标准化清洗。</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.fix_text(text)</em></strong></td>
<td>将不正常的、混乱编码的文本转化为正常的文本。例如全角转半角</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.fix_contractions(text)</code></td>
<td>英文缩写(含俚语表达)处理， 如 you&rsquo;re -&gt; you are</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.Word2Vec(corpus_file, encoding, lang=&lsquo;chinese&rsquo;, &hellip;)</em></strong></td>
<td>训练 Word2Vec</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.GloVe(corpus_file, encoding, lang=&lsquo;chinese&rsquo;, &hellip;)</em></strong></td>
<td>GloVe, 底层使用的 <a href="https://github.com/stanfordnlp/GloVe">stanfordnlp/GloVe</a></td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.evaluate_similarity(wv, file=None)</em></strong></td>
<td>使用近义法评估模型表现，默认使用内置的数据进行评估。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.evaluate_analogy(wv, file=None)</em></strong></td>
<td>使用类比法评估模型表现，默认使用内置的数据进行评估。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.glove2word2vec(glove_file, word2vec_file)</em></strong></td>
<td>将 GLoVe 模型.txt 文件转化为 Word2Vec 模型.txt 文件； 一般很少用到</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.load_w2v(wv_path)</em></strong></td>
<td>读取 cntext2.x 训练出的 Word2Vec/GloVe 模型文件</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.expand_dictionary(wv, seeddict, topn=100)</em></strong></td>
<td>扩展词典, 结果保存到路径[output/Word2Vec]中</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><code>ct.SoPmi(corpus_file, seed_file, lang='chinese')</code></td>
<td>共现法扩展词典</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.word_count(text, lang='chinese')</code></td>
<td>词频统计</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.readability(text, lang='chinese', syllables=3)</code></td>
<td>文本可读性</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.sentiment(text, diction, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>无(等)权重词典的情感分析</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.sentiment_by_valence(text, diction, lang='chinese')</code></td>
<td>带权重的词典的情感分析</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.word_in_context(text, keywords, window=3, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>在 text 中查找 keywords 出现的上下文内容(窗口 window)，返回 df</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.epu()</em></strong></td>
<td>使用新闻文本数据计算经济政策不确定性 EPU，返回 df</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.fepu(text, ep_pattern='', u_pattern='')</em></strong></td>
<td>使用 md&amp;a 文本数据计算企业不确定性感知 FEPU</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.semantic_brand_score(text, brands, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>衡量品牌（个体、公司、品牌、关键词等）的重要性</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.cosine_sim(text1, text2, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>余弦相似度</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.jaccard_sim(text1, text2, lang='chinese')</code></td>
<td>Jaccard 相似度</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.minedit_sim(text1, text2, lang='chinese')</code></td>
<td>最小编辑距离</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.word_hhi(text)</code></td>
<td>文本的赫芬达尔-赫希曼指数</td>
</tr>
<tr>
<td><strong><em>plot</em></strong></td>
<td><code>ct.matplotlib_chinese()</code></td>
<td>支持 matplotlib 中文绘图</td>
</tr>
<tr>
<td><strong>plot</strong></td>
<td><code>ct.lexical_dispersion_plot1(text, targets_dict, lang, title, figsize)</code></td>
<td>对某一个文本 text， 可视化不同目标类别词 targets_dict 在文本中出现位置</td>
</tr>
<tr>
<td><strong>plot</strong></td>
<td><code>ct.lexical_dispersion_plot2(texts_dict, targets, lang, title, figsize)</code></td>
<td>对某几个文本 texts_dict， 可视化某些目标词 targets 在文本中出现相对位置(0~100)</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.generate_concept_axis(wv, poswords, negwords)</code></td>
<td>生成概念轴向量。</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><strong><em>tm = ct.Text2Mind(wv)</em></strong><br></td>
<td>单个 word2vec 内挖掘潜在的态度偏见、刻板印象等。tm 含多种方法</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>sematic_projection(wv, words, poswords, negwords, return_full=False, cosine=False)</code></td>
<td>测量语义投影</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.project_word(wv, a, b, cosine=False)</code></td>
<td>计算词语 a 在词语 b 上的投影</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.project_text(wv, text, axis, lang='chinese', cosine=False)</code></td>
<td>计算词语文本text在概念轴向量axis上的投影值</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.wepa(wv, text, poswords, negwords, lang='chinese')</code></td>
<td>计算文本在概念轴上的投影得分，返回wepa得分</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.sematic_distance(wv, words1, words2)</code></td>
<td>测量语义距离</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.divergent_association_task(wv, words)</code></td>
<td>测量发散思维(创造力)</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.discursive_diversity_score(wv, words)</code></td>
<td>测量语言差异性(认知差异性)</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><strong>ct.procrustes_align(base_wv, other_wv)</strong></td>
<td>两个 word2vec 进行语义对齐，可反映随时间的社会语义变迁</td>
</tr>
<tr>
<td><strong><em>LLM</em></strong></td>
<td><strong>ct.llm(text, prompt, output_format, task, backend, base_url, api_key, model_name, temperature)</strong></td>
<td>调用大模型执行结构化文本分析任务（如情感分析、关键词提取、分类等）。</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="quickstart">QuickStart</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;当前cntext版本: &#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">help</span><span class="p">(</span><span class="n">ct</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="nx">当前cntext版本</span><span class="p">:</span> <span class="mf">2.1.7</span>

<span class="nx">Help</span> <span class="nx">on</span> <span class="kn">package</span> <span class="nx">cntext</span><span class="p">:</span>

<span class="nx">NAME</span>
    <span class="nx">cntext</span>

<span class="nx">PACKAGE</span> <span class="nx">CONTENTS</span>
    <span class="nx">io</span>
    <span class="nx">mind</span>
    <span class="nx">model</span>
    <span class="nx">stats</span>
    <span class="nx">llm</span>
<span class="o">...</span>
</code></pre></div><br>
<br>
<h2 id="一io-模块">一、IO 模块</h2>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.get_dict_list()</em></strong></td>
<td>查看 cntext 内置词典</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_yaml_dict(yfile)</em></strong></td>
<td>读取内置 yaml 词典</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.detect_encoding(file, num_lines=100)</code></td>
<td>诊断 txt、csv 编码格式</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.get_files(fformat)</code></td>
<td>查看符合 fformat 路径规则的所有的文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_pdf(file)</em></strong></td>
<td>读取 PDF 文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_file(file, encoding)</em></strong></td>
<td>读取文件</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.read_files(fformat, encoding)</em></strong></td>
<td>读取符合 fformat 路径规则的所有的文件，返回 df</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.extract_mda(text, kws_pattern)</em></strong></td>
<td>提取 A 股年报中的 MD&amp;A 文本内容。如果返回'',则提取失败。</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.traditional2simple(text)</em></strong></td>
<td>繁体转简体</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><strong><em>ct.fix_text(text)</em></strong></td>
<td>将不正常的、混乱编码的文本转化为正常的文本。例如全角转半角</td>
</tr>
<tr>
<td><strong>io</strong></td>
<td><code>ct.fix_contractions(text)</code></td>
<td>英文缩写(含俚语表达)处理， 如 you&rsquo;re -&gt; you are</td>
</tr>
</tbody>
</table>
<h3 id="11-get_dict_list">1.1 get_dict_list()</h3>
<p>查看 cntext 内置词典</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">ct</span><span class="o">.</span><span class="n">get_dict_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;zh_common_NTUSD.yaml&#39;,
 &#39;zh_common_DUTIR.yaml&#39;,
 &#39;enzh_common_StopWords.yaml&#39;,
 &#39;en_valence_Concreteness.yaml&#39;,
 &#39;en_common_LoughranMcDonald.yaml&#39;,
 &#39;zh_common_FinanceSenti.yaml&#39;,
 &#39;zh_common_FLS.yaml&#39;,
 &#39;zh_common_TsinghuaPraiseDegrade.yaml&#39;,
 &#39;zh_common_FEPU.yaml&#39;,
 &#39;en_common_ANEW.yaml&#39;,
 &#39;en_common_NRC.yaml&#39;,
 &#39;zh_valence_ChineseEmoBank.yaml&#39;,
 &#39;zh_valence_SixSemanticDimensionDatabase.yaml&#39;,
 &#39;zh_common_FinacialFormalUnformal.yaml&#39;,
 &#39;zh_common_LoughranMcDonald.yaml&#39;,
 &#39;enzh_common_AdvConj.yaml&#39;,
 &#39;en_common_SentiWS.yaml&#39;,
 &#39;zh_common_Digitalization.yaml&#39;,
 &#39;en_common_LSD2015.yaml&#39;,
 &#39;zh_common_HowNet.yaml&#39;,
 &#39;zh_common_EPU.yaml&#39;]
</code></pre></div><h3 id="12-内置-yaml-词典">1.2 内置 yaml 词典</h3>
<table>
<thead>
<tr>
<th>yaml 文件</th>
<th>词典</th>
<th>语言</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>zh_valence_ChineseEmoBank.yaml</em></strong></td>
<td>中文情感词典，含<code>效价valence</code>和<code>唤醒度arousal</code>。在 cntext 中，我们只使用了 CVAW 词表(单词)，其他词典如 CVAP, CVAS, CVAT 没有纳入其中。</td>
<td>Chinese</td>
<td><code>效价valence</code>和<code>唤醒度arousal</code></td>
</tr>
<tr>
<td><strong><em>zh_common_DUTIR.yaml</em></strong></td>
<td>大连理工大学情感本体库</td>
<td>中文</td>
<td>七大类情绪，<code>哀, 好, 惊, 惧, 乐, 怒, 恶</code></td>
</tr>
<tr>
<td><strong><em>zh_common_HowNet.yaml</em></strong></td>
<td>知网 Hownet 词典</td>
<td>中文</td>
<td>正面词、负面词</td>
</tr>
<tr>
<td><code>en_common_SentiWS.yaml</code></td>
<td>SentimentWortschatz (SentiWS)</td>
<td>德文</td>
<td>正面词、负面词；<br></td>
</tr>
<tr>
<td><strong><em>zh_common_FinacialFormalUnformal.yaml</em></strong></td>
<td>金融领域正式、非正式；积极消极</td>
<td>中文</td>
<td>formal-pos、<br>formal-neg；<br>unformal-pos、<br>unformal-neg</td>
</tr>
<tr>
<td><code>en_common_ANEW.yaml</code></td>
<td>英语单词的情感规范 Affective Norms for English Words (ANEW)</td>
<td>英文</td>
<td>pleasure, arousal, dominance</td>
</tr>
<tr>
<td><code>en_common_LSD2015.yaml</code></td>
<td>Lexicoder Sentiment Dictionary (2015)</td>
<td>英文</td>
<td>正面词、负面词</td>
</tr>
<tr>
<td><code>en_common_NRC.yaml</code></td>
<td>NRC Word-Emotion Association Lexicon</td>
<td>英文</td>
<td>细粒度情绪词；</td>
</tr>
<tr>
<td><strong><em>zh_valence_SixSemanticDimensionDatabase.yaml</em></strong></td>
<td><a href="https://textdata.cn/blog/2023-03-20-nature-six-semantic-dimension-database/"><strong>通用中英文六维语义情感词典</strong></a>, 含 17940 个中文词的六维度词库， 且每个维度有权重。</td>
<td>中文</td>
<td>vision、socialness、emotion、time、space、motor</td>
</tr>
<tr>
<td><code>enzh_common_AdvConj.yaml</code></td>
<td>副词连词</td>
<td>中、英</td>
<td></td>
</tr>
<tr>
<td><strong><em>enzh_common_StopWords.yaml</em></strong></td>
<td>中英文停用词</td>
<td>中、英</td>
<td>停用词</td>
</tr>
<tr>
<td><strong><em>en_valence_Concreteness.yaml</em></strong></td>
<td><a href="https://textdata.cn/blog/jcr_concreteness_computation/">英文具体性词典</a></td>
<td>English</td>
<td>word &amp; concreateness score</td>
</tr>
<tr>
<td><strong><em>zh_common_LoughranMcDonald.yaml</em></strong></td>
<td>中文 LoughranMcDonald 词典</td>
<td>中文</td>
<td>正面、负面词</td>
</tr>
<tr>
<td><strong><em>zh_common_Digitalization.yaml</em></strong></td>
<td><a href="https://textdata.cn/blog/2022-11-03-mda-measure-digitalization/">管理世界|吴非(2021)数字化词典</a></td>
<td>中文</td>
<td>含人工智能技术、大数据技术、云计算技术、区块链技术、数字技术应用等关键词列表。</td>
</tr>
<tr>
<td><strong><em>en_common_LoughranMcDonald.yaml</em></strong></td>
<td>英文 LoughranMcDonald 词典</td>
<td>英文</td>
<td>金融 LM 情绪词典 2018 年版本，含七个词表，分别是 Negative, Positive, Uncertainty, Litigious, StrongModal, WeakModal, Constraining</td>
</tr>
<tr>
<td><strong><em>zh_common_FLS.yaml</em></strong></td>
<td><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/"><strong>业绩说明会前瞻性词典集</strong></a></td>
<td>中文</td>
<td>含 174 个词语</td>
</tr>
<tr>
<td><strong><em>zh_common_RhetoricalNationalism.yaml</em></strong></td>
<td>修辞民族主义</td>
<td>中文</td>
<td>含四个维度，民族自豪感、民族复兴、企业角色、排外主义，每个维度 100 个词。</td>
</tr>
</tbody>
</table>
<br>
<h3 id="13-read_dict_yaml">1.3 read_yaml_dict()</h3>
<p>使用 cntext 读取 <strong><em>.yaml</em></strong> 词典文件； 返回的信息包括</p>
<ul>
<li>Name 词典的名字</li>
<li>Desc 词典的含义、概念解释</li>
<li>Refer 词典文献出处</li>
<li>Category 词典 Dictionary 的关键词</li>
<li>Dictionary 词典, python 字典格式</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_Digitalization.yaml&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;Name&#39;: &#39;中文数字化词典&#39;,
&#39;Desc&#39;: &#39;基于这篇论文，构建了中文数字化词典，含人工智能技术、大数据技术、云计算技术、区块链技术、数字技术应用等关键词列表。 &#39;, &#39;Refer&#39;: &#39;吴非,胡慧芷,林慧妍,任晓怡. 企业数字化转型与资本市场表现——来自股票流动性的经验证据[J]. 管理世界,2021,37(07):130-144+10.&#39;,
&#39;Category&#39;: [&#39;Artificial_Intelligence&#39;, &#39;Big_Data&#39;, &#39;Cloud_Computing&#39;, &#39;Block_Chains&#39;, &#39;Usage_of_Digitalization&#39;],

&#39;Dictionary&#39;:
    {&#39;Artificial_Intelligence&#39;: [&#39;人工智能&#39;, &#39;商业智能&#39;, &#39;图像理解&#39;, &#39;投资决策辅助系统&#39;, &#39;智能数据分析&#39;, &#39;智能机器人&#39;, &#39;机器学习&#39;, &#39;深度学习&#39;, &#39;语义搜索&#39;, &#39;生物识别技术&#39;, &#39;人脸识别&#39;, &#39;语音识别&#39;, &#39;身份验证&#39;, &#39;自动驾驶&#39;, &#39;自然语言处理&#39;],
    &#39;Big_Data&#39;: [&#39;大数据&#39;, &#39;数据挖掘&#39;, &#39;文本挖掘&#39;, &#39;数据可视化&#39;, &#39;异构数据&#39;, &#39;征信&#39;, &#39;增强现实&#39;, &#39;混合现实&#39;, &#39;虚拟现实&#39;],
    &#39;Cloud_Computing&#39;: [&#39;云计算&#39;, &#39;流计算&#39;, &#39;图计算&#39;, &#39;内存计算&#39;, &#39;多方安全计算&#39;, &#39;类脑计算&#39;, &#39;绿色计算&#39;, &#39;认知计算&#39;, &#39;融合架构&#39;, &#39;亿级并发&#39;, &#39;EB级存储&#39;, &#39;物联网&#39;, &#39;信息物理系统&#39;],
    &#39;Block_Chains&#39;: [&#39;区块链&#39;, &#39;数字货币&#39;, &#39;分布式计算&#39;, &#39;差分隐私技术&#39;, &#39;智能金融合约&#39;],
    &#39;Usage_of_Digitalization&#39;: [&#39;移动互联网&#39;, &#39;工业互联网&#39;, &#39;移动互联&#39;, &#39;互联网医疗&#39;, &#39;电子商务&#39;, &#39;移动支付&#39;, &#39;第三方支付&#39;, &#39;NFC支付&#39;, &#39;智能能源&#39;, &#39;B2B&#39;, &#39;B2C&#39;, &#39;C2B&#39;, &#39;C2C&#39;, &#39;O2O&#39;, &#39;网联&#39;, &#39;智能穿戴&#39;, &#39;智慧农业&#39;, &#39;智能交通&#39;, &#39;智能医疗&#39;, &#39;智能客服&#39;, &#39;智能家居&#39;, &#39;智能投顾&#39;, &#39;智能文旅&#39;, &#39;智能环保&#39;, &#39;智能电网&#39;, &#39;智能营销&#39;, &#39;数字营销&#39;, &#39;无人零售&#39;, &#39;互联网金融&#39;, &#39;数字金融&#39;, &#39;Fintech&#39;, &#39;金融科技&#39;, &#39;量化金融&#39;, &#39;开放银行&#39;]}}
</code></pre></div><br>
<h3 id="14-detect_encoding">1.4 detect_encoding()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.detect_encoding(file, num_lines=100)
</code></pre></div><p>通过读取前 num_lines 来识别 txt/csv 文件的编码格式</p>
<ul>
<li><strong><em>file</em></strong> 文件路径</li>
<li><strong><em>num_lines</em></strong> 用于检测编码的行数，默认 100</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#读取data文件夹下的【三体.txt】</span>
<span class="c1">#识别编码方式</span>
<span class="n">ct</span><span class="o">.</span><span class="n">detect_encoding</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;data/三体.txt&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">utf-8
</code></pre></div><br>
<h3 id="15-get_filesfformat">1.5 get_files(fformat)</h3>
<p>查看符合 fformat 路径规则的所有文件。</p>
<ul>
<li><strong>fformat</strong> 路径匹配规则， 支持 txt/pdf/docx/xlsx/csv 等格式， <code>*</code> 表示通配符</li>
</ul>
<table>
<thead>
<tr>
<th>fformat 格式</th>
<th>识别的文件</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>*.txt</code></td>
<td>匹配当前代码所在路径内的所有 txt</td>
</tr>
<tr>
<td><code>*.pdf</code></td>
<td>匹配当前代码所在路径内的所有 pdf</td>
</tr>
<tr>
<td><code>data/*.txt</code></td>
<td>匹配「文件夹 data」内所有的 txt</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#查看【文件夹data】内所有的 txt文件。</span>
<span class="n">ct</span><span class="o">.</span><span class="n">get_files</span><span class="p">(</span><span class="n">fformat</span><span class="o">=</span><span class="s1">&#39;data/*.txt&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;data/三体.txt&#39;,
 &#39;data/santi.txt&#39;,
 &#39;data/w2v_corpus.txt&#39;,
 &#39;data/sopmi_corpus.txt&#39;,
 &#39;data/brown_corpus.txt&#39;,
 &#39;data/sopmi_seed_words.txt&#39;]
</code></pre></div><br>
<h3 id="16-read_pdf">1.6 read_pdf</h3>
<p>读取 PDF，返回文本内容</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>file</em></strong> PDF 文件路径</li>
</ul>
<p>点击 <a href="https://textdata.cn/data/%E6%A0%BC%E5%8A%9B%E7%94%B5%E5%99%A82023.pdf"><strong>格力电器 2023.pdf</strong></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;格力电器2023.pdf&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">珠海格力电器股份有限公司 2023年年度报告全文
珠海格力电器股份有限公司
2023年年度报告


二〇二四年四月
珠海格力电器股份有限公司 2023年年度报告全文
 第 2 页 共 249 页 第一节 重要提示、目录和释义
公司董事会、监事会及董事、监事、高级管理人员保证年度报告内容
的真实、准确、完整，不存在虚假记载、误导性陈述或重大遗漏，并承担
个别和连带的法律
......
</code></pre></div><br>
<h3 id="17-read_docx">1.7 read_docx</h3>
<p>读取 docx，返回文本内容</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">read_docx</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>file</em></strong> docx 文件路径</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_docx</span><span class="p">(</span><span class="s1">&#39;test.docx&#39;</span><span class="p">)</span>
<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">这是来自test.docx里内容
</code></pre></div><br>
<h3 id="18-read_file">1.8 read_file()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.read_file(file, encoding=&#39;utf-8&#39;)
</code></pre></div><ul>
<li><strong>file</strong> 待读取的文件路径； 支持 txt、pdf、docx、xlsx、xls， 返回 DataFrame(含 doc 和 file 两个字段)。</li>
<li><strong>encoding</strong> 待读取文件的编码方式</li>
</ul>
<p>以 <code>data/三体.txt</code> 为例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#默认encoding=&#39;utf-8&#39;</span>
<span class="c1">#sdf = ct.read_file(file=&#39;data/三体.txt&#39;)</span>

<span class="n">sdf</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_file</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;data/三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
<span class="n">sdf</span>
</code></pre></div><p><img loading="lazy" src="img/01-san_ti_df.png" alt=""  />
</p>
<br>
<h3 id="19-read_files">1.9 read_files()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.read_files(fformat, encoding=&#39;utf-8&#39;)
</code></pre></div><p>批量读取符合 fformat 格式的所有文件数据，返回 DataFrame(含 doc 和 file 两个字段)。</p>
<p>读取文件夹 data 里所有 txt 文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#默认encoding=&#39;utf-8&#39;</span>
<span class="c1">#ddf = ct.read_files(fformat=&#39;data/*.txt&#39;)</span>

<span class="n">ddf</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_files</span><span class="p">(</span><span class="n">fformat</span><span class="o">=</span><span class="s1">&#39;data/*.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
<span class="n">ddf</span>
</code></pre></div><p><img loading="lazy" src="img/02-ddf.png" alt=""  />
</p>
<br>
<h3 id="110-extract_mda">1.10 extract_mda</h3>
<p>提取 A 股年报中的 MD&amp;A（管理层讨论与分析）文本内容。如果返回空字符串 ''，则说明提取失败。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.extract_mda(text, kws_pattern=&#39;&#39;)
</code></pre></div><ul>
<li>text 中国 A 股年报原始文本</li>
<li>kws_pattern 管理层讨论与分析章节识别关键词的模板。cntext 内置的 kws_pattern 内容如下</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">kws_pattern = &#39;董事会报告|董事会报告与管理讨论|企业运营与管理评述|经营总结与分析|管理层评估与未来展望|董事局报告|管理层讨论与分析|经营情况讨论与分析|经营业绩分析|业务回顾与展望|公司经营分析|管理层评论与分析|执行摘要与业务回顾|业务运营分析&#39;
</code></pre></div><br>
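<p>kws_pattern 本质是一个正则表达式的多分支模式。下面用标准库 re 做一个最小示意，仅展示这类模式如何定位章节标题（并非 cntext 的源码实现，示例文本为虚构）：</p>

```python
import re

# cntext 内置模式的节选
kws_pattern = '管理层讨论与分析|经营情况讨论与分析|董事会报告'

# 虚构的年报片段
text = '第三节 管理层讨论与分析 一、报告期内公司所处行业情况......'

# 匹配到的第一个章节标题之后的文本即候选 MD&A 内容
match = re.search(kws_pattern, text)
if match:
    print(text[match.start():])
```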
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;格力电器2023.pdf&#39;</span><span class="p">)</span>
<span class="n">mda_text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">extract_mda</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mda_text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;管理层讨论与分析  \n一、报告期内公司所处行业情况  \n（一）行业发展现状  \n1.消费领域 ——家电行业稳定增长，空调市场恢复明显  \n2023年，中国经济保持了整体恢复向好的态势，激发消费是稳增长的重中之重。国家鼓励和推动消费品以旧换\n新，促进消费经济大循环，加速更新需求释放，推动高能效产品设备销售和出口增长，进一步激发绿色消费潜力。  \n1）家电行业稳定增长  \n2023年，国内经济恢复明显，家电行业稳定增长。根据全国家用电器工业信息中心发布的《 2023年中国家电\n行业年度报告》，家电行业外销明显增长，出口规模为 6,174亿元，同比增长 9.9%；国内市场实现稳步增长，销售\n规模为7&#39;
.......
.......
</code></pre></div><br>
<p>以<a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">2001 年~2023 会计年度报告数据集</a>为例， 查看 <strong><em>extract_mda</em></strong> 的抽取 mda 的能力。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;extract_mda识别能力&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2001</span><span class="p">,</span> <span class="mi">2024</span><span class="p">):</span>
    <span class="n">num</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;年报txt/</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">/*.txt&#39;</span><span class="p">):</span>
        <span class="n">mda_text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">extract_mda</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
        <span class="k">if</span> <span class="n">mda_text</span><span class="o">!=</span><span class="s1">&#39;&#39;</span><span class="p">:</span>
            <span class="n">num</span> <span class="o">=</span> <span class="n">num</span> <span class="o">+</span> <span class="mi">1</span>

    <span class="n">volume</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;年报txt/</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">/*.txt&#39;</span><span class="p">))</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">num</span><span class="o">/</span><span class="n">volume</span>

    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="n">ratio</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2001: 0.24
2002: 0.37
2003: 0.43
2004: 0.70
2005: 0.77
2006: 0.78
2007: 0.79
2008: 0.77
2009: 0.79
2010: 0.82
2011: 0.84
2012: 0.96
2013: 0.95
2014: 0.98
2015: 0.98
2016: 0.99
2017: 0.98
2018: 0.98
2019: 0.99
2020: 0.97
2021: 0.98
2022: 0.99
2023: 0.99
</code></pre></div><p>建议各位用最近 10 年的年报数据，通过 extract_mda 提取 MD&amp;A 文本，或者直接购买数据集《2001-2023 年 A 股上市公司年报&amp;管理层讨论与分析》。</p>
<br>
<h3 id="111-traditional2simple">1.11 traditional2simple()</h3>
<p>繁体转简体</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.traditional2simple(text, mode=&#39;t2s&#39;)
</code></pre></div><ul>
<li><strong><em>text</em></strong> 待转换的文本</li>
<li><strong><em>mode</em></strong> 转换模式，默认 mode=&lsquo;t2s&rsquo; 繁转简；mode 还支持 s2t（简转繁）</li>
</ul>
 <br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;簡體漢字&#39;</span>
<span class="n">ct</span><span class="o">.</span><span class="n">traditional2simple</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;简体汉字&#39;
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;简体汉字&#39;</span>
<span class="n">ct</span><span class="o">.</span><span class="n">traditional2simple</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;s2t&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;簡體漢字&#39;
</code></pre></div><br>
<h3 id="112-fix_text">1.12 fix_text()</h3>
<p>将不正常的、编码混乱的文本修复为正常文本，例如全角字符转半角字符。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">raw_text</span> <span class="o">=</span> <span class="s1">&#39;今日起可中遇到技术问题，可以拨打电话０３７１－６６３２１９９１、６６３２１９７３咨询。&#39;</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">fix_text</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">今日起可中遇到技术问题，可以拨打电话0371-66321991、66321973咨询。
</code></pre></div><br>
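<p>fix_text 的全角转半角效果，思路上与 Unicode 的 NFKC 兼容归一化相近。下面用标准库 unicodedata 做一个最小示意（并非 fix_text 的实现）：</p>

```python
import unicodedata

raw = '拨打电话０３７１－６６３２１９９１咨询'

# NFKC 兼容归一化会把全角数字、全角连字符折叠为对应的半角字符
clean = unicodedata.normalize('NFKC', raw)
print(clean)  # 拨打电话0371-66321991咨询
```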
<h3 id="113-fix_contractionstext">1.13 fix_contractions(text)</h3>
<p>将英文缩写(含俚语表达)转化为完整表达，例如</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- you&#39;re -&gt; you are
- yall  -&gt; you all
- gotta  -&gt; got to
...
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">raw_text</span> <span class="o">=</span> <span class="s2">&#34;yall&#39;re happy now&#34;</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">fix_contractions</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#34;you all are happy now&#34;
</code></pre></div><br>
<h3 id="114-clean_texttext">1.14 clean_text(text)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>text</em></strong> 待处理的文本</li>
<li><strong><em>lang</em></strong> 语言类型，默认 lang=&lsquo;chinese&rsquo;；支持 &lsquo;english&rsquo;、&lsquo;chinese&rsquo;</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">chinese_text</span> <span class="o">=</span> <span class="p">(</span><span class="s2">&#34;今天的训练很棒！跑了5.6公里，心率稳定。&#34;</span>
                <span class="s2">&#34;查看 https://example.com/data 😊 #健身打卡&#34;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;&gt;&gt;&gt; 中文清洗&#34;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;原始:&#34;</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">chinese_text</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;清洗:&#34;</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">chinese_text</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s2">&#34;chinese&#34;</span><span class="p">)))</span>
<span class="nb">print</span><span class="p">()</span>

<span class="c1"># 英文测试</span>
<span class="n">english_text</span> <span class="o">=</span> <span class="p">(</span><span class="s2">&#34;Great workout today! Ran 5.6 miles, HR stable. &#34;</span>
                <span class="s2">&#34;Check https://example.com/data 😊 #Fitness&#34;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;&gt;&gt;&gt; 英文清洗&#34;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;原始:&#34;</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">english_text</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;清洗:&#34;</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">english_text</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s2">&#34;english&#34;</span><span class="p">)))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; 中文清洗
原始: &#39;今天的训练很棒！跑了5.6公里，心率稳定。查看 https://example.com/data 😊 #健身打卡&#39;
清洗: &#39;今天的训练很棒！跑了数字公里，心率稳定。查看   健身打卡&#39;

&gt;&gt;&gt; 英文清洗
原始: &#39;Great workout today! Ran 5.6 miles, HR stable. Check https://example.com/data 😊 #Fitness&#39;
清洗: &#39;great workout today! ran NUMBER miles, hr stable. check  😊 #fitness&#39;
</code></pre></div><p><br><br></p>
<h2 id="二stats-模块">二、Stats 模块</h2>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.word_count(text, lang='chinese')</code></td>
<td>词频统计</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.readability(text, lang='chinese')</code></td>
<td>文本可读性</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.sentiment(text, diction, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>无(等)权重词典的情感分析</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.sentiment_by_valence(text, diction, lang='chinese')</code></td>
<td>带权重的词典的情感分析</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.word_in_context(text, keywords, window=3, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>在 text 中查找 keywords 出现的上下文内容(窗口 window)，返回 df</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.epu(text, e_pattern, p_pattern, u_pattern)</em></strong></td>
<td>使用新闻文本数据计算经济政策不确定性 EPU，返回 df</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.fepu(text, ep_pattern='', u_pattern='')</em></strong></td>
<td>使用 md&amp;a 文本数据计算企业不确定性感知 FEPU</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.semantic_brand_score(text, brands, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>衡量品牌（个体、公司、品牌、关键词等）的重要性</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><strong><em>ct.cosine_sim(text1, text2, lang=&lsquo;chinese&rsquo;)</em></strong></td>
<td>余弦相似度</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.jaccard_sim(text1, text2, lang='chinese')</code></td>
<td>Jaccard 相似度</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.minedit_sim(text1, text2, lang='chinese')</code></td>
<td>最小编辑距离</td>
</tr>
<tr>
<td><strong>stats</strong></td>
<td><code>ct.word_hhi(text)</code></td>
<td>文本的赫芬达尔-赫希曼指数</td>
</tr>
</tbody>
</table>
<br>
<h3 id="21-word_count">2.1 word_count()</h3>
<p>统计词频，返回 Counter(类似于 Python 字典)；支持中英文。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.word_count(text, lang=&#39;chinese&#39;, return_df=False)
</code></pre></div><ul>
<li><strong>text</strong> 待分析的文本字符串</li>
<li><strong>lang</strong> 文本的语言类型， 中文 chinese、英文 english，默认中文。</li>
<li><strong>return_df</strong> 返回结果是否为 dataframe，默认 False</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;致力于致力于以零文章处理费或订阅费发布优质研究软件。&#39;</span>

<span class="c1">#ct.word_count(text, lang=&#39;chinese&#39;)</span>
<span class="n">ct</span><span class="o">.</span><span class="n">word_count</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Counter({&#39;致力于&#39;: 2,
         &#39;文章&#39;: 1,
         &#39;处理费&#39;: 1,
         &#39;订阅费&#39;: 1,
         &#39;发布&#39;: 1,
         &#39;优质&#39;: 1,
         &#39;研究&#39;: 1,
         &#39;软件&#39;: 1})
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">word_count</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">return_df</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/09-term_freq.png" alt=""  />
</p>
<br>
<h3 id="22-readability">2.2 readability()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.readability(text, lang=&#39;chinese&#39;, syllables=3, return_series=False)
</code></pre></div><p>计算文本可读性的常见指标，含 Gunning Fog Index、Flesch-Kincaid Grade Level、SMOG Index、Coleman-Liau Index、Automated Readability Index(ARI)、Readability Index(RIX)；指标越大，文本越复杂，可读性越差。</p>
<ul>
<li><strong>text</strong> 待分析的文本字符串</li>
<li><strong>lang</strong> 文本的语言类型， 中文 chinese、英文 english，默认中文。</li>
<li><strong>syllables</strong> 音节数(中文为汉字数)大于等于 syllables 的词视为复杂词，默认值为 3</li>
<li><strong>return_series</strong>: 计算结果是否输出为 pd.Series 类型，默认为 False</li>
</ul>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Gunning Fog Index = 0.4 * (Total_Words/Total_Sentences + 100 * Complex_Words/Total_Words)
SMOG Index = 1.0430 * sqrt(30 * Complex_Words/Total_Sentences) + 3.1291
Coleman-Liau Index = 0.0588 * (100*Total_Letters/Total_Words) -0.296*(100*Total_Sentences/Total_Words) - 15.8
Automated Readability Index(ARI) = 4.71 * (Total_Characters/Total_Words) + 0.5*(Total_Words/Total_Sentences) - 21.43
Readability Index(RIX) = Complex_Words * (6 + Total_characters) / Total_Sentences
</code></pre></div><br>
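<p>以上面的 Gunning Fog 公式为例，可以用几行 Python 演示其计算过程（词数、句数、复杂词数为假设值，仅示意公式本身，与 ct.readability 的内部实现无关）：</p>

```python
def gunning_fog(total_words, total_sentences, complex_words):
    # Gunning Fog Index = 0.4 * (词数/句数 + 100 * 复杂词数/词数)
    return 0.4 * (total_words / total_sentences
                  + 100 * complex_words / total_words)

# 假设一段文本分词后共 20 个词、2 个句子、4 个复杂词(音节数>=3)
print(round(gunning_fog(20, 2, 4), 2))  # 12.0
```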
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;致力于以零文章处理费或订阅费发布优质研究软件。&#39;</span>

<span class="n">ct</span><span class="o">.</span><span class="n">readability</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">syllables</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;fog_index&#39;: 120.4,
 &#39;flesch_kincaid_grade_level&#39;: 20.2,
 &#39;smog_index&#39;: 57.32,
 &#39;coleman_liau_index&#39;: 83.96,
 &#39;ari&#39;: 87.4,
 &#39;rix&#39;: 87.0}
</code></pre></div><br>
<h3 id="23-sentimenttext-diction-lang">2.3 sentiment(text, diction, lang)</h3>
<p>常见的情感分析默认情绪词无(等)权重，通过统计各类情绪词的个数来反映情感信息。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">sentiment(text, diction, lang=&#39;chinese&#39;, return_series=False)
</code></pre></div><ul>
<li><strong>text</strong> 待分析的文本字符串</li>
<li><strong>diction</strong> 格式为 Python 字典类型。形如下面的案例</li>
<li><strong>lang</strong> 文本的语言类型， 中文 chinese、英文 english，默认中文。</li>
<li><strong>return_series</strong> 计算结果是否输出为 pd.Series 类型，默认为 False</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">diction</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;pos&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;高兴&#39;</span><span class="p">,</span> <span class="s1">&#39;快乐&#39;</span><span class="p">,</span> <span class="s1">&#39;分享&#39;</span><span class="p">],</span>
           <span class="s1">&#39;neg&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;难过&#39;</span><span class="p">,</span> <span class="s1">&#39;悲伤&#39;</span><span class="p">],</span>
           <span class="s1">&#39;adv&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;很&#39;</span><span class="p">,</span> <span class="s1">&#39;特别&#39;</span><span class="p">]}</span>

<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;我今天得奖了，很高兴，我要将快乐分享大家。&#39;</span>
<span class="n">ct</span><span class="o">.</span><span class="n">sentiment</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
             <span class="n">diction</span><span class="o">=</span><span class="n">diction</span><span class="p">,</span>
             <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;pos_num&#39;: 3,
 &#39;neg_num&#39;: 0,
 &#39;adv_num&#39;: 1,
 &#39;stopword_num&#39;: 8,
 &#39;word_num&#39;: 14,
 &#39;sentence_num&#39;: 1}
</code></pre></div><br>
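<p>等权重词典法的机制可以用纯 Python 勾勒：对分词结果逐词查词典、按类别计数。下面是一个最小示意（分词结果为手工给定，并非 ct.sentiment 的源码）：</p>

```python
diction = {'pos': ['高兴', '快乐', '分享'],
           'neg': ['难过', '悲伤'],
           'adv': ['很', '特别']}

# 假设已完成分词(真实流程中中文需用 jieba 等分词器)
words = ['我', '今天', '得奖', '了', '很', '高兴', '我', '要', '将',
         '快乐', '分享', '大家']

# 对每个类别, 统计分词结果中命中该类词表的词语个数
result = {cat + '_num': sum(w in wordlist for w in words)
          for cat, wordlist in diction.items()}
print(result)  # {'pos_num': 3, 'neg_num': 0, 'adv_num': 1}
```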
<h3 id="24-sentiment_by_valence">2.4 sentiment_by_valence()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.sentiment_by_valence(text, diction, lang=&#39;chinese&#39;, return_series=False)
</code></pre></div><ul>
<li><strong>text</strong> 待分析的文本字符串</li>
<li><strong>diction</strong> 格式为 Python 字典类型。形如下面的案例</li>
<li><strong>lang</strong> 文本的语言类型， 中文 chinese、英文 english，默认中文。</li>
<li><strong>return_series</strong> 计算结果是否输出为 pd.Series 类型，默认为 False</li>
</ul>
<p>常见的情感分析是无(等)权重, 但实际上不同的词语所携带的情感信息的强度差异是很大的。据此学者们开发出很多带权重的词典，例如</p>
<ul>
<li>英文具体性词典 en_valence_Concreteness.yaml， 词典中每个词都有一个 concreteness 值</li>
<li>中文六维度语义词典 zh_valence_SixSemanticDimensionDatabase.yaml, 每个中文词有六个值。</li>
</ul>
<p>以具体性为例，<strong>语言具体性 Concreteness</strong> 描述了一个词在多大程度上指称一个实际的、有形的或“真实的”实体，即以更具体、更熟悉、更容易被眼睛或心灵感知的方式描述对象和行为（即可想象或生动；Brysbaert, Warriner, and Kuperman 2014; Semin and Fiedler 1988）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">concreteness_dict</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;en_valence_Concreteness.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span>
<span class="n">concreteness_dict</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;roadsweeper&#39;: {&#39;concreteness&#39;: 4.85},
 &#39;traindriver&#39;: {&#39;concreteness&#39;: 4.54},
 &#39;tush&#39;: {&#39;concreteness&#39;: 4.45},
 &#39;hairdress&#39;: {&#39;concreteness&#39;: 3.93},
 &#39;pharmaceutics&#39;: {&#39;concreteness&#39;: 3.77},
 &#39;hoover&#39;: {&#39;concreteness&#39;: 3.76},
 &#39;shopkeeping&#39;: {&#39;concreteness&#39;: 3.18},
 &#39;pushiness&#39;: {&#39;concreteness&#39;: 2.48},
 ......
 }
</code></pre></div><p>可能 <strong><em>concreteness_dict</em></strong> 不够直观，整理转化一下后大致如下</p>
<p><img loading="lazy" src="img/11-concreteness_df.png" alt=""  />
</p>
<p><a href="https://textdata.cn/blog/jcr_concreteness_computation/"><strong>JCR2021 | 计算文本的语言具体性</strong></a> 文中提供了一个案例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">reply</span> <span class="o">=</span> <span class="s2">&#34;I&#39;ll go look for that&#34;</span>

<span class="n">score</span><span class="o">=</span><span class="n">ct</span><span class="o">.</span><span class="n">sentiment_by_valence</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">reply</span><span class="p">,</span>
                              <span class="n">diction</span><span class="o">=</span><span class="n">concreteness_dict</span><span class="p">,</span>
                              <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">)</span>

<span class="n">score</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;concreteness&#39;: 9.28,
&#39;word_num&#39;: 6}
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">employee_replys</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;I&#39;ll go look for that&#34;</span><span class="p">,</span>
                   <span class="s2">&#34;I&#39;ll go search for that&#34;</span><span class="p">,</span>
                   <span class="s2">&#34;I&#39;ll go search for that top&#34;</span><span class="p">,</span>
                   <span class="s2">&#34;I&#39;ll go search for that t-shirt&#34;</span><span class="p">,</span>
                   <span class="s2">&#34;I&#39;ll go look for that t-shirt in grey&#34;</span><span class="p">,</span>
                   <span class="s2">&#34;I&#39;ll go search for that t-shirt in grey&#34;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">reply</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">employee_replys</span><span class="p">):</span>
    <span class="n">score</span><span class="o">=</span><span class="n">ct</span><span class="o">.</span><span class="n">sentiment_by_valence</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">reply</span><span class="p">,</span>
                                  <span class="n">diction</span><span class="o">=</span><span class="n">concreteness_dict</span><span class="p">,</span>
                                  <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">)</span>

    <span class="n">template</span> <span class="o">=</span> <span class="s2">&#34;Concreteness Score: </span><span class="si">{score:.2f}</span><span class="s2"> | Example-</span><span class="si">{idx}</span><span class="s2">: </span><span class="si">{exmaple}</span><span class="s2">&#34;</span>

    <span class="nb">print</span><span class="p">(</span><span class="n">template</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="n">score</span><span class="p">[</span><span class="s1">&#39;concreteness&#39;</span><span class="p">],</span>
                          <span class="n">idx</span><span class="o">=</span><span class="n">idx</span><span class="p">,</span>
                          <span class="n">exmaple</span><span class="o">=</span><span class="n">reply</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Concreteness Score: 9.28 | Example-0: I&#39;ll go look for that
Concreteness Score: 9.32 | Example-1: I&#39;ll go search for that
Concreteness Score: 13.25 | Example-2: I&#39;ll go search for that top
Concreteness Score: 14.25 | Example-3: I&#39;ll go search for that t-shirt
Concreteness Score: 21.32 | Example-4: I&#39;ll go look for that t-shirt in grey
Concreteness Score: 21.36 | Example-5: I&#39;ll go search for that t-shirt in grey
</code></pre></div><br>
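<p>带权重词典的打分逻辑大致是：对命中词典的词，将其权重值累加。下面是一个最小示意（词典与权重值均为虚构，仅演示机制，并非 cntext 的实现）：</p>

```python
# 虚构的权重词典片段, 结构与 en_valence_Concreteness.yaml 一致
diction = {'shirt': {'concreteness': 5.0},
           'grey': {'concreteness': 4.0},
           'look': {'concreteness': 3.0}}

words = "i will go look for that shirt in grey".split()

# 将命中词典的词的 concreteness 权重累加, 并统计词数
score = sum(diction[w]['concreteness'] for w in words if w in diction)
print({'concreteness': score, 'word_num': len(words)})
# {'concreteness': 12.0, 'word_num': 9}
```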
<h3 id="25-word_in_context">2.5 word_in_context()</h3>
<p>&ldquo;You shall know a word by the company it keeps&rdquo;：通过一个单词所处的语境(上下文)，我们可以了解该单词的含义。</p>
<p>在 text 中查找 keywords 出现的上下文内容(窗口 window)，返回 df。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.word_in_context(text, keywords, window=3, lang=&#39;chinese&#39;)
</code></pre></div><ul>
<li><strong>text</strong> 待分析文本</li>
<li><strong>keywords</strong> 关键词列表</li>
<li><strong>window</strong> 关键词上下文窗口大小</li>
<li><strong>lang</strong> 文本的语言类型， 中文 chinese、英文 english，默认中文。</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#测试代码，假设zh_text是年报文本，从中找出丝网相关词的上下文</span>
<span class="n">zh_text</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span><span class="s2">【插入一条自家广告】大邓自己家的家，
</span><span class="s2">安平县多隆丝网制品，生产销售不锈钢轧花网、
</span><span class="s2">电焊网、石笼网、刀片刺绳、冲孔网等丝网制品。
</span><span class="s2">联系人 邓颖静 0318-7686899
</span><span class="s2">
</span><span class="s2">人生苦短，我学Python
</span><span class="s2">在社科中，可以用Python做文本分析
</span><span class="s2">Python是一门功能强大的编程语言，广泛应用在经管社科领域。
</span><span class="s2">可以做网络爬虫、文本分析、LDA话题模型、相似度分析等。
</span><span class="s2">
</span><span class="s2">今年经济不景气，形势异常严峻。
</span><span class="s2">由于疫情不景气，静默管理， 产品积压， 公司经营困难。
</span><span class="s2">保就业促就业，任务十分艰巨。
</span><span class="s2">&#34;&#34;&#34;</span>

<span class="c1">#【python】上下文</span>
<span class="n">ct</span><span class="o">.</span><span class="n">word_in_context</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">zh_text</span><span class="p">,</span>
                   <span class="n">keywords</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;python&#39;</span><span class="p">],</span>
                   <span class="n">window</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
                   <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/20-word-in-context.png" alt=""  />
</p>
<br>
<h3 id="26-epu">2.6 epu()</h3>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/"><strong>代码 | 使用新闻数据测量经济政策不确定性 EPU</strong></a></p>
<p><img loading="lazy" src="img/13-epu-plot.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">epu(df, freq=&#39;Y&#39;, e_pattern=&#39;&#39;, p_pattern=&#39;&#39;, u_pattern=&#39;&#39;)
</code></pre></div><ul>
<li><strong>df</strong> 新闻数据 DataFrame， 含 text 和 date 两个字段。 每一行代表一条新闻记录</li>
<li><strong>freq</strong> 字符串； 确定 EPU 指数的时间颗粒度； 如年 Y, 月 m, 日 d, 默认 freq=&lsquo;Y&rsquo;</li>
<li><strong>e_pattern</strong> 字符串；经济类词典，用<code>|</code>间隔词语，形如 <strong>e_pattern = ‘经济|金融’</strong></li>
<li><strong>p_pattern</strong> 字符串；政策词典，用<code>|</code>间隔词语，形如 <strong>p_pattern = ‘政策|治理|行政’</strong></li>
<li><strong>u_pattern</strong> 字符串；不确定性词典，用<code>|</code>间隔词语，形如 <strong>u_pattern = ‘风险|危机|难以预测’</strong></li>
</ul>
<p>准备如下图格式的数据 <strong><em>news_df</em></strong></p>
<p><img loading="lazy" src="img/12-news-df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#省略，读取数据得到 news_df</span>

<span class="n">epu_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">epu</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">news_df</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;m&#39;</span><span class="p">)</span>
<span class="n">epu_df</span>
</code></pre></div><p><img loading="lazy" src="img/13-epu-df.png" alt=""  />
</p>
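<p>EPU 指数构建的核心计数逻辑大致是：统计同时命中「经济、政策、不确定性」三类词的新闻在当期新闻中的占比。下面用 pandas 勾勒这一思路(示意实现，并非 ct.epu 源码，也未包含标准化步骤；字段名 text、date 沿用上文约定)：</p>

```python
import pandas as pd

# 构造一个极小的新闻样例：text 为新闻正文，date 为发布日期
news_df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-11']),
    'text': ['经济政策走向难以预测', '今日天气晴朗', '金融监管政策面临不确定性风险']
})

e_pattern = '经济|金融'
p_pattern = '政策|治理|行政'
u_pattern = '风险|危机|难以预测|不确定'

# 一条新闻同时包含 E、P、U 三类词，才计为 EPU 新闻
hit = (news_df['text'].str.contains(e_pattern)
       & news_df['text'].str.contains(p_pattern)
       & news_df['text'].str.contains(u_pattern))

# 按月聚合：EPU 新闻数 / 当月新闻总数
monthly = hit.groupby(news_df['date'].dt.to_period('M')).agg(['sum', 'count'])
monthly['ratio'] = monthly['sum'] / monthly['count']
print(monthly['ratio'])
```

实际研究中还需按 Baker 等人的做法对该占比做标准化处理，详见上面链接的文章。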
<br>
<h3 id="27-fepu">2.7 fepu()</h3>
<p><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">使用管理层讨论与分析文本数据测量「企业感知不确定性」(Subjective perception of economic policy uncertainty, FEPU)</a></p>
<p><img loading="lazy" src="img/16-fepu-plot.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.fepu(text, ep_pattern, u_pattern)
</code></pre></div><ul>
<li><strong><em>text</em></strong>：某时期 t 某企业 i 的管理层讨论与分析(MD&amp;A)文本</li>
<li><strong><em>ep_pattern</em></strong> 字符串；经济政策类词典，用<code>|</code>间隔词语，形如 <strong>ep_pattern = ‘经济|金融|政策|治理|行政’</strong></li>
<li><strong><em>u_pattern</em></strong> 字符串；不确定性词典，用<code>|</code>间隔词语，形如 <strong>u_pattern = ‘风险|危机|难以预测’</strong></li>
</ul>
<p>准备如下图格式的数据 <strong><em>mda_df</em></strong></p>
<p><img loading="lazy" src="img/14-mdadf.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#省略，读取数据得到 mda_df</span>

<span class="n">fepu_df</span> <span class="o">=</span> <span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">fepu</span><span class="p">)</span>
<span class="n">res_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">mda_df</span><span class="p">[[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]],</span> <span class="n">fepu_df</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">res_df</span>
</code></pre></div><p><img loading="lazy" src="img/15-fepu.png" alt=""  />
</p>
<br>
<br>
<h3 id="28-semantic_brand_score">2.8 semantic_brand_score()</h3>
<p><a href="https://textdata.cn/blog/2024-04-12-semantic-brand-score/">文献&amp;代码 | 使用 Python 计算语义品牌评分(Semantic Brand Score, SBS)</a> ， 通过 SBS 来衡量品牌（个体、公司、品牌、关键词等）的重要性。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.semantic_brand_score(text, brands, lang=&#39;chinese&#39;)
</code></pre></div><ul>
<li><strong><em>text</em></strong> 待分析文本</li>
<li><strong><em>brands</em></strong> 词语列表；</li>
<li><strong><em>lang</em></strong> 语言类型，&#39;chinese&#39; 或 &#39;english&#39;，默认 &#39;chinese&#39;</li>
</ul>
<p>以三体小说为例，通过测量品牌语义评分 SBS 来反映小说角色的重要性。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">brands</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;汪淼&#39;</span><span class="p">,</span> <span class="s1">&#39;史强&#39;</span><span class="p">,</span> <span class="s1">&#39;罗辑&#39;</span><span class="p">,</span> <span class="s1">&#39;叶文洁&#39;</span><span class="p">,</span> <span class="s1">&#39;伊文斯&#39;</span><span class="p">]</span>

<span class="c1">#准备santi_test_text</span>
<span class="c1">#小说等分20份， 读取第一份得到santi_test_text</span>

<span class="n">sbs_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_brand_score</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">santi_test_text</span><span class="p">,</span>
                               <span class="n">brands</span><span class="o">=</span><span class="n">brands</span><span class="p">,</span>
                               <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
<span class="n">sbs_df</span>
</code></pre></div><p><img loading="lazy" src="img/19-1st-sbs.png" alt=""  />
</p>
<p>如果将三体小说分成 20 份， 每一份都测算出每个角色的 SBS，绘制出折线图如下图所示。</p>
<p><img loading="lazy" src="img/18-sbs-plot.png" alt=""  />
</p>
<h3 id="29-文本相似度">2.9 文本相似度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.cosine_sim(text1, text2, lang=&#39;chinese&#39;)   cos余弦相似
ct.jaccard_sim(text1, text2, lang=&#39;chinese&#39;)  jaccard相似
ct.minedit_sim(text1, text2, lang=&#39;chinese&#39;)  最小编辑距离相似度；
ct.simple_sim(text1, text2, lang=&#39;chinese&#39;)   更改变动算法
</code></pre></div><p>算法实现参考自 <code>Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text1</span> <span class="o">=</span> <span class="s1">&#39;编程真好玩编程真好玩&#39;</span>
<span class="n">text2</span> <span class="o">=</span> <span class="s1">&#39;游戏真好玩编程真好玩&#39;</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;cosine&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">cosine_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;jaccard&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">jaccard_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;minedit&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">minedit_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;simple&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">simple_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cosine  0.82
jaccard 0.67
minedit 1.00
simple 0.84
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>


<span class="n">text1</span> <span class="o">=</span> <span class="s1">&#39;Programming is fun!&#39;</span>
<span class="n">text2</span> <span class="o">=</span> <span class="s1">&#39;Programming is interesting!&#39;</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;cosine&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">cosine_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;jaccard&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">jaccard_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;minedit&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">minedit_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;simple&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">simple_sim</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cosine  0.67
jaccard 0.50
minedit 1.00
simple 0.78
</code></pre></div><br>
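<p>以 jaccard 相似度为例，它只是两个词集合的交并比，可以几行写出(示意实现，并非 cntext 源码；英文按空格分词)：</p>

```python
def jaccard_sim_demo(text1, text2):
    """Jaccard 相似度：两文本词集合的 交集大小 / 并集大小"""
    s1, s2 = set(text1.lower().split()), set(text2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

print(jaccard_sim_demo('Programming is fun!', 'Programming is interesting!'))
# 交集 {programming, is}，并集共 4 个词，结果 0.5，与上文输出 0.50 一致
```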
<h3 id="210-word_hhi">2.10 word_hhi</h3>
<p>文本的赫芬达尔-赫希曼指数。<code>ct.word_hhi(text, lang=&#39;chinese&#39;)</code></p>
<br>
<p><strong>赫芬达尔-赫希曼指数</strong>(<strong>Herfindahl-Hirschman Index</strong>)作为一种衡量市场集中度的经济指标，通常用于分析产业或市场中企业份额的分布情况。</p>
<p><img loading="lazy" src="img/word-hhi-algo.png" alt=""  />
</p>
<p>前人曾类比市场集中度，将 HHI 用于测量专利质量(知识宽度)。放到文本语言中，我们同样可以利用 HHI 来量化某个语料库中不同词汇使用频率的分布，以此分析个人、群体或时代的语言风格、词汇丰富度，或语言标准化与变化的趋势。</p>
<ul>
<li>如果词汇分布非常均匀，表明语言使用中的词汇多样性高，HHI 值就会较低；</li>
<li>反之，如果少数词汇占据了大部分文本空间，表明词汇使用集中，HHI 值则较高。</li>
</ul>
<p>HHI 可结合其他语言学指标一起使用，比如 TTR（Type-Token Ratio，类型-标记比率）、Shannon entropy（香农熵）等，共同评估语言表达的复杂度和多样性。不过这类研究的文献相对较少，因为语言学领域有自己一套成熟且专业的分析工具和方法，HHI 更多被视为跨学科应用的一个创新尝试。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">personA</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会太嗨了&#39;</span>
<span class="n">personB</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会说出来令你不敢相信，主办方策划有方，群众激情满满，我印象深刻，体验感拉满&#39;</span>


<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A-hhi&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B-hhi&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">A-hhi 0.20000000000000004
B-hhi 0.07024793388429751

A词汇多样性 0.7999999999999999
B词汇多样性 0.9297520661157025
</code></pre></div><br>
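<p>HHI 的计算就是各词出现频率占比的平方和，可以用几行纯 Python 验证上面的结果(示意实现，并非 ct.word_hhi 源码，分词结果为假设)：</p>

```python
from collections import Counter

def word_hhi_demo(words):
    """词汇 HHI：sum((词频 / 总词数) ** 2)，取值越接近 1 表示用词越集中"""
    total = len(words)
    return sum((n / total) ** 2 for n in Counter(words).values())

# 假设 '这场音乐会太嗨了' 的分词结果为 5 个词，各出现一次
words = ['这场', '音乐会', '太', '嗨', '了']
print(word_hhi_demo(words))      # 5 * (1/5)^2 = 0.2，与上文 A-hhi 一致
print(1 - word_hhi_demo(words))  # 词汇多样性 0.8
```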
<br>
<h2 id="三plot-模块">三、Plot 模块</h2>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>plot</strong></td>
<td><code>ct.matplotlib_chinese()</code></td>
<td>支持 matplotlib 中文绘图</td>
</tr>
<tr>
<td><strong>plot</strong></td>
<td><code>ct.lexical_dispersion_plot1(text, targets_dict, lang, title, figsize)</code></td>
<td>对某一个文本 text， 可视化不同目标类别词 targets_dict 在文本中出现位置</td>
</tr>
<tr>
<td><strong>plot</strong></td>
<td><code>ct.lexical_dispersion_plot2(texts_dict, targets, lang, title, figsize)</code></td>
<td>对某几个文本 texts_dict， 可视化某些目标词 targets 在文本中出现相对位置(0~100)</td>
</tr>
</tbody>
</table>
<br>
<h3 id="31-matplotlib_chinese">3.1 matplotlib_chinese()</h3>
<p>matplotlib 默认不支持中文显示，cntext 新增该函数以解决中文可视化问题。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">plt</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">matplotlib_chinese</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;中文图表&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/27-chinese-matplotlib.png" alt=""  />
</p>
<br>
<h3 id="32-lexical_dispersion_plot1">3.2 lexical_dispersion_plot1()</h3>
<p>词汇分散图可视化， 对某一个文本 text， 可视化不同目标类别词 targets_dict 在文本中出现位置</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">targets_dict</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s1">&#39;特定词汇在不同文本来源的相对离散图&#39;</span><span class="p">,</span> <span class="n">prop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>text</em></strong>: 文本数据</li>
<li><strong><em>targets_dict</em></strong>: 目标类别词字典，形如 targets_dict = {&#39;pos&#39;: [&#39;开心&#39;, &#39;快乐&#39;], &#39;neg&#39;: [&#39;悲伤&#39;, &#39;难过&#39;]}</li>
<li><strong><em>lang</em></strong>: 文本数据 text 的语言类型，默认 &#39;chinese&#39;</li>
<li><strong><em>figsize</em></strong>: 图的长宽尺寸，默认 (12, 6)</li>
<li><strong><em>title</em></strong> : 图的标题；</li>
<li><strong><em>prop</em></strong>: 横坐标字符位置是否为相对位置. 默认 True，横坐标索引值取值范围 0 ~ 100</li>
</ul>
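<p>相对位置(prop=True)的含义可以这样理解：词语出现处的字符索引除以全文长度，再乘以 100。下面的小函数演示这一计算(示意实现，并非 cntext 源码)：</p>

```python
import re

def relative_positions_demo(text, target):
    """返回 target 在 text 中每次出现位置的相对坐标(0~100)"""
    return [round(m.start() / len(text) * 100, 1)
            for m in re.finditer(re.escape(target), text)]

# 全文 20 个字符，'罗辑' 出现在索引 10 和 15 处
text = '汪淼出场。' * 2 + '罗辑出场。' * 2
print(relative_positions_demo(text, '罗辑'))
# [50.0, 75.0]
```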
<br>
<p>点击下载 <a href="https://textdata.cn/data/%E4%B8%89%E4%BD%93.txt"><strong>三体.txt</strong></a>、<a href="https://textdata.cn/data/%E5%9F%BA%E5%9C%B0.txt"><strong>基地.txt</strong></a>两本小说文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">roles_dict</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">&#34;汪淼&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;汪淼&#39;</span><span class="p">],</span>
    <span class="s2">&#34;叶文洁&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;叶文洁&#39;</span><span class="p">],</span>
    <span class="s2">&#34;罗辑&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;罗辑&#39;</span><span class="p">]</span>
<span class="p">}</span>

<span class="n">santi_text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span>  <span class="c1">#文本数据</span>
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">roles_dict</span><span class="p">,</span> <span class="c1">#角色</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>  <span class="c1">#尺寸大小</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>  <span class="c1">#中文数据</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》小说角色出现位置&#39;</span><span class="p">,</span> <span class="c1">#标题</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>    <span class="c1">#相对位置(横坐标轴取值范围0-100)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/23-lexical_dispersion_plot1-relative.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span>  <span class="c1">#文本数据</span>
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">roles_dict</span><span class="p">,</span> <span class="c1">#角色</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>  <span class="c1">#尺寸大小</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>  <span class="c1">#中文数据</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》小说角色出现位置&#39;</span><span class="p">,</span> <span class="c1">#标题</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">False</span><span class="p">)</span>    <span class="c1">#绝对位置(横坐标轴取值范围与小说文本长度有关)</span>
</code></pre></div><p><img loading="lazy" src="img/24-lexical_dispersion_plot1-absolute.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># diy了一个小词典</span>
<span class="n">senti_dict</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;pos&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;开心&#39;</span><span class="p">,</span> <span class="s1">&#39;幸福&#39;</span><span class="p">,</span> <span class="s1">&#39;快乐&#39;</span><span class="p">,</span> <span class="s1">&#39;安宁&#39;</span><span class="p">,</span> <span class="s1">&#39;希望&#39;</span><span class="p">],</span>
    <span class="s1">&#39;neg&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;紧张&#39;</span><span class="p">,</span> <span class="s1">&#39;恐惧&#39;</span><span class="p">,</span> <span class="s1">&#39;害怕&#39;</span><span class="p">,</span> <span class="s1">&#39;绝望&#39;</span><span class="p">]</span>
<span class="p">}</span>

<span class="n">santi_text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span>
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">senti_dict</span><span class="p">,</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》情绪词出现位置&#39;</span><span class="p">,</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/25-santi_sentiment.png" alt=""  />
</p>
<br>
<h3 id="33-lexical_dispersion_plot2">3.3 lexical_dispersion_plot2()</h3>
<p>词汇分散图可视化， 对某几个文本 texts_dict， 可视化某些目标词 targets 在文本中出现相对位置(0~100)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot2</span><span class="p">(</span><span class="n">texts_dict</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s1">&#39;特定词汇在不同文本来源的相对离散图&#39;</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>texts_dict</em></strong>: 多个文本的字典数据，形如 {&#39;source1&#39;: &#39;source1 的文本内容&#39;, &#39;source2&#39;: &#39;source2 的文本内容&#39;}</li>
<li><strong><em>targets</em></strong>: 目标词列表</li>
<li><strong><em>lang</em></strong>: 文本数据 texts_dict 的语言类型，默认 &#39;chinese&#39;</li>
<li><strong><em>figsize</em></strong>: 图的长宽尺寸，默认 (12, 6)</li>
<li><strong><em>title</em></strong> : 图的标题；</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;太空&#39;</span><span class="p">,</span> <span class="s1">&#39;宇宙&#39;</span><span class="p">]</span>

<span class="n">texts_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;三体&#39;</span><span class="p">:</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">(),</span>
              <span class="s1">&#39;基地&#39;</span><span class="p">:</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;基地.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()}</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot2</span><span class="p">(</span><span class="n">texts_dict</span> <span class="o">=</span> <span class="n">texts_dict</span><span class="p">,</span>
                            <span class="n">targets</span> <span class="o">=</span> <span class="n">targets</span><span class="p">,</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;&#34;太空/宇宙&#34;词语出现位置&#39;</span><span class="p">,</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/26-santi_base.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四model-模块">四、Model 模块</h2>
<p>本部分主要内容是词嵌入模型相关技术，包括 Word2Vec(GloVe) 的训练、读取、扩展词典。</p>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数(类)</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.Word2Vec(corpus_file, encoding, lang, window_size, vector_size,&hellip;)</em></strong></td>
<td>训练 Word2Vec</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.GloVe(corpus_file, encoding, lang, window_size, vector_size, &hellip;)</em></strong></td>
<td>训练 GloVe 模型。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong>ct.evaluate_similarity(wv, file=None)</strong></td>
<td>使用近义法评估模型表现，默认使用内置的数据进行评估。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong>ct.evaluate_analogy(wv, file=None)</strong></td>
<td>使用类比法评估模型表现，默认使用内置的数据进行评估。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.load_w2v(wv_path)</em></strong></td>
<td>读取 cntext2.x 训练出的 Word2Vec/GloVe 模型文件</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.glove2word2vec(glove_file, word2vec_file)</em></strong></td>
<td>将 GloVe 模型.txt 文件转化为 Word2Vec 模型.txt 文件；注意这里的 GloVe 模型.txt 是通过 <a href="https://github.com/stanfordnlp/GloVe">stanfordnlp/GloVe</a> 训练得到的。</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><strong><em>ct.expand_dictionary(wv, seeddict, topn=100)</em></strong></td>
<td>扩展词典, 结果保存到路径[output/Word2Vec]中</td>
</tr>
<tr>
<td><strong>model</strong></td>
<td><code>ct.SoPmi(corpus_file, seed_file, lang='chinese')</code></td>
<td>共现法扩展词典</td>
</tr>
</tbody>
</table>
<h3 id="41-word2vec">4.1 Word2Vec()</h3>
<p>可直接对原始语料 txt 文件进行 Word2Vec 训练。该函数会自动处理文本预处理(分词、去停用词)、内存管理、参数调整等问题，确保训练过程顺利进行。</p>
<p>在 <strong><em>gensim.models.word2vec.Word2Vec</em></strong> 基础上，增加了中英文的预处理， 简化了代码使用。配置好 cntext2.x 环境， 可以做到</p>
<ol>
<li>训练只用一行代码</li>
<li>读取调用只用一行代码</li>
</ol>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.Word2Vec(corpus_file, lang=&#39;chinese&#39;, dict_file=None, stopwords_file=None, vector_size=100, window_size=6, min_count=5, max_iter=5, chunksize=10000, only_binary=True, **kwargs)
</code></pre></div><ul>
<li><strong><em>corpus_file</em></strong>: 语料库文件的路径。</li>
<li><strong><em>lang</em></strong>: 语言类型，支持 &lsquo;chinese&rsquo; 和 &lsquo;english&rsquo;，默认为 &lsquo;chinese&rsquo;。</li>
<li><strong><em>dict_file</em></strong>: 自定义词典 txt 文件路径，默认为 None。utf-8 编码。</li>
<li><strong><em>stopwords_file</em></strong>: 停用词文件路径，默认为 None。utf-8 编码。</li>
<li><strong><em>vector_size</em></strong>: 词向量的维度，默认为 100。</li>
<li><strong><em>window_size</em></strong>: 上下文窗口的大小，默认为 6。</li>
<li><strong><em>min_count</em></strong>: 最小词频，默认为 5。</li>
<li><strong><em>max_iter</em></strong>: 最大迭代次数，默认为 5。</li>
<li><strong><em>chunksize</em></strong>: 每次读取的行数。默认为 10000。越大速度越快。</li>
<li><strong><em>only_binary</em></strong>: 是否只保存二进制模型文件。默认为 True，只保存 bin；为 False 时同时保存 bin 和 txt。</li>
<li><strong><em>kwargs</em></strong>: 其他 gensim 可选参数，如 negative、sample、hs 等。</li>
</ul>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span> <span class="o">=</span> <span class="s1">&#39;data/三体.txt&#39;</span><span class="p">,</span>
                  <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                  <span class="n">window_size</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span>
                  <span class="n">vector_size</span> <span class="o">=</span> <span class="mi">50</span><span class="p">)</span>


<span class="n">w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/三体_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/三体_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 10 s.
Output Saved To: output/Word2Vec/三体-Word2Vec.50.6.bin
</code></pre></div><p>[data/三体.txt]体积 2.7M， 训练时间 10s， 模型文件存储于 <strong><em>output/Word2Vec/三体-Word2Vec.50.6.bin</em></strong></p>
<p><img loading="lazy" src="img/03-word2vec.png" alt=""  />
</p>
<br>
<br>
<h3 id="42-glove">4.2 GloVe()</h3>
<p>使用 Stanford GloVe 代码工具训练 GloVe 模型。该函数会自动处理文本预处理、内存管理、参数调整等问题，确保训练过程顺利进行。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.GloVe(corpus_file, lang=&#39;chinese&#39;, dict_file=None, stopwords_file=None, vector_size=100, window_size=15, min_count=5, max_memory=4.0, max_iter=15, x_max=10, only_binary=True, chunksize=10000)
</code></pre></div><ul>
<li><strong><em>corpus_file</em></strong>: 输入语料文件路径（txt 格式），函数会自动完成分词等预处理。</li>
<li><strong><em>lang</em></strong>: 语料文件的语言类型，默认为 &lsquo;chinese&rsquo;。</li>
<li><strong><em>dict_file</em></strong>: 自定义词典 txt 文件路径，默认为 None。utf-8 编码。</li>
<li><strong><em>stopwords_file</em></strong>: 停用词文件路径，默认为 None。utf-8 编码。</li>
<li><strong><em>vector_size</em></strong>: 词向量维度，默认 100。</li>
<li><strong><em>window_size</em></strong>: 上下文窗口大小，默认 15。</li>
<li><strong><em>min_count</em></strong>: 忽略出现次数低于此值的单词，默认 5。</li>
<li><strong><em>max_memory</em></strong>: 可供使用的最大内存，单位为 GB，默认 4；该参数越大，训练越快。</li>
<li><strong><em>max_iter</em></strong>: 训练的最大迭代次数，默认 15。</li>
<li><strong><em>x_max</em></strong>: 共现矩阵中元素的最大计数值，默认 10。</li>
<li><strong><em>chunksize</em></strong>: 每次读取的行数。默认为 10000。越大速度越快。</li>
<li><strong><em>only_binary</em></strong>: 是否只保存二进制模型文件。默认为 True，只保存 bin；为 False 时同时保存 bin 和 txt。</li>
</ul>
<br>
<p>ct.GloVe 内置 <a href="https://nlp.stanford.edu/projects/glove/">Stanford GloVe</a>算法， 训练速度非常快。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">glove</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">GloVe</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;data/三体.txt&#39;</span><span class="p">,</span>
                 <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span>
                 <span class="n">vector_size</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
                 <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>

<span class="n">glove</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/三体_cache.txt Not Found or Empty, Preprocessing Corpus
Start Training GloVe
BUILDING VOCABULARY
Using vocabulary of size 6975.

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 2106999 lines.

Using random seed 1743474106
SHUFFLING COOCCURRENCES
Merging temp files: processed 2106999 lines.

TRAINING MODEL
Read 2106999 lines.
Using random seed 1743474106
04/01/25 - 10:21.46AM, iter: 001, cost: 0.055981
04/01/25 - 10:21.46AM, iter: 002, cost: 0.050632
......
04/01/25 - 10:21.48AM, iter: 014, cost: 0.030047
04/01/25 - 10:21.48AM, iter: 015, cost: 0.029100

GloVe Training Cost 9 s.
Output Saved To: output/三体-GloVe.50.15.bin
&lt;gensim.models.keyedvectors.KeyedVectors at 0x331517440&gt;
</code></pre></div><p><img loading="lazy" src="img/05-glove.png" alt=""  />
</p>
<p>训练生成的 <code>output/GloVe/三体-GloVe.50.15.bin</code> 可用 <strong><em>ct.load_w2v</em></strong> 读取，在后面会有展示。</p>
<br>
<h3 id="43-evaluate_similarity">4.3 evaluate_similarity()</h3>
<p>评估词向量模型的语义相似表现。使用 Spearman&rsquo;s Rank Coefficient 作为评价指标，取值范围 [-1, 1]：1 表示完全正相关，-1 表示完全负相关，0 表示毫无相关性。</p>
<p>cntext2.x 内置 537 条近义实验数据， 可直接使用。</p>
<p><img loading="lazy" src="img/01-similar.png" alt=""  />
</p>
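<p>Spearman 等级相关系数的计算原理可用纯 Python 简单示意。以下是无并列名次时的示意实现（人工评分与模型得分均为假设数据，非 cntext 源码）：</p>

```python
# Spearman 等级相关: 1 - 6*Σd² / (n(n²-1))，d 为两组名次之差（无并列名次的简化实现）
def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [4.98, 4.88, 0.92, 0.50]   # 人工相似度打分（假设）
model = [0.95, 0.90, 0.30, 0.35]   # 模型计算的相似度（假设）
print(spearman(human, model))      # 0.8
```

两组打分的名次顺序越一致，系数越接近 1。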
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">file</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>wv</strong> 词向量模型，数据类型为 gensim.models.keyedvectors.KeyedVectors</li>
<li><strong>file</strong> 评估数据文件，txt 格式，默认使用 cntext 内置的评估数据文件。txt 文件每行为两个词语和一个人工相似度评分，如下所示</li>
</ul>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">足球	足球	4.98
老虎	老虎	4.8888888889
恒星	恒星	4.7222222222
入场券	门票	4.5962962963
空间	化学	0.9222222222
股票	电话	0.92
国王	车	0.9074074074
中午	字符串	0.6
收音机	工作	0.6
教授	黄瓜	0.5
自行车	鸟	0.5
蛋白质	文物	0.15
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 可在 https://cntext.readthedocs.io/zh-cn/latest/embeddings.html 下载该模型文件</span>
<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/douban-movie-1000w-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>

<span class="c1"># 使用内置评估文件</span>
<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">)</span>
<span class="c1"># 使用自定义评估文件</span>
<span class="c1"># ct.evaluate_similarity(wv=dm_w2v, file=&#39;diy_similarity.txt&#39;)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt
Processing Similarity Test: 100%|██████████| 537/537 [00:00&lt;00:00, 85604.55it/s]

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   459    |     78     |            0.43            |
+----------+------------+----------------------------+
</code></pre></div><br>
<h3 id="44-evaluate_analogy">4.4 evaluate_analogy()</h3>
<p>用于评估词向量模型在类比测试（analogy test）中表现的函数。它通过读取指定的类比测试文件，计算模型对词语关系预测的准确性，并输出每个类别的准确率、发现词语数量、未发现词语数量以及平均排名等指标。</p>
<ul>
<li>雅典之于希腊，似如巴格达之于伊拉克。</li>
<li>哈尔滨之于黑龙江，似如长沙之于湖南。</li>
<li>国王之于王后，似如男人之于女人。</li>
</ul>
<p><img loading="lazy" src="img/02-analogy-woman.png" alt=""  />
</p>
<p>cntext2.x 内置 1194 条类比， 格式如下</p>
<p><img loading="lazy" src="img/03-analogy.png" alt=""  />
</p>
<p>类比测试的核心是解决形如 &ldquo;A : B :: C : D&rdquo; 的问题，即 &ldquo;A 之于 B，似如 C 之于 D&rdquo;：通过 A、B 的类比关系，找到 C 的关系词 D。该函数通过词向量模型的相似性搜索功能，计算预测结果与真实答案的匹配程度。</p>
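<p>这一思路可用玩具向量简单示意：用 B - A + C 得到目标向量，再在词表中检索与其余弦相似度最高的词（以下二维向量均为假设数据，非真实词向量、非 cntext 源码）：</p>

```python
import math

# 假设的二维玩具词向量
vecs = {
    '国王': [0.9, 0.8],
    '王后': [0.9, 0.2],
    '男人': [0.5, 0.8],
    '女人': [0.5, 0.2],
    '飞船': [0.1, 0.9],
}

def cos(a, b):
    # 余弦相似度
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# 国王 - 男人 + 女人 ≈ ?
target = [k - m + w for k, m, w in zip(vecs['国王'], vecs['男人'], vecs['女人'])]
best = max((w for w in vecs if w not in ('国王', '男人', '女人')),
           key=lambda w: cos(vecs[w], target))
print(best)  # 王后
```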
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">file</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>wv</strong> 词向量模型，数据类型为 gensim.models.keyedvectors.KeyedVectors</li>
<li><strong>file</strong> 评估数据文件，txt 格式，默认使用 cntext 内置的评估数据文件。txt 文件以 <code>: 类别名</code> 行分组，组内每行四个词语。</li>
</ul>
<br>
<p>评估数据 txt 文件格式，如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">: CapitalOfCountries
雅典 希腊 巴格达 伊拉克
哈瓦那 古巴 马德里 西班牙
河内 越南 伦敦 英国
: CityInProvince
石家庄 河北 南昌 江西
沈阳 辽宁 南昌 江西
南京 江苏 郑州 河南
: FamilyRelationship
男孩 女孩 兄弟 姐妹
男孩 女孩 国王 王后
父亲 母亲 国王 王后
丈夫 妻子 叔叔 阿姨
: SocialScience-Concepts
社会 社会结构 家庭 家庭结构
文化 文化传承 语言 语言传承
群体 群体行为 组织 组织行为
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 可在 https://cntext.readthedocs.io/zh-cn/latest/embeddings.html 下载该模型文件</span>
<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/douban-movie-1000w-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>

<span class="c1"># 使用内置评估文件</span>
<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">)</span>
<span class="c1"># 使用自定义评估文件</span>
<span class="c1"># ct.evaluate_analogy(wv=dm_w2v, file=&#39;diy_analogy.txt&#39;)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|█████████████| 1198/1198 [00:11&lt;00:00, 103.52it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   615    |     62     |   39.02    |   2.98   |
|   CityInProvince   |   175    |     0      |   28.57    |   4.74   |
| FamilyRelationship |   272    |     0      |   92.65    |   1.48   |
|   SocialScience    |    8     |     62     |   25.00    |   6.00   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p>豆瓣电影语料在 FamilyRelationship 评估中表现较好，大概率是因为电影主要反映人与人之间的关系，覆盖了绝大多数家庭类比关系；但在其他类别上表现较差。</p>
<p>如果是维基百科语料，可能在 CapitalOfCountries、CityInProvince、SocialScience 中表现较好。</p>
<br>
<h3 id="45-sopmi">4.5 SoPmi()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">SoPmi</span><span class="p">(</span><span class="n">corpus_file</span><span class="p">,</span> <span class="n">seed_file</span><span class="p">)</span>       <span class="c1">#人工标注的初始种子词</span>
</code></pre></div><ul>
<li><strong>corpus_file</strong> 语料 txt 文件路径</li>
<li><strong>seed_file</strong> 初始种子词 txt 文件路径</li>
</ul>
<p>共现法扩展词典示例：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">ct</span><span class="o">.</span><span class="n">SoPmi</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;data/sopmi_corpus.txt&#39;</span><span class="p">,</span>
         <span class="n">seed_file</span><span class="o">=</span><span class="s1">&#39;data/sopmi_seed.txt&#39;</span><span class="p">)</span>       <span class="c1"># 人工标注的初始种子词</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Step 1/4:...Preprocess   Corpus ...
Step 2/4:...Collect co-occurrency information ...
Step 3/4:...Calculate   mutual information ...
Step 4/4:...Save    candidate words ...
Finish! used 19.74 s
</code></pre></div><p><img loading="lazy" src="img/06-sopmi.png" alt=""  />
</p>
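<p>SoPmi 的核心是 SO-PMI 思想：候选词与正种子词的点互信息(PMI)之和，减去与负种子词的 PMI 之和，差值为正则候选词偏正面。以下用假设的词频数据做纯 Python 示意（非 cntext 源码）：</p>

```python
import math

total = 10000                                        # 语料总词次（假设）
count = {'给力': 120, '优秀': 80, '糟糕': 90}          # 词频（假设）
cooccur = {('给力', '优秀'): 40, ('给力', '糟糕'): 2}   # 共现次数（假设）

def pmi(w1, w2):
    """PMI(w1, w2) = log2( p(w1,w2) / (p(w1)*p(w2)) )"""
    p_joint = cooccur.get((w1, w2), 0) / total
    if p_joint == 0:
        return 0.0
    return math.log2(p_joint / ((count[w1] / total) * (count[w2] / total)))

# SO-PMI(候选词) = PMI(候选词, 正种子词) - PMI(候选词, 负种子词)
so_pmi = pmi('给力', '优秀') - pmi('给力', '糟糕')
print(so_pmi > 0)  # True：'给力' 偏向正面种子词
```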
<br>
<h3 id="46-load_w2v">4.6 load_w2v()</h3>
<p>导入 cntext2.x 训练得到的 Word2Vec/GloVe 模型文件（.bin 或 .txt）。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">w2v_path</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>w2v_path</strong> 模型文件路径</li>
</ul>
<p>读取 <strong><em>output/三体-Word2Vec.50.6.bin</em></strong> 模型文件, 返回 <code>gensim.models.keyedvectors.KeyedVectors</code> 类型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">santi_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">w2v_path</span><span class="o">=</span><span class="s1">&#39;output/三体-Word2Vec.50.6.bin&#39;</span><span class="p">)</span>
<span class="c1"># santi_w2v = ct.load_wv(wv_path=&#39;output/三体-Word2Vec.50.6.txt&#39;)</span>

<span class="n">santi_glove</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="n">w2v_path</span><span class="o">=</span><span class="s1">&#39;output/三体-GloVe.50.15.bin&#39;</span><span class="p">)</span>
<span class="c1"># santi_glove = ct.load_wv(wv_path=&#39;output/三体-GloVe.50.15.bin&#39;)</span>

<span class="n">santi_w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/三体-Word2Vec.50.6.bin...
Loading output/三体-GloVe.50.15.bin...
&lt;gensim.models.keyedvectors.KeyedVectors at 0x33aa9cf80&gt;
</code></pre></div><br>
<h3 id="47-glove2word2vec">4.7 glove2word2vec()</h3>
<p>将 GloVe 模型.txt 文件转化为 Word2Vec 模型.txt 文件；除非使用从网络下载的 GloVe 模型资源，否则一般情况下用不到这个函数。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">glove2word2vec</span><span class="p">(</span><span class="n">glove_file</span><span class="p">,</span> <span class="n">word2vec_file</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>glove_file</em></strong>: GLoVe 模型.txt 文件路径</li>
<li><strong><em>word2vec_file</em></strong>: Word2Vec 模型.txt 文件路径</li>
</ul>
<br>
<p>注意这里的 GloVe 模型.txt 是通过 <a href="https://github.com/stanfordnlp/GloVe">stanfordnlp/GloVe</a> 训练得到的</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">ct</span><span class="o">.</span><span class="n">glove2word2vec</span><span class="p">(</span><span class="n">glove_file</span><span class="o">=</span><span class="s1">&#39;data/GloVe.6B.50d.txt&#39;</span><span class="p">,</span>
                  <span class="n">word2vec_file</span><span class="o">=</span><span class="s1">&#39;output/word2vec_format_GloVe.6B.50d.txt&#39;</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="注意">注意</h3>
<ul>
<li><strong><em>ct.load_w2v()</em></strong> 导入后得到的数据类型是 <strong><em>gensim.models.keyedvectors.KeyedVectors</em></strong> 。</li>
<li><strong><em>gensim.models.word2vec.Word2Vec</em></strong> 对象可通过其 <code>.wv</code> 属性转化为 <strong><em>gensim.models.keyedvectors.KeyedVectors</em></strong> 。</li>
</ul>
<br>
<h3 id="48-expand_dictionary">4.8 expand_dictionary()</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.expand_dictionary(wv,  seeddict, topn=100)
</code></pre></div><ul>
<li><strong>wv</strong> 预训练模型，数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong>seeddict</strong> 参数类似于种子词；格式为 PYTHON 字典；</li>
<li><strong>topn</strong> 返回 topn 个语义最接近 seeddict 的词</li>
</ul>
<p>根据设置的 seeddict, 可按类别扩展并生成对应的词典 txt 文件， txt 文件位于[output]文件夹内。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">seeddict</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;人物&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;叶文洁&#39;</span><span class="p">,</span> <span class="s1">&#39;史强&#39;</span><span class="p">,</span> <span class="s1">&#39;罗辑&#39;</span><span class="p">],</span>
    <span class="s1">&#39;物体&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;飞船&#39;</span><span class="p">,</span> <span class="s1">&#39;车辆&#39;</span><span class="p">]</span>
<span class="p">}</span>


<span class="n">ct</span><span class="o">.</span><span class="n">expand_dictionary</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">santi_w2v</span><span class="o">.</span><span class="n">wv</span><span class="p">,</span>
                     <span class="n">seeddict</span><span class="o">=</span><span class="n">seeddict</span><span class="p">,</span>
                     <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-expand.png" alt=""  />
</p>
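<p>expand_dictionary 的基本思路可示意为：先计算各类别种子词的语义中心向量，再返回与该中心余弦相似度最高的 topn 个词（以下二维向量为假设数据，非 cntext 源码）：</p>

```python
import math

# 假设的二维玩具词向量
vecs = {
    '叶文洁': [0.90, 0.10], '史强': [0.80, 0.10], '罗辑': [0.85, 0.15],
    '程心': [0.88, 0.12], '飞船': [0.10, 0.90], '水滴': [0.12, 0.88],
}

def centroid(words):
    # 各维度取均值，得到语义中心向量
    return [sum(d) / len(words) for d in zip(*(vecs[w] for w in words))]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

seeds = ['叶文洁', '史强', '罗辑']
center = centroid(seeds)
candidates = [w for w in vecs if w not in seeds]
topn = 1
expanded = sorted(candidates, key=lambda w: cos(vecs[w], center), reverse=True)[:topn]
print(expanded)  # ['程心']
```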
<br>
<br>
<h2 id="五mind-模块">五、Mind 模块</h2>
<p>词嵌入中蕴含着人类的认知信息。以往的研究大多通过比较某概念的两组反义词与某对象的语义距离来计算认知信息。</p>
<ul>
<li>
<p><strong>多个对象与某概念的语义远近</strong>，如职业与性别：某类职业是否亲近男性而排斥女性</p>
</li>
<li>
<p>多个对象在某概念向量投影的大小。人类语言中留存着对不同动物体积的认知记忆，如小鼠与大象。动物词在词向量空间中是否也留存着这种大小记忆？</p>
</li>
</ul>
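<p>上述第二类测量即向量投影：词语向量 a 在概念轴向量 b 上的标量投影为 (a·b)/|b|。下面用假设的二维向量简单示意（非 cntext 源码）：</p>

```python
import math

def project(a, b):
    """计算向量 a 在向量 b 方向上的标量投影 (a·b)/|b|"""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in b))

size_axis = [1.0, 0.0]      # 假设的 "大-小" 概念轴
elephant = [0.9, 0.3]       # 假设的 "大象" 词向量
mouse = [0.2, 0.4]          # 假设的 "小鼠" 词向量

# 大象在 "大" 方向上的投影大于小鼠，与人类对体积的认知一致
print(project(elephant, size_axis) > project(mouse, size_axis))  # True
```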
<p>本模块主要是利用已训练出的 word2vec 模型，挖掘潜在的态度偏见、刻板印象等。 这部分难度较大， 建议有精力且电脑性能好的同学可以用 cntext 训练模型， 再来实验 Mind 模块。</p>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数(类)</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.semantic_centroid(wv, words)</code></td>
<td>计算多个词语的语义中心向量</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.generate_concept_axis(wv, poswords, negwords)</code></td>
<td>生成概念轴向量。</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.sematic_projection(wv, words, poswords, negwords)</code></td>
<td>测量语义投影</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.project_word(wv, a, b, cosine=False)</code></td>
<td>在词向量空间中， 计算词语 a 在词语 b 上的投影</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.project_text(wv, text, axis, lang='chinese', cosine=False)</code></td>
<td>计算文本 text 在概念轴向量 axis 上的投影值</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.wepa(wv, text, poswords, negwords, lang='chinese')</code></td>
<td>计算文本在概念轴上的投影得分，返回 wepa 得分</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.sematic_distance(wv, words1, words2)</code></td>
<td>测量语义距离</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.divergent_association_task(wv, words)</code></td>
<td>测量发散思维(创造力)</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><code>ct.discursive_diversity_score(wv, words)</code></td>
<td>测量语言差异性(认知差异性)</td>
</tr>
<tr>
<td><strong>mind</strong></td>
<td><strong>ct.procrustes_align(base_wv, other_wv)</strong></td>
<td>两个 word2vec 模型进行语义对齐，可反映随时间的社会语义变迁</td>
</tr>
</tbody>
</table>
<br>
<h3 id="51-semantic_centroidwv-words">5.1 semantic_centroid(wv, words)</h3>
<p>计算多个词语的语义中心向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 获取词向量文件 https://cntext.readthedocs.io/zh-cn/latest/embeddings.html</span>
<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;专利摘要-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">w2v</span><span class="p">,</span> <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;颠覆&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([ 0.15567462, -0.05117003, -0.18534171,  0.20808656, -0.01133028,
        0.10738188, -0.02571066,  0.06051835,  0.00107351,  0.08017981,
        0.08914138,  0.01845527,  0.06232869, -0.03851539, -0.17092938,
        0.02196799, -0.04136903,  0.11350462, -0.09539546,  0.04907424,
        0.01268489,  0.05294977,  0.08449743, -0.02762416,  0.02332745,
        0.08865491, -0.06260188, -0.0378293 ,  0.04771722,  0.05745243,
        0.04417403, -0.04126203, -0.02403288, -0.03834526,  0.08115771,
        0.01508994,  0.07678635,  0.01395652,  0.1360324 ,  0.03027042,
       -0.02819572,  0.02339242,  0.11504567,  0.02910597,  0.06149592,
        0.01126606, -0.10132807,  0.07762785, -0.01214836,  0.03780747,
        0.12758181, -0.03115267, -0.19343086, -0.21930983,  0.05253006,
       -0.01452067, -0.07067247, -0.04237257, -0.08911953,  0.08573315,
        0.02742999,  0.05392318,  0.02916237,  0.04465031, -0.0788566 ,
       -0.07088121,  0.03111146,  0.00387428, -0.04032568,  0.14935694,
       -0.03880607,  0.07259471,  0.01711774, -0.05551507,  0.01039889,
        0.00666137,  0.03313185,  0.03169986,  0.08127907,  0.0239668 ,
       -0.00991806, -0.04201584,  0.01199235, -0.08669737, -0.02087858,
       -0.03440931,  0.02360864,  0.06623896, -0.01020982,  0.01200165,
        0.01059455,  0.13041293,  0.01103112,  0.03814259, -0.01519256,
        0.02946554,  0.00593279,  0.08796389,  0.0198915 , -0.0569265 ,
       -0.14622693,  0.07680258, -0.02288322, -0.04959924,  0.03325186,
        0.11031196,  0.06893978,  0.04289736, -0.0307357 , -0.09662723,
        0.02554002,  0.05394766,  0.047071  , -0.09522557, -0.08160087,
       -0.01467315, -0.01304489,  0.07513782,  0.04484766, -0.0516454 ,
        0.00648148,  0.01093231, -0.00303798, -0.06217093,  0.02755075,
       -0.10749754, -0.05205868, -0.02562402,  0.09068517,  0.05208463,
       -0.11790312,  0.02881086, -0.02414756,  0.00192055,  0.03881926,
       -0.05390498,  0.06648378,  0.02055933, -0.07083403, -0.07248309,
       -0.12991821,  0.0603951 ,  0.14131376, -0.01507344, -0.06480791,
       -0.08994781, -0.03397571,  0.0108852 , -0.02777362,  0.01159309,
        0.00121858, -0.0690551 , -0.07747664,  0.03437752, -0.14576062,
        0.06320656, -0.10743124, -0.01910913,  0.15803815, -0.03027673,
       -0.02909171, -0.03350233, -0.0694584 , -0.09807504, -0.09133697,
       -0.01123043,  0.04894681, -0.01971908, -0.08290677, -0.00336836,
        0.09619438, -0.03496556,  0.09733834, -0.0421683 ,  0.01408717,
        0.03355598,  0.00748263,  0.011903  , -0.12909584,  0.01545653,
        0.07656407,  0.09496018,  0.0608537 ,  0.00597665, -0.01628997,
        0.06285962, -0.16796936, -0.0486528 ,  0.01525079, -0.03067709,
       -0.02952635, -0.02731965, -0.06351878,  0.03577968,  0.0457835 ,
        0.08370785, -0.03491699, -0.12606403, -0.08686454, -0.04782247])
</code></pre></div><br>
<h3 id="52-generate_concept_axiswv-poswords-negwords">5.2 generate_concept_axis(wv, poswords, negwords)</h3>
<p>生成概念轴向量。</p>
<ul>
<li><strong><em>wv</em></strong> 词向量模型，数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong><em>poswords</em></strong> 第一个词语列表，表示概念的正向词。</li>
<li><strong><em>negwords</em></strong> 第二个词语列表，表示概念的负向词。</li>
</ul>
<p>需要注意， poswords 与 negwords 应是性质(方向)相反的两组概念词， 如</p>
<ul>
<li>性别(男, 女)</li>
<li>尺寸(大, 小)</li>
<li>高度(高, 低)</li>
<li>方向(前, 后)</li>
<li>湿度(干, 湿)</li>
<li>财富(贫, 富)</li>
</ul>
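<p>概念轴向量的构造思路可示意为：正向词的语义中心向量减去负向词的语义中心向量（以下为假设的二维玩具向量，非 cntext 源码）：</p>

```python
def centroid(vectors):
    # 各维度取均值，得到语义中心向量
    n = len(vectors)
    return [sum(d) / n for d in zip(*vectors)]

pos_vecs = [[0.9, 0.2], [0.8, 0.1]]   # 假设的 '男' 方向词向量
neg_vecs = [[0.1, 0.8], [0.2, 0.9]]   # 假设的 '女' 方向词向量

# 概念轴 = 正向中心 - 负向中心
axis = [p - q for p, q in zip(centroid(pos_vecs), centroid(neg_vecs))]
print(axis)  # 约 [0.7, -0.7]
```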
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 获取词向量文件</span>
<span class="c1"># https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings</span>
<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;douban-movie-1000w-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="n">gender_axis_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">,</span>
                                              <span class="n">poswords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;男&#39;</span><span class="p">,</span> <span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">],</span>
                                              <span class="n">negwords</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;女&#39;</span><span class="p">,</span> <span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">])</span>
<span class="n">gender_axis_vector</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-0.0118976 ,  0.03178174, -0.04656127,  0.00613294, -0.03692355,
       -0.06293361, -0.04739443,  0.01368712,  0.02603469, -0.02268519,
       -0.09925436,  0.05780286,  0.11218373,  0.07519485,  0.06885784,
        0.05505687, -0.04097392,  0.1737831 ,  0.05118835, -0.06879821,
        0.04762978,  0.02224233, -0.04891564, -0.08712718, -0.01432874,
       -0.07395219,  0.01229804,  0.06655715, -0.01864985, -0.04864848,
        0.00260787,  0.06843776,  0.00472286,  0.03623124,  0.11959086,
       -0.04683099, -0.11005358,  0.0271024 , -0.05976011,  0.12669185,
        0.03592191, -0.01125782, -0.02587771, -0.02719228,  0.0507662 ,
       -0.09198377,  0.09546432, -0.01937146,  0.06106697, -0.0405688 ,
       -0.1311393 ,  0.06090249,  0.03515694,  0.01364273, -0.02491697,
        0.03379048, -0.06635275,  0.01432849,  0.01212378, -0.0625283 ,
       -0.03481676, -0.0422427 , -0.17145215, -0.06323837,  0.02563147,
       -0.02371969,  0.01217621, -0.00346871,  0.07024875,  0.08295133,
        0.00731711, -0.01932047,  0.02165518, -0.09927654, -0.08531073,
        0.01949702,  0.00536061,  0.10426087, -0.02010326,  0.02297032,
       -0.10657956,  0.1035546 ,  0.00569263, -0.0849498 ,  0.1098236 ,
        0.05310893, -0.0802139 , -0.01034231, -0.12204715,  0.01407488,
       -0.01781198, -0.0134118 ,  0.09836894,  0.16098371,  0.00609895,
        0.05433145, -0.08940306,  0.00136946, -0.08455469, -0.08432727,
        0.04675778, -0.03415223, -0.18552355, -0.05219543, -0.01127822,
        0.02059881, -0.08120015, -0.15610164,  0.01439221,  0.01727759,
       -0.14516874,  0.01783531, -0.13099317,  0.03820422,  0.03033866,
       -0.01779634,  0.07759558,  0.15866944,  0.00191632, -0.00905253,
        0.0312649 , -0.05698524,  0.07270953, -0.00734233,  0.06289094,
        0.01014149, -0.0052088 ,  0.02478063, -0.0112649 , -0.0930789 ,
        0.14639418, -0.08183327, -0.08392337, -0.01458992, -0.0163887 ,
        0.06790476, -0.03252221,  0.08593727,  0.10469338, -0.01363467,
        0.00749907, -0.01320484,  0.08405331,  0.0489707 , -0.11343482,
       -0.10319041, -0.02415894,  0.13382405, -0.01983603, -0.00990637,
       -0.03335103,  0.11718886, -0.05802442, -0.18935862, -0.07409969,
       -0.08306517, -0.04423901,  0.11331058,  0.00588326,  0.06339834,
        0.04405889,  0.1263905 , -0.007273  , -0.02706875,  0.02325469,
       -0.13092995,  0.02056245, -0.0442118 , -0.01964739, -0.06501938,
        0.02196051, -0.1823353 ,  0.04273191,  0.01935809, -0.01464438,
       -0.02626805,  0.09194217,  0.02489716,  0.05376589, -0.00484252,
        0.02822759,  0.06744799, -0.14196248,  0.03016541, -0.05347864,
       -0.16907257,  0.05094757,  0.0721257 , -0.00421157,  0.03022675,
       -0.00047884,  0.07792547, -0.00209365,  0.0669208 ,  0.02009218,
        0.11358768, -0.05002993,  0.01760067,  0.03407429, -0.0893421 ],
      dtype=float32)
</code></pre></div><br>
<h3 id="53-sematic_distance">5.3 sematic_distance()</h3>
<p><strong>多个对象与某概念的语义远近</strong>，例如成功与性别：成功概念是否亲近男性而排斥女性</p>
<p><img loading="lazy" src="img/21-music-success-genderbias.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.sematic_distance(wv, words1, words2)
</code></pre></div><ul>
<li><strong><em>wv</em></strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong><em>words1</em></strong>、<strong><em>words2</em></strong> 均为词语列表</li>
</ul>
<p>分别计算 <strong><em>words1</em></strong> 与 <strong><em>words2</em></strong> 语义距离，返回距离差值。例如</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">male_concept = [&#39;male&#39;, &#39;man&#39;, &#39;he&#39;, &#39;him&#39;]
female_concept = [&#39;female&#39;, &#39;woman&#39;, &#39;she&#39;, &#39;her&#39;]
engineer_concept  = [&#39;engineer&#39;,  &#39;programming&#39;,  &#39;software&#39;]

dist(male, engineer) = distance(male_concept,  engineer_concept)
dist(female, engineer) = distance(female_concept,  engineer_concept)
</code></pre></div><p>如果 <strong><em>dist(male, engineer)-dist(female, engineer)&lt;0</em></strong>，说明在语义空间中，<strong><em>engineer_concept</em></strong> 更接近 <strong><em>male_concept</em></strong> ，更远离 <strong><em>female_concept</em></strong> 。</p>
<p>换言之，在该语料中，人们对软件工程师这一类工作，对女性存在刻板印象(偏见)。</p>
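<p>两组词语义距离差的计算思路可示意为：两组词向量两两余弦距离的平均值之差（以下二维向量为假设数据，非 cntext 源码）：</p>

```python
import math

def cosine_distance(a, b):
    # 余弦距离 = 1 - 余弦相似度
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return 1 - dot / (norm(a) * norm(b))

def group_distance(group1, group2):
    """两组向量两两余弦距离的平均值"""
    pairs = [(a, b) for a in group1 for b in group2]
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

male = [[0.9, 0.1], [0.8, 0.2]]      # 假设的男性概念词向量
female = [[0.1, 0.9], [0.2, 0.8]]    # 假设的女性概念词向量
engineer = [[0.7, 0.3]]              # 假设的工程师概念词向量

dist_male = group_distance(male, engineer)
dist_female = group_distance(female, engineer)
print(dist_male - dist_female < 0)   # True：engineer 更接近 male
```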
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># glove_w2v.6B.100d.txt链接: https://pan.baidu.com/s/1MMfQ7M0YCzL9Klp4zrlHBw 提取码: 72l0</span>
<span class="n">g_wv</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;data/glove_w2v.6B.100d.txt&#39;</span><span class="p">)</span>

<span class="n">engineer</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;program&#39;</span><span class="p">,</span> <span class="s1">&#39;software&#39;</span><span class="p">,</span> <span class="s1">&#39;computer&#39;</span><span class="p">]</span>
<span class="n">man_words</span> <span class="o">=</span>  <span class="p">[</span><span class="s2">&#34;man&#34;</span><span class="p">,</span> <span class="s2">&#34;he&#34;</span><span class="p">,</span> <span class="s2">&#34;him&#34;</span><span class="p">]</span>
<span class="n">woman_words</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;woman&#34;</span><span class="p">,</span> <span class="s2">&#34;she&#34;</span><span class="p">,</span> <span class="s2">&#34;her&#34;</span><span class="p">]</span>

<span class="n">dist_male_engineer</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">sematic_distance</span><span class="p">(</span><span class="n">g_wv</span><span class="p">,</span> <span class="n">man_words</span><span class="p">,</span>  <span class="n">engineer</span><span class="p">)</span>
<span class="n">dist_female_engineer</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">sematic_distance</span><span class="p">(</span><span class="n">g_wv</span><span class="p">,</span> <span class="n">woman_words</span><span class="p">,</span>  <span class="n">engineer</span><span class="p">)</span>

<span class="n">dist_male_engineer</span> <span class="o">-</span> <span class="n">dist_female_engineer</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">-0.5
</code></pre></div><p>由于 dist_male_engineer &lt; dist_female_engineer ，说明在该语义空间中，工程师概念更接近男性，而非女性。</p>
<br>
<h3 id="54-sematic_projection">5.4 sematic_projection()</h3>
<p>多个对象在某概念向量投影的大小</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">sematic_projection</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">words</span><span class="p">,</span> <span class="n">poswords</span><span class="p">,</span> <span class="n">negwords</span><span class="p">,</span> <span class="n">return_full</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">cosine</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong><em>wv</em></strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong><em>words</em></strong>、<strong><em>poswords</em></strong>、<strong><em>negwords</em></strong> 均为词语列表</li>
<li><strong>cosine</strong>: 是否使用余弦相似度，默认为False，返回投影值；True时返回余弦相似度</li>
<li><strong>return_full</strong>: 是否返回完整元组列表，默认为False</li>
</ul>
<br>
<p>为了解释词向量模型的语义投影，这里借用 2022 年 Nature 论文中的图片[@Grand2022SemanticPR]。人类对动物体型大小的认知信息，隐藏在语料文本中关于动物名字的用法里。用 <strong>LARGE WORDS</strong> 与 <strong>SMALL WORDS</strong> 两组词构建 <strong>size 向量</strong>，再将不同 <strong>animals</strong> 的词向量投影到该向量上（如下图中的红线），即可通过投影值比较动物在人类认知中的大小。</p>
<p>根据两组反义词 <strong><em>poswords</em></strong> , <strong><em>negwords</em></strong> 构建一个概念(认知)向量, words 中的每个词向量在概念向量中投影，即可得到认知信息。</p>
<p>分值越大，<strong><em>words</em></strong> 越位于 <strong><em>poswords</em></strong> 一侧。</p>
<blockquote>
<p>Grand, G., Blank, I.A., Pereira, F. and Fedorenko, E., 2022. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. <em>Nature Human Behaviour</em>, pp.1-13.</p>
</blockquote>
<p><img loading="lazy" src="img/22-semantic_projection.png" alt=""  />
</p>
<p>例如，人类的语言中，存在尺寸、性别、年龄、政治、速度、财富等不同的概念。每个概念可以由两组反义词确定概念的向量方向。</p>
<p>以尺寸为例，动物在人类认知中可能存在体积尺寸大小差异。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">animals</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;mouse&#39;</span><span class="p">,</span> <span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="s1">&#39;horse&#39;</span><span class="p">,</span>  <span class="s1">&#39;pig&#39;</span><span class="p">,</span> <span class="s1">&#39;whale&#39;</span><span class="p">]</span>
<span class="n">small_words</span><span class="o">=</span> <span class="p">[</span><span class="s2">&#34;small&#34;</span><span class="p">,</span> <span class="s2">&#34;little&#34;</span><span class="p">,</span> <span class="s2">&#34;tiny&#34;</span><span class="p">]</span>
<span class="n">large_words</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;large&#34;</span><span class="p">,</span> <span class="s2">&#34;big&#34;</span><span class="p">,</span> <span class="s2">&#34;huge&#34;</span><span class="p">]</span>

<span class="c1"># wiki_wv = ct.load_w2v(&#39;wiki的word2vec模型文件路径&#39;)</span>
<span class="c1"># wiki_wv</span>

<span class="c1"># In size conception, mouse is smallest, horse is biggest.</span>
<span class="c1"># 在大小概念上，老鼠最小，马是最大的。</span>
<span class="n">ct</span><span class="o">.</span><span class="n">sematic_projection</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wiki_wv</span><span class="p">,</span>
                      <span class="n">words</span><span class="o">=</span><span class="n">animals</span><span class="p">,</span>
                      <span class="n">poswords</span><span class="o">=</span><span class="n">large_words</span><span class="p">,</span>
                      <span class="n">negwords</span><span class="o">=</span><span class="n">small_words</span><span class="p">,</span>
                      <span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;mouse&#39;, -1.68),
 (&#39;cat&#39;, -0.92),
 (&#39;pig&#39;, -0.46),
 (&#39;whale&#39;, -0.24),
 (&#39;horse&#39;, 0.4)]
</code></pre></div><p>关于尺寸的认知，人类文本中隐含着老鼠较小、马较大的信息，语义投影将其恢复了出来。</p>
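<p>语义投影的底层计算可以粗略示意如下。玩具向量仅作演示；这里假设概念轴取两组反义词均值之差（与论文思路一致，但未必与 cntext 实现细节完全相同）：</p>

```python
import numpy as np

# 玩具二维向量，仅作示意
vecs = {
    'small': np.array([0.0, 1.0]), 'tiny': np.array([0.1, 0.9]),
    'large': np.array([1.0, 0.0]), 'huge': np.array([0.9, 0.1]),
    'mouse': np.array([0.2, 0.8]), 'horse': np.array([0.8, 0.2]),
}

# 概念轴 = 正极词(large)均值 - 负极词(small)均值
axis = np.mean([vecs['large'], vecs['huge']], axis=0) \
     - np.mean([vecs['small'], vecs['tiny']], axis=0)

def project(word):
    # 词向量在概念轴方向上的标量投影
    return np.dot(vecs[word], axis) / np.linalg.norm(axis)

print(project('mouse') < project('horse'))  # True: horse 的尺寸投影更大
```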
<br>
<h3 id="55-project_word">5.5 project_word</h3>
<p>在向量空间中，计算词语 a 在词语 b（或概念向量）上的投影。默认返回投影值；
如果 cosine=True，则返回词语 a 与词语 b 的余弦相似度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">project_word</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">cosine</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>wv</strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong>a</strong> 词语 a 字符串或列表</li>
<li><strong>b</strong> 词语字符串、词语列表、或某概念向量</li>
<li><em><strong>cosine</strong></em>: 是否使用余弦相似度， 默认为False，返回a在b上的投影值； True时，返回a与b的余弦相似度。</li>
</ul>
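<p>投影值与余弦相似度的区别，可以用一个极简的 numpy 例子说明：投影值保留了向量 a 的长度信息，余弦相似度只反映方向：</p>

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# a 在 b 方向上的标量投影: 与 a 的长度有关
projection = np.dot(a, b) / np.linalg.norm(b)
# 余弦相似度: 只反映方向, 与长度无关
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(projection)  # 3.0
print(cosine)      # 0.6
```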
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="c1"># dm_w2v = ct.load_w2v(&#39;豆瓣电影语料训练的word2vec模型文件路径&#39;)</span>

<span class="n">b</span><span class="o">=</span><span class="s1">&#39;苗条&#39;</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;性感&#39;</span><span class="p">,</span><span class="s1">&#39;美丽&#39;</span><span class="p">,</span> <span class="s1">&#39;可爱&#39;</span><span class="p">,</span> <span class="s1">&#39;丑陋&#39;</span><span class="p">]:</span>
    <span class="n">proj</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s1">]在[</span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s1">]投影值: </span><span class="si">{</span><span class="n">proj</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>


<span class="n">b</span><span class="o">=</span><span class="s1">&#39;修长&#39;</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;性感&#39;</span><span class="p">,</span><span class="s1">&#39;美丽&#39;</span><span class="p">,</span> <span class="s1">&#39;可爱&#39;</span><span class="p">,</span> <span class="s1">&#39;丑陋&#39;</span><span class="p">]:</span>
    <span class="n">proj</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s1">]在[</span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s1">]投影值: </span><span class="si">{</span><span class="n">proj</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[性感]在[苗条]投影值: 14.172947883605957
[美丽]在[苗条]投影值: 7.0944623947143555
[可爱]在[苗条]投影值: 6.935092926025391
[丑陋]在[苗条]投影值: 1.235807180404663

[性感]在[修长]投影值: 14.599699974060059
[美丽]在[修长]投影值: 9.360642433166504
[可爱]在[修长]投影值: 4.740543842315674
[丑陋]在[修长]投影值: 4.010622501373291
</code></pre></div><p>可以看到，在豆瓣电影语料中，就[苗条、修长]维度的认知而言</p>
<ul>
<li>[性感]意味着身材最瘦长</li>
<li>[美丽]次之、[可爱]略显不那么修长苗条</li>
<li>[丑陋]意味着基本与[苗条、修长]无关，数值最小。</li>
</ul>
<br>
<p>为了让投影值更稳定，可以选择词组，确定[苗条、修长]这个概念的概念轴向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;性感&#39;</span><span class="p">,</span><span class="s1">&#39;美丽&#39;</span><span class="p">,</span> <span class="s1">&#39;可爱&#39;</span><span class="p">,</span> <span class="s1">&#39;丑陋&#39;</span><span class="p">]:</span>
    <span class="n">proj</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_word</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;修长&#39;</span><span class="p">,</span> <span class="s1">&#39;苗条&#39;</span><span class="p">])</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s1">]在[修长，苗条]投影值: </span><span class="si">{</span><span class="n">proj</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[性感]在[修长，苗条]投影值: 15.807487487792969
[美丽]在[修长，苗条]投影值: 9.040315628051758
[可爱]在[修长，苗条]投影值: 6.414511203765869
[丑陋]在[修长，苗条]投影值: 2.882350444793701
</code></pre></div><br>
<h3 id="56-project_text">5.6 project_text()</h3>
<p>在向量空间中，计算文本在概念轴向量上的投影值。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">project_text</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">axis</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">cosine</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>wv</strong>: 语言模型的KeyedVectors</li>
<li><strong>text</strong>: 文本字符串</li>
<li><strong>lang</strong>:  语言,有chinese和english两种; 默认&quot;chinese&quot;</li>
<li><strong>axis</strong>:  概念向量</li>
<li><strong>cosine</strong>: 投影值是否使用余弦相似度， 默认为False，返回text在axis上的投影值； True时，返回text与axis的余弦相似度。</li>
</ul>
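<p>project_text 的核心思路可以示意为：分词后取在词表中的词向量均值得到文本向量，再投影到概念轴上。以下为基于该思路的极简示意（玩具向量，并非 cntext 源码）：</p>

```python
import numpy as np

# 玩具词向量: 第一维可理解为情绪正负方向
vecs = {'幸福': np.array([1.0, 0.2]), '快乐': np.array([0.9, 0.1]),
        '痛苦': np.array([-0.9, 0.1]), '今天': np.array([0.0, 0.5])}
axis = np.array([1.0, 0.0])  # 假想的情绪概念轴

def project_text_sketch(tokens, axis):
    # 文本向量 = 分词后各在词表中的词向量的均值, 再投影到概念轴
    in_vocab = [vecs[t] for t in tokens if t in vecs]
    text_vec = np.mean(in_vocab, axis=0)
    return np.dot(text_vec, axis) / np.linalg.norm(axis)

pos = project_text_sketch(['今天', '幸福', '快乐'], axis)
neg = project_text_sketch(['今天', '痛苦'], axis)
print(pos > 0 > neg)  # True: 正面文本投影为正, 负面为负
```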
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 1. 读取词嵌入模型文件</span>
<span class="n">embeddings_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;cntext2x训练得到的模型文件路径&#39;</span><span class="p">)</span>
<span class="c1">#dm_w2v = ct.load_w2v(&#39;douban-movie-1000w-Word2Vec.200.15.bin&#39;)</span>

<span class="c1"># 2. 定义情绪正负词语，确定情绪概念轴向量sentiment_axis</span>
<span class="n">sentiment_pos</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;快乐&#39;</span><span class="p">,</span> <span class="s1">&#39;幸福&#39;</span><span class="p">,</span> <span class="s1">&#39;喜悦&#39;</span><span class="p">,</span> <span class="s1">&#39;满足&#39;</span><span class="p">,</span> <span class="s1">&#39;欣慰&#39;</span><span class="p">,</span> <span class="s1">&#39;激动&#39;</span><span class="p">,</span> <span class="s1">&#39;兴奋&#39;</span><span class="p">,</span> <span class="s1">&#39;感恩&#39;</span><span class="p">,</span> <span class="s1">&#39;热爱&#39;</span><span class="p">,</span> <span class="s1">&#39;赞美&#39;</span><span class="p">]</span>
<span class="n">sentiment_neg</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;痛苦&#39;</span><span class="p">,</span> <span class="s1">&#39;悲伤&#39;</span><span class="p">,</span> <span class="s1">&#39;难过&#39;</span><span class="p">,</span> <span class="s1">&#39;失望&#39;</span><span class="p">,</span> <span class="s1">&#39;愤怒&#39;</span><span class="p">,</span> <span class="s1">&#39;怨恨&#39;</span><span class="p">,</span> <span class="s1">&#39;绝望&#39;</span><span class="p">,</span> <span class="s1">&#39;恐惧&#39;</span><span class="p">,</span> <span class="s1">&#39;焦虑&#39;</span><span class="p">,</span> <span class="s1">&#39;压抑&#39;</span><span class="p">]</span>
<span class="n">sentiment_axis</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">generate_concept_axis</span><span class="p">(</span><span class="n">wv</span> <span class="o">=</span> <span class="n">embeddings_model</span><span class="p">,</span> 
                                         <span class="n">poswords</span><span class="o">=</span><span class="n">sentiment_pos</span><span class="p">,</span>
                                         <span class="n">negwords</span><span class="o">=</span><span class="n">sentiment_neg</span><span class="p">)</span>
<span class="c1"># 3. 创建实验文本（从正面到负面）</span>
<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">&#34;今天阳光明媚，我和家人一起出游，感到无比幸福和快乐。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;工作有了新进展，得到了领导的表扬，内心充满成就感。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;虽然遇到了小挫折，但我依然保持乐观，相信明天会更好。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;生活平淡，没什么特别的事发生，心情一般。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;最近压力有点大，睡眠不好，感觉有点焦虑和疲惫。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;项目失败了，还被领导批评，心里非常难过和失望。&#34;</span><span class="p">,</span>
    <span class="s2">&#34;亲人离世，我感到极度悲伤和痛苦，世界仿佛失去了颜色。&#34;</span>
<span class="p">]</span>


<span class="c1"># 4. 计算每条文本在情绪轴上的投影</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;文本情绪投影分析（越大越正面）：</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">texts</span><span class="p">):</span>
    <span class="c1"># 使用投影函数（返回在 axis 方向上的投影值）</span>
    <span class="c1">#project_text(wv, text, axis, lang=&#39;chinese&#39;, cosine=False)</span>
    <span class="n">proj_value</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">project_text</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">embeddings_model</span><span class="p">,</span> 
                                 <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> 
                                 <span class="n">axis</span><span class="o">=</span><span class="n">sentiment_axis</span><span class="p">,</span> 
                                 <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
    <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">proj_value</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;[</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s2">] 投影值: </span><span class="si">{</span><span class="n">proj_value</span><span class="si">:</span><span class="s2">+.4f</span><span class="si">}</span><span class="s2"> | </span><span class="si">{</span><span class="n">text</span><span class="si">}</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>

<span class="c1"># 5. 按投影值排序，查看情绪强度排序</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span> <span class="o">+</span> <span class="s2">&#34;=&#34;</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;按情绪正面性排序（从高到低）：&#34;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;=&#34;</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>
<span class="k">for</span> <span class="n">value</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">value</span><span class="si">:</span><span class="s2">+.4f</span><span class="si">}</span><span class="s2"> → </span><span class="si">{</span><span class="n">text</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1] 投影值: +0.8213 | 今天阳光明媚，我和家人一起出游，感到无比幸福和快乐。
[2] 投影值: +0.5641 | 工作有了新进展，得到了领导的表扬，内心充满成就感。
[3] 投影值: +0.1205 | 虽然遇到了小挫折，但我依然保持乐观，相信明天会更好。
[4] 投影值: -0.0321 | 生活平淡，没什么特别的事发生，心情一般。
[5] 投影值: -0.3178 | 最近压力有点大，睡眠不好，感觉有点焦虑和疲惫。
[6] 投影值: -0.6124 | 项目失败了，还被领导批评，心里非常难过和失望。
[7] 投影值: -0.9012 | 亲人离世，我感到极度悲伤和痛苦，世界仿佛失去了颜色。
</code></pre></div><br>
<h3 id="57-wepa">5.7 wepa()</h3>
<p>计算文本在概念轴上的投影得分，返回wepa得分。 WEPA是词嵌入投影算法(Word Embeddings Projection Algorithm)的英文简称。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.wepa(wv, text, poswords, negwords, lang=&#39;chinese&#39;, cosine=False)
</code></pre></div><ul>
<li><strong>wv</strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong>text</strong> 待分析的文本字符串</li>
<li><strong>poswords、negwords</strong> 确定概念轴的两极(端点)对应的词语列表</li>
<li><strong>lang (str)</strong>: 语言，支持&rsquo;chinese&rsquo;或&rsquo;english'，默认为&rsquo;chinese'</li>
<li><strong>cosine (bool)</strong>: 是否使用余弦相似度，默认为False</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span> 

<span class="c1"># 加载已训练好的词嵌入模型(GloVe)</span>
<span class="n">M1</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s2">&#34;output/corpus-GloVe.200.15.bin&#34;</span><span class="p">)</span>

<span class="c1"># 目标具体性正、负向词(部分)</span>

<span class="n">POSs</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;kg&#34;</span><span class="p">,</span> <span class="s2">&#34;公斤&#34;</span><span class="p">,</span> <span class="s2">&#34;体脂率&#34;</span><span class="p">,</span> <span class="s2">&#34;bmi&#34;</span><span class="p">,</span> <span class="s2">&#34;公里&#34;</span><span class="p">,</span> <span class="s2">&#34;km&#34;</span><span class="p">,</span> <span class="s2">&#34;小时&#34;</span><span class="p">,</span> <span class="s2">&#34;分钟&#34;</span><span class="p">,</span><span class="o">...</span><span class="p">]</span>
<span class="n">NEGs</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;差不多&#34;</span><span class="p">,</span> <span class="s2">&#34;大概&#34;</span><span class="p">,</span> <span class="s2">&#34;变瘦&#34;</span><span class="p">,</span> <span class="s2">&#34;变强&#34;</span><span class="p">,</span> <span class="s2">&#34;变好&#34;</span><span class="p">,</span> <span class="s2">&#34;变美&#34;</span><span class="p">,</span> <span class="s2">&#34;变帅&#34;</span><span class="p">,</span> <span class="s2">&#34;进步&#34;</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span>

<span class="c1"># 实验文本</span>
<span class="n">TEXTs</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;完成了3组卧推，每组10次，重量80kg。&#34;</span><span class="p">,</span>
         <span class="s2">&#34;今天跑步5公里，用了30分钟，消耗了400卡路里。&#34;</span><span class="p">,</span>
         <span class="s2">&#34;差不多就行了，今天有点累。&#34;</span><span class="p">,</span>
         <span class="s2">&#34;希望自己能变得更强。&#34;</span><span class="p">]</span>

<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;目标具体性得分: </span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">TEXTs</span><span class="p">:</span>
    <span class="n">proj_score</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">wepa</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">M1</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> <span class="n">poswords</span><span class="o">=</span><span class="n">POSs</span><span class="p">,</span> <span class="n">negwords</span><span class="o">=</span><span class="n">NEGs</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">proj_score</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">目标具体性得分:
1.22 完成了3组卧推，每组10次，重量80kg。
0.96 今天跑步5公里，用了30分钟，消耗了40...
-0.17 差不多就行了，今天有点累。
-1.42 希望自己能变得更强。
</code></pre></div><p>输出结果清晰地表明了文本在目标具体性维度上的语义差异：包含具体量化指标的文本（如 80kg、5 公里）获得高分，而表达模糊意图的文本（如差不多、变强）则得分较低。</p>
<br>
<h3 id="58-divergent_association_task">5.8 divergent_association_task()</h3>
<p><a href="https://textdata.cn/blog/2022-11-14-pnas_naming_unrelated_words_predicts_creativity/">PNAS | 使用语义距离测量一个人的创新力(发散思维)得分</a>。一些理论认为，有创造力的人能够产生更多发散性的想法。如果这一理论成立，只需让被试写出 N 个互不相关的单词，再测量这 N 个词之间的语义距离，即可作为发散思维的客观衡量指标。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.divergent_association_task(wv, words)
</code></pre></div><ul>
<li><strong>wv</strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong>words</strong>词语列表</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">low_words = [&#34;arm&#34;, &#34;eyes&#34;, &#34;feet&#34;, &#34;hand&#34;, &#34;head&#34;, &#34;leg&#34;, &#34;body&#34;]
average_words = [&#34;bag&#34;, &#34;bee&#34;, &#34;burger&#34;, &#34;feast&#34;, &#34;office&#34;, &#34;shoes&#34;, &#34;tree&#34;]
high_words = [&#34;hippo&#34;, &#34;jumper&#34;, &#34;machinery&#34;, &#34;prickle&#34;, &#34;tickets&#34;, &#34;tomato&#34;, &#34;violin&#34;]

# 导入模型，得到wv。
# wv = ct.load_w2v(&#39;wiki的word2vec模型文件路径&#39;)


print(ct.divergent_association_task(wv, low_words)) # 50
print(ct.divergent_association_task(wv, average_words)) # 78
print(ct.divergent_association_task(wv, high_words)) # 95
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">50
78
95
</code></pre></div><br>
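<p>参照 PNAS 原论文的思路，DAT 得分的核心是 N 个词两两余弦距离的平均值再乘以 100。可用 numpy 示意（玩具向量，仅说明计算逻辑，非 cntext 实现）：</p>

```python
import numpy as np
from itertools import combinations

def dat_score_sketch(word_vecs):
    # N 个词两两余弦距离的平均值 × 100
    dists = []
    for u, v in combinations(word_vecs, 2):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        dists.append(1 - cos)
    return 100 * np.mean(dists)

# 玩具示例: 语义相近的一组词 vs 语义分散的一组词
close = [np.array([1.0, 0.0]), np.array([0.95, 0.1]), np.array([0.9, 0.2])]
spread = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.5])]
print(dat_score_sketch(close) < dat_score_sketch(spread))  # True
```

<p>词语间语义越分散，得分越高，对应更强的发散思维。</p>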
<h3 id="59-discursive_diversity_score">5.9 discursive_diversity_score()</h3>
<p><a href="https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/">MS2022 | 使用语言差异性测量团队认知差异性</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.discursive_diversity_score(wv, words)
</code></pre></div><ul>
<li><strong><em>wv</em></strong> 模型数据， 数据类型为 gensim.models.keyedvectors.KeyedVectors。</li>
<li><strong><em>words</em></strong> 词语列表</li>
<li>返回一个数值</li>
</ul>
<p><img loading="lazy" src="img/23-low-and-high-examples-of-discursive-diversity.jpeg" alt=""  />
</p>
<p>高绩效团队具备根据任务要求的变化调节共享认知的集体能力：在进行构思任务时，它们表现出更高的话语多样性；在执行协调任务时，则表现出较低的话语多样性。</p>
<br>
<h3 id="510-procrustes_align">5.10 procrustes_align()</h3>
<p>该函数主要用于反映同一研究对象随着时间推进的社会文化变迁，或者同一时间范围内两个被研究主体间的差异。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.procrustes_align(base_wv, other_wv, words=None)
</code></pre></div><ul>
<li>base_wv (gensim.models.keyedvectors.KeyedVectors): 基准语言模型</li>
<li>other_wv (gensim.models.keyedvectors.KeyedVectors): 其他语言模型</li>
<li>words (list, optional): 是否根据词典 words 对模型进行对齐， 对齐结束后的模型中含有的词不会超出 words 的范围； 默认 None.</li>
</ul>
<p>由于不同语料训练的 Word2Vec 模型无法直接比较， 需要先选定一个基准模型 <strong><em>base_embed</em></strong>， 之后根据 <strong><em>base_embed</em></strong> 对其他模型 <strong><em>other_embed</em></strong> 进行调整，调整后的模型就可以使用前面的语义距离函数或者语义投影函数。 这一过程用到的算法叫做 procrustes 正交算法。</p>
<p>这里推荐一篇 <a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a></p>
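<p>procrustes 正交对齐的核心是求一个正交矩阵 R，使 other 模型的向量经 R 旋转后在最小二乘意义下最接近 base 模型。可用 numpy 的 SVD 做一个自洽的小示意（随机数据，非真实词向量，也非 cntext 源码）：</p>

```python
import numpy as np

def procrustes_sketch(base, other):
    # 求正交矩阵 R, 使 other @ R 在最小二乘意义下最接近 base
    # base/other: 形状 (n_words, dim), 各行对应同一批共有词
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # 随机正交矩阵
other = base @ q                              # other 相当于 base 被整体旋转

aligned = procrustes_sketch(base, other)
print(np.allclose(aligned, base))  # True: 对齐后恢复 base
```

<p>对齐后的 other 模型即可与 base 模型直接比较语义距离或语义投影。</p>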
<p><br><br></p>
<h2 id="六llm-模块">六、LLM 模块</h2>
<p>目前大模型的本地化部署与调用越来越方便，cntext 的 LLM 模块对此做了封装，便于执行结构化文本分析任务。</p>
<table>
<thead>
<tr>
<th>模块</th>
<th>函数(类)</th>
<th>功能</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>LLM</em></strong></td>
<td><strong>ct.llm(text, prompt, output_format, task, backend, base_url, api_key, model_name, temperature)</strong></td>
<td>调用大模型执行结构化文本分析任务（如情感分析、关键词提取、分类等）。</td>
</tr>
</tbody>
</table>
<h3 id="61-ctllm">6.1 ct.llm()</h3>
<p>使用大模型（本地或 API）进行文本分析，从非结构化的文本数据中识别模式、提取关键信息、理解语义，并将其转化为结构化数据以便进一步分析和应用。</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">output_format</span><span class="p">,</span> <span class="n">task</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="n">base_url</span><span class="p">,</span> <span class="n">api_key</span><span class="p">,</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">temperature</span><span class="p">)</span>
</code></pre></div><ul>
<li><strong>text</strong>: 待分析的文本内容</li>
<li><strong>task</strong>: 预设任务名称，默认为 &lsquo;sentiment&rsquo;。</li>
<li><strong>prompt</strong>: 自定义系统提示语</li>
<li><strong>output_format</strong>: 自定义输出结构，如 {&lsquo;label&rsquo;: str, &lsquo;score&rsquo;: float}</li>
<li><strong>backend</strong>: 快捷后端别名：
- &lsquo;ollama&rsquo; → http://127.0.0.1:11434/v1
- &lsquo;lmstudio&rsquo; 或 &lsquo;lms&rsquo; → http://localhost:1234/v1
- None → 需配合 base_url 使用</li>
<li><strong>base_url</strong>: 自定义模型服务地址，优先级高于 backend
示例：
- 远程：https://dashscope.aliyuncs.com/compatible-mode/v1
- 内网：http://192.168.1.10:11434/v1
- 本地：http://localhost:1234/v1</li>
<li><strong>api_key</strong>: API 密钥，远程服务必填，本地通常为 &ldquo;EMPTY&rdquo;</li>
<li><strong>model_name</strong>: 模型名称（需服务端已加载）</li>
<li><strong>temperature</strong>: 生成温度，0 表示确定性输出</li>
</ul>
<br>
<p><strong>实验数据为外卖评论， 今天咱们做个有难度的文本分析任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)</strong>。</p>
<p><img loading="lazy" src="img/28-llm-analysis.png" alt=""  />
<br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s1">&#39;https://dashscope.aliyuncs.com/compatible-mode/v1&#39;</span>
<span class="n">API_KEY</span> <span class="o">=</span> <span class="s1">&#39;你的API-KEY&#39;</span>
<span class="n">MODEL_NAME</span> <span class="o">=</span> <span class="s1">&#39;qwen-max&#39;</span>

<span class="c1">#味道、速度、服务</span>
<span class="n">OUTPUT_FORMAT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>

<span class="n">COMMENT_CONTENT</span> <span class="o">=</span> <span class="s1">&#39;太难吃了&#39;</span>

<span class="c1"># 调用大模型，返回结构化打分结果</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">COMMENT_CONTENT</span><span class="p">,</span>
                <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span><span class="p">,</span>
                <span class="n">base_url</span><span class="o">=</span><span class="n">BASE_URL</span><span class="p">,</span>
                <span class="n">api_key</span><span class="o">=</span><span class="n">API_KEY</span><span class="p">,</span>
                <span class="n">model_name</span><span class="o">=</span><span class="n">MODEL_NAME</span><span class="p">,</span>
                <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
                <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT_FORMAT</span><span class="p">)</span>

<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;taste&#39;: -1.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 0.0}
</code></pre></div><br>
<p>Batch processing</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>


<span class="c1"># Build the sample data</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;速度非常快，口味非常好， 服务非常棒！&#39;</span><span class="p">,</span>
        <span class="s1">&#39;送餐时间还是比较久&#39;</span><span class="p">,</span>
        <span class="s1">&#39;送单很快，菜也不错赞&#39;</span><span class="p">,</span>
        <span class="s1">&#39;太难吃了&#39;</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;comment&#39;</span><span class="p">])</span>


<span class="c1"># Scoring function</span>
<span class="k">def</span> <span class="nf">llm_analysis</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
                    <span class="n">prompt</span><span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span><span class="p">,</span>
                    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;https://dashscope.aliyuncs.com/compatible-mode/v1&#39;</span><span class="p">,</span>
                    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;你的API-KEY&#39;</span><span class="p">,</span>
                    <span class="n">model_name</span><span class="o">=</span><span class="s1">&#39;qwen-max&#39;</span><span class="p">,</span>
                    <span class="n">output_format</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>
                               <span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>


<span class="c1"># Batch processing</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;comment&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">llm_analysis</span><span class="p">)</span>
<span class="n">res_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">,</span> <span class="n">df2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Save the results</span>
<span class="n">res_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">res_df</span>
</code></pre></div><p><img loading="lazy" src="img/28-llm-analysis.png" alt=""  />
</p>
<br>
<p>For more on the LLM features, see <a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>Tutorial | Using online LLM APIs to turn text data into structured data</strong></a></p>
<br>
<h3 id="62-内置prompt">6.2 Built-in prompts</h3>
<p>cntext ships with a set of ready-made prompt templates for common tasks; list them with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;sentiment&#39;,
 &#39;emotion&#39;,
 &#39;classify&#39;,
 &#39;intent&#39;,
 &#39;keywords&#39;,
 &#39;entities&#39;,
 &#39;summarize&#39;,
 &#39;rewrite&#39;,
 &#39;quality&#39;,
 &#39;similarity&#39;]
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Get the sentiment template</span>
<span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_get</span><span class="p">(</span><span class="s1">&#39;sentiment&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;prompt&#39;: &#39;分析评论的情感倾向：返回情感类别 label（pos 表示正面，neg 表示负面，neutral 表示中性）和情感分值 score（取值范围 -1~1，负数为负面）&#39;,
 &#39;output_format&#39;: {&#39;label&#39;: &#39;str&#39;, &#39;score&#39;: &#39;float&#39;}}
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Use the built-in sentiment prompt template,</span>
<span class="c1"># with a local Ollama server and the qwen2.5:7b model</span>
<span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="s2">&#34;服务很棒！&#34;</span><span class="p">,</span> <span class="n">task</span><span class="o">=</span><span class="s2">&#34;sentiment&#34;</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s2">&#34;ollama&#34;</span><span class="p">,</span>  <span class="n">model_name</span><span class="o">=</span><span class="s2">&#34;qwen2.5:7b&#34;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[cntext2x] ✅ 连接模型服务: http://127.0.0.1:11434/v1
{&#39;label&#39;: &#39;pos&#39;, &#39;score&#39;: 0.8}
</code></pre></div><p><br><br></p>
<h2 id="使用声明">Usage Statement</h2>
<p>If you use <strong>cntext</strong> in your research or project, please briefly introduce cntext and cite it as follows.</p>
<h3 id="apalike">apalike</h3>
<p>Deng, X., &amp; Nan, P. (2022). <strong>cntext: a Python tool for text mining</strong> [Computer software]. Zenodo. <a href="https://doi.org/10.5281/zenodo.7063523">https://doi.org/10.5281/zenodo.7063523</a></p>
<p>Source Code URL: <a href="https://github.com/hiDaDeng/cntext">https://github.com/hiDaDeng/cntext</a></p>
<br>
<h3 id="bibtex">bibtex</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">@misc{deng2022cntext,
  author       = {Deng, X. and Nan, P.},
  title        = {cntext: a Python tool for text mining},
  year         = {2022},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7063523},
  url          = {https://doi.org/10.5281/zenodo.7063523},
  howpublished = {[Computer software]},
  note         = {Source Code URL: \url{https://github.com/hiDaDeng/cntext}}
}
</code></pre></div><br>
<h3 id="endnote">endnote</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">%0 Generic
%A Deng, X.
%A Nan, P.
%T cntext: a Python tool for text mining
%Y [Computer software]
%D 2022
%I Zenodo
%R 10.5281/zenodo.7063523
%U https://doi.org/10.5281/zenodo.7063523
%Z Source Code URL: https://github.com/hiDaDeng/cntext
%@
</code></pre></div>]]></content:encoded>
    </item>
    
    <item>
<title>Dataset | 666 Million US Google Maps Reviews (through 2021-09)</title>
      <link>https://textdata.cn/blog/2025-03-14-google-map-review-dataset/</link>
      <pubDate>Fri, 14 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-14-google-map-review-dataset/</guid>
<description>This dataset contains review information (ratings, text, pictures, etc.) from Google Maps in the United States up to September 2021, along with business metadata (address, geographic information, description, categories, price, opening hours, and more) and links between related businesses.</description>
<content:encoded><![CDATA[<h2 id="一数据集概况">1. Dataset Overview</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset name: US Google Maps review dataset
Source:       Google Maps
Date range:   2008-01-25 ~ 2021-09
Reviews:      666,324,103 (666 million)
Users:        113,643,107 (114 million)
Businesses:   4,963,111
Fields:       rating, text, pictures, address, geo info, price, opening hours, etc.
Format:       line-delimited JSON files
Download:
    - https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/
    - https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/
Note:         research use only; if a share link is broken, contact WeChat 372335839 and note your name-school-major
</code></pre></div><p><img loading="lazy" src="img/01-google-map-review.png" alt=""  />
</p>
<h3 id="11-引用数据">1.1 Citing the data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Li, Jiacheng, Jingbo Shang, and Julian McAuley. &#34;UCTopic: Unsupervised contrastive learning for phrase representations and topic mining.&#34; arXiv preprint arXiv:2202.13469 (2022).
[2]Yan, An, Zhankui He, Jiacheng Li, Tianyang Zhang, and Julian McAuley. &#34;Personalized showcases: Generating multi-modal explanations for recommendations.&#34; In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251-2255. 2023.
</code></pre></div><p><img loading="lazy" src="img/02-opendataset.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="12-下载数据">1.2 Downloading the data</h3>
<p><img loading="lazy" src="img/03-new-york.png" alt=""  />
</p>
<p>Select <em><strong>Complete review data</strong></em> -&gt; <em><strong>New York</strong></em>, then click <em><strong>review</strong></em> and <em><strong>metadata</strong></em> to download <em><strong>review-New_York.json.gz</strong></em> and <em><strong>meta-New_York.json.gz</strong></em>.</p>
<p>The sections below walk through <em><strong>meta-New_York.json.gz</strong></em> and <em><strong>review-New_York.json.gz</strong></em> in turn.</p>
<p><br><br></p>
<h2 id="二meta-new_yorkjsongz">2. meta-New_York.json.gz</h2>
<p>A <em><strong>.json.gz</strong></em> file is a gzip-compressed JSON file; double-clicking it opens it in your system&rsquo;s archive tool.</p>
<p>Inside, each record is one line of JSON (line-delimited JSON). The examples below show what the data looks like.</p>
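Records in this format can be previewed with only the Python standard library, no third-party reader required. A minimal sketch; the helper name <em><strong>head_json_gz</strong></em> is our illustration, not part of the dataset tooling:

```python
import gzip
import json

def head_json_gz(path, nrows):
    """Preview the first `nrows` records of a gzip-compressed,
    line-delimited JSON file without decompressing it fully."""
    records = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= nrows:
                break
            records.append(json.loads(line))
    return records

# e.g. head_json_gz('meta-New_York.json.gz', nrows=1)
```

Because the file is read line by line through the gzip stream, this works even on files far larger than memory.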
<h3 id="21-使用jsonlines读取">2.1 Reading with jsonlines</h3>
<p>Read the first record with jsonlines (the helper <em><strong>read_json_nlines</strong></em> is defined in section 3.1 below):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Read the first record of meta-New_York.json.gz</span>
<span class="n">first_n_json_objects</span> <span class="o">=</span> <span class="n">read_json_nlines</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;meta-New_York.json.gz&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">first_n_json_objects</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[{&#39;name&#39;: &#39;A-Top Insurance&#39;,
  &#39;address&#39;: &#39;A-Top Insurance, 1009 Brighton Beach Ave, Brooklyn, NY 11235&#39;,
  &#39;gmap_id&#39;: &#39;0x89c24469c758686b:0x641f5b84cb9bedfa&#39;,
  &#39;description&#39;: None,
  &#39;latitude&#39;: 40.5782537,
  &#39;longitude&#39;: -73.9591269,
  &#39;category&#39;: [&#39;Insurance broker&#39;, &#39;Insurance agency&#39;],
  &#39;avg_rating&#39;: 2,
  &#39;num_of_reviews&#39;: 4,
  &#39;price&#39;: None,
  &#39;hours&#39;: [[&#39;Thursday&#39;, &#39;10AM–6PM&#39;],
   [&#39;Friday&#39;, &#39;10AM–6PM&#39;],
   [&#39;Saturday&#39;, &#39;Closed&#39;],
   [&#39;Sunday&#39;, &#39;Closed&#39;],
   [&#39;Monday&#39;, &#39;10AM–6PM&#39;],
   [&#39;Tuesday&#39;, &#39;10AM–6PM&#39;],
   [&#39;Wednesday&#39;, &#39;10AM–6PM&#39;]],
  &#39;MISC&#39;: None,
  &#39;state&#39;: &#39;Open ⋅ Closes 6PM&#39;,
  &#39;relative_results&#39;: [&#39;0x89c24449907718fb:0x31b554a0983f621d&#39;,
   &#39;0x4065f38ac8af66fd:0x991c223e83658501&#39;,
   &#39;0x89c24450d471cd67:0x169d824916e1d42&#39;,
   &#39;0x89c25bae39b5a677:0x5f129fa988693a25&#39;,
   &#39;0x89c245029897765d:0xa04072e1ef9ab823&#39;],
  &#39;url&#39;: &#39;https://www.google.com/maps/place//data=!4m2!3m1!1s0x89c24469c758686b:0x641f5b84cb9bedfa?authuser=-1&amp;hl=en&amp;gl=us&#39;}]
</code></pre></div><br>
<h3 id="22-使用pandas读取">2.2 Reading with pandas</h3>
<p>Read all of <em><strong>meta-New_York.json.gz</strong></em> with pandas:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># Read only the first 2 rows</span>
<span class="c1">#mdf = pd.read_json(&#39;meta-New_York.json.gz&#39;, nrows=2, compression=&#39;gzip&#39;, lines=True)</span>

<span class="c1"># Read everything</span>
<span class="n">mdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;meta-New_York.json.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">mdf</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 2.65 s, sys: 2.71 s, total: 5.36 s
Wall time: 2.55 s
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="23-字段含义">2.3 Field descriptions</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">name             business name
address          address
gmap_id          business ID
description      business description
latitude         latitude
longitude        longitude
category         business categories
avg_rating       average rating of the business
num_of_reviews   number of reviews
price            price level
hours            opening hours
MISC             miscellaneous information
state            current status (e.g. permanently closed)
relative_results related businesses recommended by Google
url              URL of the business
</code></pre></div><br>
<h3 id="24-字段缺失程度">2.4 Missing values</h3>
<p>The per-field missing-value pattern:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">mdf</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/07-meta-miss.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三review-new_yorkjsongz">3. review-New_York.json.gz</h2>
<h3 id="31-使用jsonlines读取">3.1 Reading with jsonlines</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">gzip</span>
<span class="kn">import</span> <span class="nn">jsonlines</span>


<span class="k">def</span> <span class="nf">read_json_nlines</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">nrows</span><span class="p">):</span>
    <span class="n">json_objects</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">with</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s1">&#39;rt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">gz_file</span><span class="p">:</span>
        <span class="n">reader</span> <span class="o">=</span> <span class="n">jsonlines</span><span class="o">.</span><span class="n">Reader</span><span class="p">(</span><span class="n">gz_file</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nrows</span><span class="p">):</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">json_objects</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">reader</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
            <span class="k">except</span> <span class="ne">EOFError</span><span class="p">:</span>
                <span class="k">break</span> <span class="c1"># stop when the end of the file is reached</span>
    <span class="k">return</span> <span class="n">json_objects</span>

<span class="c1"># Read the first 2 records of review-New_York.json.gz</span>
<span class="n">first_n_json_objects</span> <span class="o">=</span> <span class="n">read_json_nlines</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;review-New_York.json.gz&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">first_n_json_objects</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[
  {&#39;user_id&#39;: &#39;101855823232666695168&#39;,
  &#39;name&#39;: &#39;R Mac&#39;,
  &#39;time&#39;: 1629141186463,
  &#39;rating&#39;: 1,
  &#39;text&#39;: &#39;Natalia may be the worst agent I have ever dealt with. Look up the definition of entitled b**** in the dictionary and there she would be. Look at their reviews...they are garbage through and through.&#39;,
  &#39;pics&#39;: None,
  &#39;resp&#39;: None,
  &#39;gmap_id&#39;: &#39;0x89c24469c758686b:0x641f5b84cb9bedfa&#39;},

 {&#39;user_id&#39;: &#39;105821946869087882225&#39;,
  &#39;name&#39;: &#39;Beck TJ&#39;,
  &#39;time&#39;: 1528477593994,
  &#39;rating&#39;: 1,
  &#39;text&#39;: &#39;The lady at the front desk is rude. The bathroom key is supposed to be available for anyone who is visiting the business and any question must be answered in a nice manner.&#39;,
  &#39;pics&#39;: None,
  &#39;resp&#39;: None,
  &#39;gmap_id&#39;: &#39;0x89c24469c758686b:0x641f5b84cb9bedfa&#39;}
  
  ]
</code></pre></div><br>
<h3 id="32-使用pandas读取">3.2 Reading with pandas</h3>
<p>Read all of <em><strong>review-New_York.json.gz</strong></em> with pandas:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># Read only the first 2 rows</span>
<span class="c1">#rdf = pd.read_json(&#39;review-New_York.json.gz&#39;, nrows=2, compression=&#39;gzip&#39;, lines=True)</span>
<span class="c1"># Read everything</span>
<span class="n">rdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;review-New_York.json.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">rdf</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 1min 30s, sys: 2min 2s, total: 3min 32s
Wall time: 4min 11s
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<br>
<h3 id="33-字段含义">3.3 Field descriptions</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">user_id   reviewer ID
name      reviewer name
time      review time (Unix timestamp in milliseconds)
rating    rating given to the business
text      review text
pics      links to review pictures
resp      business reply to the review, including a Unix timestamp and the reply text
gmap_id   business ID
</code></pre></div><br>
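Note that <em><strong>time</strong></em> is in milliseconds, not seconds (the 13-digit values in the sample above, e.g. 1629141186463). A small sketch of converting it to datetimes with pandas, using the two timestamps from the records shown earlier:

```python
import pandas as pd

# `time` holds milliseconds since the Unix epoch (13-digit values)
sample = pd.DataFrame({'time': [1629141186463, 1528477593994]})

# unit='ms' tells pandas the values are milliseconds
sample['when'] = pd.to_datetime(sample['time'], unit='ms')
years = sample['when'].dt.year.tolist()  # → [2021, 2018]
```

Passing `unit='s'` by mistake would place every review thousands of years in the future, so it is worth checking the digit count first.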
<h3 id="34-字段缺失程度">3.4 Missing values</h3>
<p>The per-field missing-value pattern:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">rdf</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/08-review-miss.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="相关内容">Related reading</h2>
<p>See <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>Code | Handling huge CSV files with pandas</strong></a> to learn how to process data files far larger than memory.</p>
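For the review file specifically, <em><strong>pd.read_json</strong></em> can also stream: passing <em><strong>chunksize</strong></em> together with <em><strong>lines=True</strong></em> returns an iterator of DataFrames instead of one huge frame. A minimal sketch; the helper <em><strong>stream_mean_rating</strong></em> and the aggregation are our illustration, not from the original post:

```python
import pandas as pd

def stream_mean_rating(path, chunksize=1_000_000):
    """Compute the mean `rating` of a (possibly gzipped) line-delimited
    JSON file by streaming it in chunks, never loading it all at once."""
    total, rating_sum = 0, 0.0
    # compression is inferred from the .gz extension
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            total += len(chunk)
            rating_sum += chunk['rating'].sum()
    return rating_sum / total

# e.g. stream_mean_rating('review-New_York.json.gz')
```

Peak memory is bounded by one chunk rather than the whole file, at the cost of only being able to compute aggregations that can be accumulated chunk by chunk.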
<br>
]]></content:encoded>
    </item>
    
    <item>
<title>Dataset | arXiv Metadata for 2.69 Million Scholarly Papers (2007 ~ 2025)</title>
      <link>https://textdata.cn/blog/2025-03-21-the-arxiv-metadata-dataset-of-millions-of-scholarly-papers/</link>
      <pubDate>Fri, 14 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-21-the-arxiv-metadata-dataset-of-millions-of-scholarly-papers/</guid>
<description>In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.</description>
<content:encoded><![CDATA[<p>For nearly 30 years, arXiv has provided the public and the research community with open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science, as well as mathematics, statistics, electrical engineering, quantitative biology, economics, and more. This rich repository offers tremendous depth, but can at times feel overwhelming.</p>
<p>In these times of unique global challenges, efficiently extracting insights from data is essential. To make arXiv more accessible, Cornell University hosts the arXiv metadata dataset on Kaggle for anyone to download. It currently contains metadata for roughly 2.69 million papers: titles, authors, categories, abstracts, links to full-text PDFs, and more.</p>
<br>
<h2 id="一数据集概况">1. Dataset Overview</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset name: arXiv paper metadata
Source:       https://arxiv.org/
Date range:   submissions 1986-04-25 ~ 2025-03-13 (updated weekly)
Papers:       2,689,088 (as of 2025-03-14)
Fields:       title, authors, abstract, journal info, DOI, etc.
Format:       line-delimited JSON
Size:         4.58 GB
Download:     https://www.kaggle.com/datasets/Cornell-University/arxiv
Note:         research use only; if the link is broken, contact WeChat 372335839 and note your name-school-major
</code></pre></div><p><br><br></p>
<h2 id="二查看数据">2. Exploring the Data</h2>
<h3 id="21-读取数据">2.1 Loading the data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;arxiv-metadata-oai-snapshot.json&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Number of papers: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Number of papers:  2689088
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-提交日期">2.2 Submission dates</h3>
<p>The <em><strong>versions</strong></em> field contains each paper&rsquo;s submission history. The code below extracts the first submission date into a new <em><strong>created</strong></em> column.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;versions&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">text</span><span class="p">:</span> <span class="n">text</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s1">&#39;created&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Submission dates: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Submission dates:  1986-04-25 ~ 2025-03-13
</code></pre></div><br>
<h3 id="23-所含字段">2.3 Fields</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - id               arXiv ID (can be used to access the paper; see below)
 - submitter        submitter of the paper
 - authors          authors of the paper
 - title            title
 - comments         additional info, e.g. number of pages and figures
 - journal-ref      journal the paper was published in
 - doi              DOI (Digital Object Identifier)
 - report-no        report number (assigned by the authors&#39; institution before submission to arXiv)
 - categories       arXiv category tags
 - license          license the paper is released under
 - abstract         abstract
 - versions         version history
 - update_date      date the record was last updated
 - authors_parsed   parsed author list
</code></pre></div><br>
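The <em><strong>categories</strong></em> field is a space-separated string of arXiv tags, so counting papers per category takes one split-and-explode. A sketch with made-up sample tags:

```python
import pandas as pd

# Illustrative sample: `categories` is a space-separated string of arXiv tags
df = pd.DataFrame({'categories': ['hep-ph', 'math.CO cs.CG', 'cs.CL cs.LG']})

# One row per (paper, tag) pair, then tally papers per category
counts = df['categories'].str.split().explode().value_counts()
```

The same one-liner on the full dataset gives the submission volume of every arXiv category.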
<p>Missing values per field:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/02-vis.png" alt=""  />
</p>
<br>
<h3 id="24-作者">2.4 Authors</h3>
<p>Use the <em><strong>authors_parsed</strong></em> field to count the number of authors per paper:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;authors_parsed&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">ap</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">ap</span><span class="p">))</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0          4
1          2
2          1
3          1
4          2
          ..
2689083    7
2689084    4
2689085    3
2689086    1
2689087    3
Name: N_authors, Length: 2689088, dtype: int64
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">data</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三可视化">3. Visualization</h2>
<h3 id="31-论文年度提交量">3.1 Annual submissions</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">created</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;created&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;arXiv论文年度提交量趋势图(1986-2025.3)&#39;</span><span class="p">,</span>
          <span class="n">x</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span>
          <span class="n">y</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">annotate</span><span class="p">(</span>
        <span class="s1">&#39;text&#39;</span><span class="p">,</span>
        <span class="n">x</span><span class="o">=</span> <span class="mi">1986</span><span class="p">,</span>
        <span class="n">y</span><span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">*</span> <span class="mf">1.05</span><span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s1">&#39;公众号: 大邓和他的Python&#39;</span><span class="p">,</span>
        <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span>
        <span class="n">va</span><span class="o">=</span><span class="s1">&#39;top&#39;</span><span class="p">,</span>
        <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1986</span><span class="p">,</span> <span class="mi">2026</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
           <span class="n">text</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">),</span>
           <span class="n">axis_text_x</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-vis.png" alt=""  />
</p>
<br>
<h3 id="32-作者数量分布图">3.2 作者数量分布图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;N_authors&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;N_authors&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;arXiv每篇论文作者数量分布图&#39;</span><span class="p">,</span>
          <span class="n">x</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span>
          <span class="n">y</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">annotate</span><span class="p">(</span>
        <span class="s1">&#39;text&#39;</span><span class="p">,</span>
        <span class="n">x</span><span class="o">=</span> <span class="mi">16</span><span class="p">,</span>
        <span class="n">y</span><span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s1">&#39;公众号: 大邓和他的Python&#39;</span><span class="p">,</span>
        <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span>
        <span class="n">va</span><span class="o">=</span><span class="s1">&#39;top&#39;</span><span class="p">,</span>
        <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">21</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
           <span class="n">text</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">),</span>
           <span class="n">axis_text_x</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-author.png" alt=""  />
</p>
<br>
<h3 id="33-前10大研究热点">3.3 前10大研究热点</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;categories&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">data</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="s1">&#39;categories&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">&#39;categories&#39;</span><span class="p">],</span> <span class="n">categories</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s1">&#39;categories&#39;</span><span class="p">],</span> <span class="n">ordered</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;categories&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>  <span class="c1"># 注意这里x和y的位置</span>
    <span class="o">+</span> <span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;arXiv前10大研究热点(类别)分布图&#39;</span><span class="p">,</span>
           <span class="n">x</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span>
           <span class="n">y</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">coord_flip</span><span class="p">()</span>  <span class="c1"># 翻转坐标轴以创建水平条形图</span>
    <span class="o">+</span> <span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
            <span class="n">text</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span><span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
            <span class="n">plot_title</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">),</span>
            <span class="n">axis_text_x</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-hot.png" alt=""  />
</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | Glassdoor网站 990w 条英国公司(职位)评论数据(2008~2023.7)</title>
      <link>https://textdata.cn/blog/2025-03-14-uk-glassdoor-review-dataset/</link>
      <pubDate>Fri, 14 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-14-uk-glassdoor-review-dataset/</guid>
      <description>Glassdoor 成立于2007年，总部位于美国加利福尼亚州的 Mill Valley。 Glassdoor允许员工匿名发布对公司、工作环境、薪资等方面的评价，同时也提供了职位搜索、公司评分、面试经验分享等功能，为求职者和在职员工提供参考。尽管Glassdoor起源于美国，但它已经扩展到包括英国在内的多个国家和地区，为全球用户提供服务。这意味着用户可以在Glassdoor上查找来自世界各地的公司信息和职位空缺，包括但不限于：公司评论和评分、薪资报告、面试问题和经验、职位招聘信息因此，虽然Glassdoor可以在英国使用，并且对英国的职场人士非常有用，但它并不是一个仅限于英国或由英国运营的网站。它是一个跨国平台，旨在为全球用户提供有关职场和招聘过程中的透明信息。</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名: 英国公司(职位)评论数据
数据来源: Glassdoor
覆盖日期: 2008-01-25 ~ 2023-07-26
评价数量: 9901889 条
公司数量: 35541家
下载数据: https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews-2/data
本文声明: 科研用途； 如分享有问题，可加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/00-glassdoor.png" alt=""  />
</p>
<br>
<h3 id="11-网站介绍">1.1 网站介绍</h3>
<p>Glassdoor 成立于2007年，总部位于美国加利福尼亚州的 Mill Valley。 Glassdoor允许员工匿名发布对公司、工作环境、薪资等方面的评价，同时也提供了职位搜索、公司评分、面试经验分享等功能，为求职者和在职员工提供参考。</p>
<p>尽管Glassdoor起源于美国，但它已经扩展到包括英国在内的多个国家和地区，为全球用户提供服务。这意味着用户可以在Glassdoor上查找来自世界各地的公司信息和职位空缺，包括但不限于：</p>
<ul>
<li>公司评论和评分</li>
<li>薪资报告</li>
<li>面试问题和经验</li>
<li>职位招聘信息</li>
</ul>
<p>因此，虽然Glassdoor可以在英国使用，并且对英国的职场人士非常有用，但它并不是一个仅限于英国或由英国运营的网站。它是一个跨国平台，旨在为全球用户提供有关职场和招聘过程中的透明信息。</p>
<br>
<h3 id="12-字段">1.2 字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">-  rating                    总体评分(1~5分)
-  title                     评论的标题
-  status                    员工状态(在职、离职，以及在该公司的工作时长)
-  pros                      公司的优点
-  cons                      公司的缺点
-  advice                    建议
-  Recommend                 推荐程度(v正面，r轻微，x负面，o无意见)
-  CEO Approval              对CEO的认可程度(v正面，r轻微，x负面，o无意见)
-  Business Outlook          公司(业务)前景(v正面，r轻微，x负面，o无意见)
-  Career Opportunities      职业发展机会评分(1~5分) 
-  Compensation and Benefits 薪酬与福利评分(1~5分) 
-  Senior Management         高级管理层评分(1~5分) 
-  Work/Life Balance         工作与生活平衡评分(1~5分) 
-  Culture &amp; Values          文化&amp;价值观评分(1~5分) 
-  Diversity &amp; Inclusion     多样性&amp;包容性评分(1~5分) 
-  firm_link                 公司链接
-  date                      评论发布日期
-  job                       职位
</code></pre></div><p><img loading="lazy" src="img/01-glassdoor.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-glassdoor-pros-cons.jpg" alt=""  />
</p>
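<p>上表中 Recommend、CEO Approval、Business Outlook 三个字段使用 v/r/x/o 符号编码。下面是一个把符号编码映射为可读标签的最小示意（映射含义依据上表说明，列名以实际数据为准，demo 数据仅作演示）：</p>

```python
import pandas as pd

# v/r/x/o 符号编码的含义(按上表说明)
code_map = {'v': '正面', 'r': '轻微', 'x': '负面', 'o': '无意见'}

# 构造演示数据；实际使用时把 demo 换成读取后的评论数据 df
demo = pd.DataFrame({'Recommend': ['v', 'x', 'o', 'r']})
demo['Recommend_label'] = demo['Recommend'].map(code_map)
print(demo['Recommend_label'].tolist())  # ['正面', '负面', '无意见', '轻微']
```

<p>未在 code_map 中出现的取值经 <em><strong>map</strong></em> 后会变为缺失值，必要时可再用 fillna 补默认标签。</p>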
<p><br><br></p>
<h2 id="二实验">二、实验</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;glassdoor_review.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 27.4 s, sys: 3.47 s, total: 30.9 s
Wall time: 31 s
</code></pre></div><p><img loading="lazy" src="img/02-df.jpg" alt=""  />
</p>
<br>
<h3 id="22-字段缺失程度">2.2 字段缺失程度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-miss.png" alt=""  />
</p>
<p>从上图可知，advice 字段对应的列几乎纯白，即该字段几乎全为缺失值。而黑白相间的列则存在一定比例的缺失值，如</p>
<ul>
<li>Career Opportunities</li>
<li>Compensation and Benefits</li>
<li>Senior Management</li>
<li>Work/Life Balance</li>
<li>Culture &amp; Values</li>
<li>Diversity &amp; Inclusion</li>
</ul>
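<p>除了 missingno 的矩阵图，也可以直接用 pandas 计算各字段的精确缺失比例。下面是一个最小示意（用一个构造的小数据框演示，实际使用时把 df 换成读取后的评论数据）：</p>

```python
import pandas as pd
import numpy as np

# 构造一个带缺失值的小数据框作演示
df = pd.DataFrame({
    'advice': [np.nan, np.nan, np.nan, '多沟通'],
    'pros': ['氛围好', np.nan, '福利好', '弹性大'],
    'rating': [5, 4, 3, 5],
})

# isna().mean() 即各字段缺失比例，再按缺失程度从高到低排序
miss_ratio = df.isna().mean().sort_values(ascending=False)
print(miss_ratio)
```

<p>输出的比例与矩阵图中白色区域的占比相对应，便于在正式分析前决定哪些字段需要剔除或填补。</p>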
<br>
<h3 id="23-公司数">2.3 公司数</h3>
<p>数据集中涉及的公司数量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;公司数:&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">firm_link</span><span class="o">.</span><span class="n">nunique</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">公司数:35541
</code></pre></div><br>
<h3 id="24-覆盖日期">2.4 覆盖日期</h3>
<p>员工评价发布日期覆盖(起止)范围</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">date</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">date</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
<span class="nb">print</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期: 2008-01-25 ~2023-07-26
</code></pre></div><br>
<h2 id="三可视化">三、可视化</h2>
<p>可视化数据集内英国公司评论记录量（2008.1~2023.7），绘制柱状图。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">years</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>


<span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="n">years</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;数据集内英国公司评论记录量（2008.1~2023.7）&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span> <span class="o">=</span> <span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2008</span><span class="p">,</span> <span class="mi">2024</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-vis.png" alt=""  />
</p>
<br>]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | Layline美股内幕交易数据集</title>
      <link>https://textdata.cn/blog/2025-03-11-layline-insider-trading-dataset/</link>
      <pubDate>Tue, 11 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-11-layline-insider-trading-dataset/</guid>
      <description>该数据集捕捉了公开交易公司的内幕交易活动。证券交易委员会自 2003 年中以来在其网站上以结构化格式提供了这些内幕交易报告。然而，大多数学术论文使用的是商业数据库而非直接使用监管文件，这使得复制变得困难，因为商业数据库中的数据操作和聚合步骤是不透明的，而且随着时间的推移，数据提供者可能会更改历史记录。为了克服这些限制，本数据集是从原始监管文件创建的；它每天更新，并包括内幕人士报告的所有信息，未经修改。This dataset captures insider trading activity at publicly traded companies. The Securities and Exchange Commission has made these insider trading reports available on its web site in a structured format since mid-2003. However, most academic papers use proprietary commercial databases instead of regulatory filings directly, which makes replication challenging because the data manipulation and aggregation steps in commercial databases are opaque and historical records could be altered by the data provider over time. To overcome these limitations, the presented dataset is created from the original regulatory filings; it is updated daily and includes all information reported by insiders without alteration.</description>
      <content:encoded><![CDATA[<p>该数据集捕捉了上市公司内幕交易活动。投资者和投资分析师需要这些信息，因为高管、董事和大股东被认为比外界人士更了解其公司的前景。内幕交易中的股票买卖可能揭示了财务报表中未披露的公司业务信息。如果内幕人士能够更好地解读有关公司的公开信息，这些交易还可能传递出预测股价变动的新信息。</p>
<p>自 2003 年中起，证券交易委员会以结构化格式向公众提供了这些内幕交易报告；然而，大多数学术论文使用的是商业数据库，而非直接使用监管文件。这使得复制研究变得困难：商业数据库中的数据处理和聚合过程不透明，且历史记录可能随时间被数据提供商修改。为了克服这些限制，本数据集直接从原始监管文件创建；它每天更新，并完整保留内幕人士报告的所有信息，未经修改。</p>
<p><br><br></p>
<h2 id="一-数据集介绍">一、 数据集介绍</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名:  Layline内幕交易数据集(Layline insider trading dataset)
覆盖日期: 2003-06-30 ~ 2023-02-02
数据体积:  解压后35G(截止2023-02-02)
数据来源: https://www.sec.gov/edgar
下载数据:  https://www.kaggle.com/datasets/layline/insidertrading
引用数据:  Balogh, Attila, 2023, &#34;Layline insider trading dataset&#34;, https://doi.org/10.7910/DVN/VH6GVH, Harvard Dataverse, V419
本文说明: 科研用途; 如有问题，请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/01-harvard.png" alt=""  />
</p>
<br>
<blockquote>
<p>Layline项目是一个研究倡议，旨在利用高性能和云计算创建金融经济学领域的公开可访问数据集。它通过民主化数据访问来降低进入门槛，同时也通过促进复制研究增加了该领域的透明度。</p>
</blockquote>
<p><img loading="lazy" src="img/05-layline.png" alt=""  />
</p>
<br>
<br>
<h2 id="二实验代码">二、实验代码</h2>
<p><strong>一般 1G 大小的 csv 文件，读入后大约要消耗 1G ~ 5G 内存</strong>。该数据集相关文件截图如下，可以看到文件的体积都比较大。目前大家使用的电脑内存大多为 8G 或 16G，少部分同学使用 32G 以上。接下来以 <em><strong>lit_footnotes.csv</strong></em> 为例，简单学习 <em><strong>pd.read_csv</strong></em> 的读取技巧。</p>
<p><img loading="lazy" src="img/02-screen.png" alt=""  />
</p>
<p><strong>pd.read_csv(filepath_or_buffer, nrows, usecols, engine=&lsquo;pyarrow&rsquo;, dtype_backend=&lsquo;pyarrow&rsquo;,  chunksize)</strong></p>
<ul>
<li><em><strong>filepath_or_buffer</strong></em>   csv数据文件路径</li>
<li><em><strong>nrows</strong></em>  限定读取行数</li>
<li><em><strong>usecols</strong></em> 选取部分字段进行读取(列表)</li>
<li><em><strong>engine</strong></em>  设置读取引擎， 可选python、c、pyarrow。
<ul>
<li>c 为默认引擎，读取速度较快。</li>
<li>python 引擎功能最全、兼容性最佳，但速度较慢。</li>
<li>pyarrow 可并行读取，速度最快，但兼容性较差，容易报错。</li>
</ul>
</li>
<li><em><strong>dtype_backend</strong></em>  设置为pyarrow后，可大幅降低该数据在内存中的占用量</li>
<li><em><strong>chunksize</strong></em>  每批次读取的行数；当文件体积远超电脑内存时，可将大文件拆分，分批次读取。</li>
</ul>
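<p>综合上面几个参数，下面给出一个分块读取的最小示意：用 <em><strong>usecols</strong></em> 只保留需要的字段，用 <em><strong>chunksize</strong></em> 分批读取（为保证可直接运行，这里用 StringIO 构造一个 10 行的演示 csv；实际场景换成 lit_footnotes.csv 等大文件的路径，字段名也以实际数据为准）：</p>

```python
import io
import pandas as pd

# 演示用的小 csv 文本(实际场景传入大文件路径即可)
csv_text = 'id,text,extra\n' + '\n'.join(f'{i},note,{i}' for i in range(10))

total = 0
# usecols 只读取需要的字段；chunksize=4 表示每批 4 行，避免一次性占满内存
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['id', 'text'], chunksize=4):
    total += len(chunk)  # 每个 chunk 都是一个普通的 DataFrame

print(total)  # 10 行分 3 批(4+4+2)读完
```

<p>分批读取时可以在循环体内先筛选、聚合，只保留结果，这样即使原始文件有几十 G 也能在小内存机器上处理。</p>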
<br>
<p>这里设计一个df内存查看函数，单位GB</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>  
    <span class="n">bytes_value</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="nb">round</span><span class="p">(</span><span class="n">bytes_value</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">**</span> <span class="mi">3</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>  
  
  
<span class="k">def</span> <span class="nf">bytes_to_MB</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>  
    <span class="n">bytes_value</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="nb">round</span><span class="p">(</span><span class="n">bytes_value</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">**</span> <span class="mi">2</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>  
</code></pre></div><br>
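<p>定义好之后，可以对任意 df 快速查看内存占用。下面用一个构造的10万行小 DataFrame 演示（数据为假设）：</p>

```python
import pandas as pd

def bytes_to_MB(df):
    bytes_value = df.memory_usage(deep=True).sum()
    return round(bytes_value / (1024 ** 2), 2)

# 构造一个10万行的示例DataFrame
df = pd.DataFrame({'id': range(100000),
                   'content': ['some example text'] * 100000})
print(f"内存占用: {bytes_to_MB(df)} MB")
```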
<h3 id="21-nrows">2.1 nrows</h3>
<p>使用 <em><strong>nrows</strong></em> 参数设置只读取前n条记录， 了解csv字段有哪些</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#只读取csv中前5条记录</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#一次性读取全部记录(对照基线)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 35.7 s, sys: 4.55 s, total: 40.3 s
Wall time: 37.5 s

内存占用: 10.84 GB
</code></pre></div><br>
<h3 id="22-usecols">2.2 usecols</h3>
<p>指定某几个字段进行读取</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#指定某些字段读取</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;filingDate&#39;</span><span class="p">,</span> <span class="s1">&#39;id&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">内存占用: 6.95 GB
CPU times: user 32.2 s, sys: 1.55 s, total: 33.8 s
Wall time: 34.1 s
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<h3 id="23-engine">2.3 engine</h3>
<p>可指定 <em><strong>engine=&lsquo;pyarrow&rsquo;</strong></em>,  来提高读取速度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">内存占用: 10.84 GB
CPU times: user 23.3 s, sys: 12.6 s, total: 35.9 s
Wall time: 19.7 s
</code></pre></div><p><strong>注: 有时候使用engine=&lsquo;pyarrow&rsquo;， 代码容易报错， 这时候就只能放弃这个方法，乖乖用默认引擎读取。</strong> 经大邓实验，本数据集全部csv文件均可正常使用 <em><strong>engine=&lsquo;pyarrow&rsquo;</strong></em> 。</p>
<br>
<h3 id="24-dtype_backend">2.4 dtype_backend</h3>
<p>指定 <em><strong>dtype_backend=&lsquo;pyarrow&rsquo;</strong></em> 理论上会大大降低内存占用，但读取速度不一定提高。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">内存占用: 7.19 GB
CPU times: user 44 s, sys: 8 s, total: 52 s
Wall time: 53.5 s
</code></pre></div><br>
<p>同时指定 <em><strong>engine</strong></em> 和 <em><strong>dtype_backend</strong></em> 两个参数， 会明显提高读取速度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">内存占用: 7.19 GB
CPU times: user 9.88 s, sys: 7.55 s, total: 17.4 s
Wall time: 2.12 s
</code></pre></div><br>
<h3 id="对比">对比</h3>
<table>
<thead>
<tr>
<th>参数</th>
<th>解析速度</th>
<th>内存占用</th>
</tr>
</thead>
<tbody>
<tr>
<td><em><strong>pd.read_csv(csvf)</strong></em></td>
<td>最慢</td>
<td>最大</td>
</tr>
<tr>
<td><em><strong>pd.read_csv(csvf, engine=&lsquo;pyarrow&rsquo;)</strong></em></td>
<td>较快</td>
<td>中等</td>
</tr>
<tr>
<td><em><strong>pd.read_csv(csvf, engine=&lsquo;pyarrow&rsquo;, dtype_backend=&lsquo;pyarrow&rsquo;)</strong></em></td>
<td><strong>最快</strong></td>
<td><strong>最小</strong></td>
</tr>
</tbody>
</table>
<br>
<h3 id="25-chunksize">2.5 chunksize</h3>
<p>探索完前n行、选定所需的列之后，我们已经了解哪些字段是必须要用的，以及它们占用系统内存的大小。</p>
<p>接下来，我们就可以尝试着按照批次读取数据。</p>
<p>为了让实验简单高效，我们假设只读取前50000行， 每批次是10000 行。 对比下占用系统内存的量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#一次性读取50000条记录</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;一次性读取内存占用: </span><span class="si">{</span><span class="n">bytes_to_MB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> MB&#34;</span><span class="p">)</span> 


<span class="c1">#分批次读取</span>
<span class="c1">#每10000条记录是一个批次，得到chunk_dfs</span>
<span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;lit_footnotes.csv&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">50000</span><span class="p">,</span> <span class="n">chunksize</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span>
<span class="c1">#每个chunk_df就是我们熟悉的dataframe类型数据</span>
<span class="k">for</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="n">chunk_dfs</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;分批次读取内存占用: </span><span class="si">{</span><span class="n">bytes_to_MB</span><span class="p">(</span><span class="n">chunk_df</span><span class="p">)</span><span class="si">}</span><span class="s2"> MB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">一次性读取内存占用: 23.04 MB
分批次读取内存占用: 4.71 MB
分批次读取内存占用: 4.73 MB
分批次读取内存占用: 4.43 MB
分批次读取内存占用: 4.68 MB
分批次读取内存占用: 4.5 MB
</code></pre></div><br>
<p>在实践中，<em><strong>nrows</strong></em>  和 <em><strong>chunksize</strong></em> 不会同时出现， 而且 <em><strong>chunksize</strong></em> 一般都会设置的很大，例如1000000条。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">chunk_dfs = pd.read_csv(&#39;csv文件&#39;, chunksize=1000000)
</code></pre></div><p>看到 <em><strong>chunk_dfs</strong></em> 也不要害怕，其实每个 <em><strong>chunk_df</strong></em> 就是我们熟悉的 <em><strong>df</strong></em> ，即dataframe数据类型。</p>
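<p>分批次读取最常见的用法是：对每个 chunk_df 做筛选或统计，只保留小体积的结果，最后再合并。下面是一个用构造小csv的示意（字段与筛选条件均为假设，真实大文件把 chunksize 调大即可）：</p>

```python
import io
import pandas as pd

# 构造100行的示例csv(字段为假设)
csv_text = "id,score\n" + "\n".join(f"{i},{i % 10}" for i in range(100))

# 分批读取, 示例中每批25行; 真实大文件可设chunksize=1000000
results = []
for chunk_df in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    # 每个chunk_df都是普通DataFrame, 在这里完成筛选, 只保留小结果
    results.append(chunk_df[chunk_df['score'] >= 8])

final_df = pd.concat(results, ignore_index=True)
print(len(final_df))  # 20
```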
<br>
<br>
<h2 id="三总结">三、总结</h2>
<p>记住这行代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pd.read_csv(csvf, nrows, usecols, engine, dtype_backend, chunksize)
</code></pre></div><p>8G内存的电脑， 通过以上技巧，基本可以把我们应对大数据的潜力放大N倍。 N可以是几倍、十几倍、几十倍、上百倍&hellip;。<strong>在放大潜力的过程中</strong>：</p>
<ul>
<li><em><strong>usecols</strong></em> 和 <em><strong>chunksize</strong></em>  起主要作用，百试百爽，稳定不出错。</li>
<li><em><strong>engine</strong></em>  和  <em><strong>dtype_backend</strong></em>  提高读取速度并降低内存占用，但代码容易出错。</li>
<li><em><strong>chunksize</strong></em>、<em><strong>nrows</strong></em> 参数不能与 <em><strong>engine=&lsquo;pyarrow&rsquo;</strong></em> 同时使用(pyarrow引擎不支持这两个参数)，但可以与 <em><strong>dtype_backend=&lsquo;pyarrow&rsquo;</strong></em> 搭配。</li>
</ul>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">推荐 | 如何处理远超电脑内存的csv文件</a></li>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的科研数据集清单</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>推荐 | 如何处理远超电脑内存的 csv 文件</title>
      <link>https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/</link>
      <pubDate>Mon, 10 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/</guid>
      <description>&lt;h2 id=&#34;一问题&#34;&gt;一、问题&lt;/h2&gt;
&lt;p&gt;最近分享的数据集都是体量巨大，&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/&#34;&gt;&lt;strong&gt;93G数据集| 中国裁判文书网(2010-2021)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-12-china-mainland-corporate-registration-information/&#34;&gt;&lt;strong&gt;数据集 | 2.49亿条中国大陆工商企业注册信息(更新至23.9)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-12-china-poi-datasets/&#34;&gt;&lt;strong&gt;数据集|  3.9G全国POI地点兴趣点数据集&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/&#34;&gt;&lt;strong&gt;数据集|  5112万条专利申请数据集(1985-2025年)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/&#34;&gt;&lt;strong&gt;数据集 | 1500w+消费者投诉数据集(2018 ~ 2024.8)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;下图是 &lt;a href=&#34;https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/&#34;&gt;&lt;strong&gt;数据集 | 5112万条专利申请数据集(1985-2025年)&lt;/strong&gt;&lt;/a&gt;截图，其中 &lt;em&gt;&lt;strong&gt;广东省.csv.gz&lt;/strong&gt;&lt;/em&gt;  4.01 G，解压后得到的 &lt;em&gt;&lt;strong&gt;广东省.csv&lt;/strong&gt;&lt;/em&gt; 达到 15.78G， 已经超过很多学员电脑内存（现在常见的笔记本内存是8G、16G、32G），我们应该如何应对这类 &lt;strong&gt;巨大csv文件&lt;/strong&gt; 呢？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/screen-datasets.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二思路&#34;&gt;二、思路&lt;/h2&gt;
&lt;p&gt;一般应对 &lt;em&gt;&lt;strong&gt;广东省.csv.gz&lt;/strong&gt;&lt;/em&gt; 这类巨大csv文件，可以从以下两大类思路:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;思路1. 使用更高配置的电脑&lt;/strong&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;思路2. 花点功夫学大文件处理技巧&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&#34;21-使用更高配置的电脑服务器&#34;&gt;2.1 使用更高配置的电脑(服务器)&lt;/h3&gt;
&lt;p&gt;思路1方法简单直接， 写代码的方式一如既往， 认知成本低， 美中不足是要花钱。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;买电脑； 如果你不差钱，直接换更好的电脑， 8G&amp;ndash;&amp;gt;16G&amp;ndash;&amp;gt;32G&amp;ndash;&amp;gt;64G&amp;ndash;&amp;gt;96G&amp;ndash;&amp;gt;128G&amp;hellip;  预算决定数据处理能力的上限。&lt;/li&gt;
&lt;li&gt;租用服务器；如果差钱，资金不足脑力凑。 租用服务器的难点是像你我刚接触电脑一样，要熟悉服务器操作，前期存在较大的认知难度和学习难度。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;22--花点功夫学大文件处理技巧&#34;&gt;2.2  花点功夫学大文件处理技巧&lt;/h3&gt;
&lt;p&gt;网上关于处理大文件的技巧虽然很多，比如针对每个字段的数据类型，整形、浮点型、64位、32位， 反正大邓是不太懂。 咱们学python的原则是，用最少的时间学到最常用最有用的，解决80%的问题，剩下的20%太难的问题还是交给专业人士。假设你我电脑内存是8G，要在此环境下进行数据处理， 以下是常见的处理方法&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;读取前n条记录&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;读取某个(些)字段&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;小批次读取&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;转csv为xlsx&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;在接下来的章节中，我们重点分享以上4类技巧代码。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三好技巧&#34;&gt;三、好技巧&lt;/h2&gt;
&lt;p&gt;csv、xlsx这类数据， 每行代表一条记录，每列代表一个字段，文件体积由行数和列数决定。 &lt;em&gt;&lt;strong&gt;pd.read_csv&lt;/strong&gt;&lt;/em&gt;有三个最常用的参数nrows、usecols、chunksize，分别决定读前nrows行、选择usecols列读取、按照chunksize分批次读取。&lt;/p&gt;
&lt;p&gt;我选择以 &lt;a href=&#34;https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/&#34;&gt;&lt;strong&gt;数据集 | 1500w+消费者投诉数据集(2018 ~ 2024.8)&lt;/strong&gt;&lt;/a&gt; 中的文件 &lt;em&gt;&lt;strong&gt;消费者黑猫投诉数据.csv.gz(解压后3.63G)&lt;/strong&gt;&lt;/em&gt; 为例进行实验。 该文件格式较为干净， 不会出现太多意外情况，能更好的展示实验效果。&lt;/p&gt;
&lt;p&gt;对这个csv文件，除了知道文件名，其他信息一无所知。这时候最简单的技巧就是尝试着读取前n条记录，先了解字段有哪些。&lt;/p&gt;
&lt;h3 id=&#34;31-nrows&#34;&gt;3.1 nrows&lt;/h3&gt;
&lt;p&gt;使用 &lt;em&gt;&lt;strong&gt;nrows&lt;/strong&gt;&lt;/em&gt; 参数设置只读取前n条记录， 了解csv字段有哪些&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#只读取csv中前5条记录&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#使用bandizp、winrar等常用的解压软件解压gz文件，得到csv文件&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;消费者黑猫投诉数据.csv&amp;#39;, nrows=5)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-usecols&#34;&gt;3.2 usecols&lt;/h3&gt;
&lt;p&gt;使用usecols参数，设置只读取某个(些)字段&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;标题&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;投诉时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;进度时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-bytes_to_gbdf&#34;&gt;3.3 bytes_to_GB(df)&lt;/h3&gt;
&lt;p&gt;设计一个查看文件内存的函数， 单位GB&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;bytes_to_GB&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;  
    &lt;span class=&#34;n&#34;&gt;bytes_value&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;memory_usage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;deep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bytes_value&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1024&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bytes_to_GB&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; GB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 35.7 s, sys: 1.62 s, total: 37.3 s
Wall time: 37.3 s
内存占用: 10.95 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-engine&#34;&gt;3.4 engine&lt;/h3&gt;
&lt;p&gt;可指定 &lt;em&gt;&lt;strong&gt;engine=&amp;lsquo;pyarrow&amp;rsquo;&lt;/strong&gt;&lt;/em&gt;,  来提高读取速度。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pyarrow&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bytes_to_GB&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; GB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 19.1 s, sys: 2.81 s, total: 21.9 s
Wall time: 14.1 s
内存占用: 9.14 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;注: 使用engine=&amp;lsquo;pyarrow&amp;rsquo;时， 代码容易报错， 这时候就只能放弃这个方法，乖乖用默认引擎读取。&lt;/strong&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;35-dtype_backend&#34;&gt;3.5 dtype_backend&lt;/h3&gt;
&lt;p&gt;指定 &lt;em&gt;&lt;strong&gt;dtype_backend=&amp;lsquo;pyarrow&amp;rsquo;&lt;/strong&gt;&lt;/em&gt; 理论上会大大降低内存占用，但读取速度不一定提高。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;dtype_backend&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pyarrow&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bytes_to_GB&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; GB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 54.1 s, sys: 5.59 s, total: 59.7 s
Wall time: 1min
内存占用: 9.14 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;同时指定 &lt;em&gt;&lt;strong&gt;engine&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;dtype_backend&lt;/strong&gt;&lt;/em&gt; 两个参数， 会明显提高读取速度&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pyarrow&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;dtype_backend&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pyarrow&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bytes_to_GB&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; GB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 10.9 s, sys: 2.4 s, total: 13.3 s
Wall time: 4.75 s

内存占用: 3.36 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;对比&#34;&gt;对比&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;参数&lt;/th&gt;
&lt;th&gt;解析速度&lt;/th&gt;
&lt;th&gt;内存占用&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;pd.read_csv(csvf)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;最慢&lt;/td&gt;
&lt;td&gt;最大&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;pd.read_csv(csvf, engine=&amp;lsquo;pyarrow&amp;rsquo;)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;较快&lt;/td&gt;
&lt;td&gt;中等&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;pd.read_csv(csvf, engine=&amp;lsquo;pyarrow&amp;rsquo;, dtype_backend=&amp;lsquo;pyarrow&amp;rsquo;)&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;最快&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;最小&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;36-chunksize&#34;&gt;3.6 chunksize&lt;/h3&gt;
&lt;p&gt;探索完前n行、选中某些列之后，我们已经了解哪些字段是必须要用的，以及它们占用系统内存的大小。&lt;/p&gt;
&lt;p&gt;接下来，我们就可以尝试着按照批次读取数据。&lt;/p&gt;
&lt;p&gt;为了让实验简单高效，我们假设只读取前50000行， 每批次是10000 行。 对比下占用系统内存的量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#一次性读取前50000条记录&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;50000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;一次性读取内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;memory_usage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;deep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1024&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; MB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 



&lt;span class=&#34;c1&#34;&gt;#分批次读取&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#每10000条记录是一个批次，得到chunk_dfs&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;消费者黑猫投诉数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunksize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;50000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#每个chunk_df就是我们熟悉的dataframe类型数据&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chunk_dfs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;chunkdf_total_mb&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chunk_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;memory_usage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;deep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1024&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;分批次读取内存占用: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chunkdf_total_mb&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; MB&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;一次性读取内存占用: 36.62 MB

分批次读取内存占用: 7.32 MB
分批次读取内存占用: 7.32 MB
分批次读取内存占用: 7.33 MB
分批次读取内存占用: 7.33 MB
分批次读取内存占用: 7.32 MB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;在实践中，&lt;em&gt;&lt;strong&gt;nrows&lt;/strong&gt;&lt;/em&gt;  和 &lt;em&gt;&lt;strong&gt;chunksize&lt;/strong&gt;&lt;/em&gt; 不会同时出现， 而且 &lt;em&gt;&lt;strong&gt;chunksize&lt;/strong&gt;&lt;/em&gt; 一般都会设置得很大，例如1000000条。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;chunk_dfs = pd.read_csv(&amp;#39;csv文件&amp;#39;, chunksize=1000000)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;看到 &lt;em&gt;&lt;strong&gt;chunk_dfs&lt;/strong&gt;&lt;/em&gt; 也不要害怕，它是一个可迭代对象，其中每个 &lt;em&gt;&lt;strong&gt;chunk_df&lt;/strong&gt;&lt;/em&gt; 就是我们熟悉的 &lt;em&gt;&lt;strong&gt;df&lt;/strong&gt;&lt;/em&gt;，即 DataFrame 数据类型。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四总结&#34;&gt;四、总结&lt;/h2&gt;
&lt;p&gt;记住这行代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pd.read_csv(csvf, nrows, usecols, engine, dtype_backend, chunksize)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;即使是8G内存的电脑， 通过以上技巧，应对大数据的能力基本可以放大N倍， N可以是几倍、十几倍、几十倍、上百倍&amp;hellip;。&lt;strong&gt;放大潜力的要点如下&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;usecols&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;chunksize&lt;/strong&gt;&lt;/em&gt;  起主要作用，百试百爽，稳定不出错。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;engine&lt;/strong&gt;&lt;/em&gt;  和  &lt;em&gt;&lt;strong&gt;dtype_backend&lt;/strong&gt;&lt;/em&gt;  提高读取速度并降低内存占用，但代码容易出错。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;chunksize&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;nrows&lt;/strong&gt;&lt;/em&gt; 参数不能与 &lt;em&gt;&lt;strong&gt;engine&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;dtype_backend&lt;/strong&gt;&lt;/em&gt;同时使用。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一问题">一、问题</h2>
<p>最近分享的数据集体量都非常大，例如：</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/"><strong>93G数据集| 中国裁判文书网(2010-2021)</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-04-12-china-mainland-corporate-registration-information/"><strong>数据集 | 2.49亿条中国大陆工商企业注册信息(更新至23.9)</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-04-12-china-poi-datasets/"><strong>数据集|  3.9G全国POI地点兴趣点数据集</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/"><strong>数据集|  5112万条专利申请数据集(1985-2025年)</strong></a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/"><strong>数据集 | 1500w+消费者投诉数据集(2018 ~ 2024.8)</strong></a></li>
</ul>
<br>
<p>下图是 <a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/"><strong>数据集 | 5112万条专利申请数据集(1985-2025年)</strong></a>截图，其中 <em><strong>广东省.csv.gz</strong></em>  4.01 G，解压后得到的 <em><strong>广东省.csv</strong></em> 达到 15.78G， 已经超过很多学员电脑内存（现在常见的笔记本内存是8G、16G、32G），我们应该如何应对这类 <strong>巨大csv文件</strong> 呢？</p>
<p><img loading="lazy" src="img/screen-datasets.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二思路">二、思路</h2>
<p>一般应对 <em><strong>广东省.csv.gz</strong></em> 这类巨大csv文件，可以从以下两大类思路:</p>
<p><strong>思路1. 使用更高配置的电脑</strong><br></p>
<p><strong>思路2. 花点功夫学大文件处理技巧</strong></p>
<h3 id="21-使用更高配置的电脑服务器">2.1 使用更高配置的电脑(服务器)</h3>
<p>思路1方法简单， 写代码的方式一如既往， 认知成本低， 美中不足是要花钱。</p>
<ul>
<li>买电脑； 如果你不差钱，直接换更好的电脑， 8G&ndash;&gt;16G&ndash;&gt;32G&ndash;&gt;64G&ndash;&gt;96G&ndash;&gt;128G&hellip;  预算决定数据处理能力的上限。</li>
<li>租用服务器；如果差钱，资金不足脑力凑。 租用服务器的难点在于要像初次接触电脑一样从头熟悉服务器操作，前期认知和学习成本较大。</li>
</ul>
<br>
<h3 id="22--花点功夫学大文件处理技巧">2.2  花点功夫学大文件处理技巧</h3>
<p>网上关于处理大文件的技巧虽然很多，比如针对每个字段的数据类型（整型、浮点型、64位、32位）做优化，反正大邓是不太懂。 咱们学Python的原则是，用最少的时间学最常用最有用的，解决80%的问题，剩下20%太难的问题还是交给专业人士。假设你我电脑内存是8G，要在此环境下进行数据处理， 以下是常见的处理方法</p>
<ol>
<li>
<p>读取前n条记录</p>
</li>
<li>
<p>读取某个(些)字段</p>
</li>
<li>
<p>小批次读取</p>
</li>
<li>
<p>转csv为xlsx</p>
</li>
</ol>
<p>在接下来的章节中，我们重点分享以上这些技巧的代码。</p>
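<p>其中第4条「转csv为xlsx」在后文没有单独展开，这里先给出一个最小示意（demo.csv 为虚构的演示文件，并非原文数据集；注意 xlsx 单个工作表约有104万行的上限，所以大文件需要分批写成多个 xlsx）：</p>

```python
import pandas as pd

# 构造一个演示用的小 csv（虚构数据，仅为演示转换流程）
pd.DataFrame({'标题': ['a', 'b', 'c', 'd'],
              '金额': [1, 2, 3, 4]}).to_csv('demo.csv', index=False)

# 分批读取 csv，每个批次写成一个独立的 xlsx 文件
# xlsx 单表最多约 104 万行，大文件必须像这样拆分
for i, chunk_df in enumerate(pd.read_csv('demo.csv', chunksize=2)):
    chunk_df.to_excel(f'demo_part{i}.xlsx', index=False)
```

<p>实际使用时把 chunksize 设置在一百万以内即可，既能写入 xlsx，又不会一次性占满内存。</p>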
<p><br><br></p>
<h2 id="三好技巧">三、好技巧</h2>
<p>csv、xlsx这类数据， 每行代表一条记录，每列代表一个字段，文件体积由行数和列数决定。 <em><strong>pd.read_csv</strong></em> 有三个最常用的参数 nrows、usecols、chunksize，分别决定只读前 nrows 行、只读取 usecols 列、按照 chunksize 分批次读取。</p>
<p>我选择以 <a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/"><strong>数据集 | 1500w+消费者投诉数据集(2018 ~ 2024.8)</strong></a> 中的文件 <em><strong>消费者黑猫投诉数据.csv.gz(解压后3.63G)</strong></em> 为例进行实验。 该文件格式较为干净， 不会出现太多意外情况，能更好地展示实验效果。</p>
<p>对这个csv文件，除了知道文件名，其他信息一无所知。这时候最简单的技巧就是尝试着读取前n条记录，先了解字段有哪些。</p>
<h3 id="31-nrows">3.1 nrows</h3>
<p>使用 <em><strong>nrows</strong></em> 参数设置只读取前n条记录， 了解csv字段有哪些</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#只读取csv中前5条记录</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#使用Bandizip、WinRAR等常用的解压软件解压gz文件，得到csv文件</span>
<span class="c1">#df = pd.read_csv(&#39;消费者黑猫投诉数据.csv&#39;, nrows=5)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="32-usecols">3.2 usecols</h3>
<p>使用usecols参数，设置只读取某个(些)字段</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;标题&#39;</span><span class="p">,</span> <span class="s1">&#39;投诉时间&#39;</span><span class="p">,</span><span class="s1">&#39;进度时间&#39;</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="33-bytes_to_gbdf">3.3 bytes_to_GB(df)</h3>
<p>设计一个查看文件内存的函数， 单位GB</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>  
    <span class="n">bytes_value</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="nb">round</span><span class="p">(</span><span class="n">bytes_value</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">**</span> <span class="mi">3</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>  
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span>  
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 35.7 s, sys: 1.62 s, total: 37.3 s
Wall time: 37.3 s
内存占用: 10.95 GB
</code></pre></div><br>
<h3 id="34-engine">3.4 engine</h3>
<p>可指定 <em><strong>engine=&lsquo;pyarrow&rsquo;</strong></em>,  来提高读取速度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 19.1 s, sys: 2.81 s, total: 21.9 s
Wall time: 14.1 s
内存占用: 9.14 GB
</code></pre></div><p><strong>注: 使用engine=&lsquo;pyarrow&rsquo; 时代码容易报错， 这时候就只能放弃这个方法，乖乖地用默认方式读取。</strong></p>
<br>
<h3 id="35-dtype_backend">3.5 dtype_backend</h3>
<p>指定 <em><strong>dtype_backend=&lsquo;pyarrow&rsquo;</strong></em> 理论上会大大降低内存占用，但读取速度不一定提高。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> 
                  <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                  <span class="n">dtype_backend</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 54.1 s, sys: 5.59 s, total: 59.7 s
Wall time: 1min
内存占用: 9.14 GB
</code></pre></div><br>
<p>同时指定 <em><strong>engine</strong></em> 和 <em><strong>dtype_backend</strong></em> 两个参数， 既明显提高读取速度，又显著降低内存占用</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> 
                  <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                  <span class="n">engine</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">,</span> 
                  <span class="n">dtype_backend</span><span class="o">=</span><span class="s1">&#39;pyarrow&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;内存占用: </span><span class="si">{</span><span class="n">bytes_to_GB</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="si">}</span><span class="s2"> GB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 10.9 s, sys: 2.4 s, total: 13.3 s
Wall time: 4.75 s

内存占用: 3.36 GB
</code></pre></div><br>
<h3 id="对比">对比</h3>
<table>
<thead>
<tr>
<th>参数</th>
<th>解析速度</th>
<th>内存占用</th>
</tr>
</thead>
<tbody>
<tr>
<td><em><strong>pd.read_csv(csvf)</strong></em></td>
<td>最慢</td>
<td>最大</td>
</tr>
<tr>
<td><em><strong>pd.read_csv(csvf, engine=&lsquo;pyarrow&rsquo;)</strong></em></td>
<td>较快</td>
<td>中等</td>
</tr>
<tr>
<td><em><strong>pd.read_csv(csvf, engine=&lsquo;pyarrow&rsquo;, dtype_backend=&lsquo;pyarrow&rsquo;)</strong></em></td>
<td><strong>最快</strong></td>
<td><strong>最小</strong></td>
</tr>
</tbody>
</table>
<br>
<h3 id="36-chunksize">3.6 chunksize</h3>
<p>探索完前n行、选中某些列之后，我们已经了解哪些字段是必须要用的，以及它们占用系统内存的大小。</p>
<p>接下来，我们就可以尝试着按照批次读取数据。</p>
<p>为了让实验简单高效，我们假设只读取前50000行， 每批次是10000 行。 对比下占用系统内存的量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#一次性读取前50000条记录</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;一次性读取内存占用: </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">**</span> <span class="mi">2</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s2"> MB&#34;</span><span class="p">)</span> 



<span class="c1">#分批次读取</span>
<span class="c1">#每10000条记录是一个批次，得到chunk_dfs</span>
<span class="n">chunk_dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">chunksize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span>

<span class="c1">#每个chunk_df就是我们熟悉的dataframe类型数据</span>
<span class="k">for</span> <span class="n">chunk_df</span> <span class="ow">in</span> <span class="n">chunk_dfs</span><span class="p">:</span>
    <span class="n">chunkdf_total_mb</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">chunk_df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">**</span> <span class="mi">2</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>  
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;分批次读取内存占用: </span><span class="si">{</span><span class="n">chunkdf_total_mb</span><span class="si">}</span><span class="s2"> MB&#34;</span><span class="p">)</span> 
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">一次性读取内存占用: 36.62 MB

分批次读取内存占用: 7.32 MB
分批次读取内存占用: 7.32 MB
分批次读取内存占用: 7.33 MB
分批次读取内存占用: 7.33 MB
分批次读取内存占用: 7.32 MB
</code></pre></div><br>
<p>在实践中，<em><strong>nrows</strong></em>  和 <em><strong>chunksize</strong></em> 不会同时出现， 而且 <em><strong>chunksize</strong></em> 一般都会设置得很大，例如1000000条。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">chunk_dfs = pd.read_csv(&#39;csv文件&#39;, chunksize=1000000)
</code></pre></div><p>看到 <em><strong>chunk_dfs</strong></em> 也不要害怕，它是一个可迭代对象，其中每个 <em><strong>chunk_df</strong></em> 就是我们熟悉的 <em><strong>df</strong></em>，即 DataFrame 数据类型。</p>
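<p>分批次读取在实际中通常是「边读边筛选、最后合并」。下面是一个可直接运行的最小示意（demo.csv.gz 及其字段均为虚构的演示数据，并非原文的投诉数据集）：</p>

```python
import pandas as pd

# 构造一个演示用的小 csv.gz（虚构数据，仅为演示分批次读取）
demo = pd.DataFrame({'标题': [f'投诉{i}' for i in range(100)],
                     '金额': range(100)})
demo.to_csv('demo.csv.gz', index=False, compression='gzip')

# 典型的分批次处理模式：边读边筛选，只保留需要的行，最后合并
keeps = []
for chunk_df in pd.read_csv('demo.csv.gz', chunksize=30):
    keeps.append(chunk_df[chunk_df['金额'] >= 50])

result = pd.concat(keeps, ignore_index=True)
print(len(result))  # 50
```

<p>这样任意时刻内存中最多只有一个批次的数据加上筛选后的结果，而不是整个文件。</p>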
<p><br><br></p>
<h2 id="四总结">四、总结</h2>
<p>记住这行代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pd.read_csv(csvf, nrows, usecols, engine, dtype_backend, chunksize)
</code></pre></div><p>即使是8G内存的电脑， 通过以上技巧，应对大数据的能力基本可以放大N倍， N可以是几倍、十几倍、几十倍、上百倍&hellip;。<strong>放大潜力的要点如下</strong></p>
<ul>
<li><em><strong>usecols</strong></em> 和 <em><strong>chunksize</strong></em>  起主要作用，百试百爽，稳定不出错。</li>
<li><em><strong>engine</strong></em>  和  <em><strong>dtype_backend</strong></em>  提高读取速度并降低内存占用，但代码容易出错。</li>
<li><em><strong>chunksize</strong></em>、<em><strong>nrows</strong></em> 参数不能与 <em><strong>engine</strong></em>、<em><strong>dtype_backend</strong></em>同时使用。</li>
</ul>
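<p>上面提到的 <em><strong>usecols</strong></em> 与 <em><strong>chunksize</strong></em> 组合，可以用下面的小示意验证（demo2.csv.gz 为虚构的演示文件，并非原文数据集）：</p>

```python
import pandas as pd

# 构造演示文件（虚构数据）
pd.DataFrame({'标题': ['a', 'b', 'c', 'd'],
              '投诉时间': ['2024-01', '2024-02', '2024-03', '2024-04'],
              '其他字段': [1, 2, 3, 4]}).to_csv('demo2.csv.gz', index=False,
                                           compression='gzip')

# usecols 先砍掉不需要的列，chunksize 再把行分成批次，两者可以叠加使用
n_rows = 0
for chunk_df in pd.read_csv('demo2.csv.gz', usecols=['标题', '投诉时间'], chunksize=2):
    n_rows += len(chunk_df)          # 每个 chunk_df 只含选中的两列
print(n_rows)  # 4
```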
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用5112w专利申请数据集构造面板数据</title>
      <link>https://textdata.cn/blog/2024-12-18-how-to-extract-data-from-patent-application-dataset/</link>
      <pubDate>Sat, 08 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-12-18-how-to-extract-data-from-patent-application-dataset/</guid>
      <description>使用5112w专利申请数据生成面板数据</description>
      <content:encoded><![CDATA[<h2 id="相关代码">相关代码</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/">代码 | 使用jjrb/rmrb数据构造某类概念词频「面板数据」</a></li>
<li><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/">代码 | 使用cctv新闻联播文稿构造面板数据</a></li>
</ul>
<p><br><br></p>
<h2 id="一任务">一、任务</h2>
<p>设计筛选条件，将某类专利(如<strong>人工智能</strong>)申请信息， 按 <strong>省份、年度、专利申请数</strong> 构造面板数据。如下图</p>
<p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二专利数据集">二、专利数据集</h2>
<p><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/">数据集 | 5112万条专利申请数据集(1985-2025年)</a></p>
<br>
<h3 id="21-概况">2.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 数据集名称：中国专利申请数据集
- 时间跨度：1985.1-2025.1，专利申请总量5112万
- 数据来源：『国家知识产权局』
- 数据体积:  解压后整个文件夹大概 90 G+
</code></pre></div><p><img loading="lazy" src="img/screen-datasets.png" alt=""  />
</p>
<br>
<h3 id="22-获取数据">2.2 获取数据</h3>
<ul>
<li>免费下载 <a href="%E4%B8%93%E5%88%A9%E9%9D%A2%E6%9D%BF%E6%95%B0%E6%8D%AE.ipynb">专利面板数据.ipynb</a></li>
<li>免费下载 <a href="AI_panel.xlsx">AI_panel.xlsx</a></li>
<li>免费下载 <a href="AI_details.csv.gz">AI_details.csv</a></li>
</ul>
<br>
<p><a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/">5112万条专利申请数据集(1985-2025.1年)</a></p>
<p><br><br></p>
<h2 id="三实验代码">三、实验代码</h2>
<p>本实验代码文件目录结构</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">  |- 专利面板数据.ipynb
  |- Word2Vec
     |-1000w专利摘要文本.100.6.bin
     |-1000w专利摘要文本.100.6.bin.syn1neg.npy
     |-1000w专利摘要文本.100.6.bin.wv.vectors.npy
  |- 5112万专利申请全量数据1985-2025年
     |-广东省.csv.gz
     |-...
     |-西藏自治区.csv.gz
  |-AI_details.xlsx
  |-AI_panel.xlsx
</code></pre></div><br>
<h3 id="31-人工智能相关词">3.1 人工智能相关词</h3>
<p>使用之前 <a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/"><strong>词向量(付费) | 使用1985年-2025年专利申请摘要训练Word2Vec模型</strong></a>  来扩展「人工智能」相关关键词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext==2.1.7</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#查看版本</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>

<span class="n">w2v_m</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;Word2Vec/1000w专利摘要文本.100.6.bin&#39;</span><span class="p">)</span>
<span class="n">w2v_m</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.7
Loading word2vec model...
&lt;gensim.models.word2vec.Word2Vec at 0x109a8c810&gt;
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#先查看与「人工智能」最相似的30个词</span>
<span class="n">w2v_m</span><span class="o">.</span><span class="n">wv</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;人工智能&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;AI&#39;, 0.8372030854225159),
 (&#39;人工智能技术&#39;, 0.7714870572090149),
 (&#39;AI智能&#39;, 0.74532151222229),
 (&#39;智能决策&#39;, 0.7404459714889526),
 (&#39;AI人工智能&#39;, 0.7198485732078552),
 (&#39;云计算&#39;, 0.7136917114257812),
 (&#39;人工智能学习&#39;, 0.7058480381965637),
 (&#39;深度学习&#39;, 0.6903414130210876),
 (&#39;交互式&#39;, 0.6859808564186096),
 (&#39;智慧校园&#39;, 0.6856474876403809),
 (&#39;信息技术&#39;, 0.6841551661491394),
 (&#39;智慧养老&#39;, 0.682081937789917),
 (&#39;智慧旅游&#39;, 0.6777652502059937),
 (&#39;智慧医疗&#39;, 0.6757360100746155),
 (&#39;智能机器人&#39;, 0.6742302179336548),
 (&#39;智慧&#39;, 0.6734717488288879),
 (&#39;人工智能语音&#39;, 0.6727728247642517),
 (&#39;物联网&#39;, 0.66999351978302),
 (&#39;机器学习&#39;, 0.6683002710342407),
 (&#39;健康管理&#39;, 0.6656192541122437),
 (&#39;人工智能AI&#39;, 0.6648072600364685),
 (&#39;AI视觉&#39;, 0.6609936356544495),
 (&#39;智慧社区&#39;, 0.6581154465675354),
 (&#39;自主学习&#39;, 0.6569625735282898),
 (&#39;图像识别&#39;, 0.6551436185836792),
 (&#39;健康管理系统&#39;, 0.6537778377532959),
 (&#39;数据分析系统&#39;, 0.6528143882751465),
 (&#39;教学系统&#39;, 0.6516135334968567),
 (&#39;图形化编程&#39;, 0.6513208150863647),
 (&#39;计算机技术&#39;, 0.6512178182601929)]
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">w2v_m</span><span class="o">.</span><span class="n">wv</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;人工智能&#39;</span><span class="p">,</span> <span class="s1">&#39;机器学习&#39;</span><span class="p">,</span> <span class="s1">&#39;AI&#39;</span><span class="p">,</span> <span class="s1">&#39;NLP&#39;</span><span class="p">,</span> <span class="s1">&#39;智能机器人&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;人工智能技术&#39;, 0.8236023783683777),
 (&#39;人工智能学习&#39;, 0.7996466159820557),
 (&#39;自然语言理解&#39;, 0.7942413687705994),
 (&#39;深度学习&#39;, 0.7931050658226013),
 (&#39;智能决策&#39;, 0.7848177552223206),
 (&#39;上下文感知&#39;, 0.7765907049179077),
 (&#39;自然语言处理&#39;, 0.7757146954536438),
 (&#39;智能问答&#39;, 0.7602421641349792),
 (&#39;自主学习&#39;, 0.7582942247390747),
 (&#39;问答系统&#39;, 0.7564904093742371),
 (&#39;在线学习&#39;, 0.7510443329811096),
 (&#39;人工智能算法&#39;, 0.7500166296958923),
 (&#39;数据挖掘&#39;, 0.7495553493499756),
 (&#39;AI算法&#39;, 0.7419456839561462),
 (&#39;自我学习&#39;, 0.7414599061012268),
 (&#39;AI模型&#39;, 0.7412964105606079),
 (&#39;人工智能AI&#39;, 0.7401654720306396),
 (&#39;知识推理&#39;, 0.7398316860198975),
 (&#39;语音语义&#39;, 0.7393308877944946),
 (&#39;行为识别&#39;, 0.7342970967292786),
 (&#39;人工智能语音&#39;, 0.7332825660705566),
 (&#39;多任务&#39;, 0.7270201444625854),
 (&#39;神经机器翻译&#39;, 0.7220420837402344),
 (&#39;边云协同&#39;, 0.7219405174255371),
 (&#39;图形化编程&#39;, 0.7205625772476196),
 (&#39;云计算&#39;, 0.7199273109436035),
 (&#39;众包&#39;, 0.7197409272193909),
 (&#39;AI智能&#39;, 0.7154985666275024),
 (&#39;NLU&#39;, 0.7152286767959595),
 (&#39;AI人工智能&#39;, 0.7139929533004761)]
</code></pre></div><br>
<p>通过多次查询相似词，不断筛选浓缩，得到人工智能相关的技术词(不一定全，仅作演示)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">AI_rela_words</span> <span class="o">=</span> <span class="s1">&#39;人工智能|机器学习|AI|NLP|智能问答|神经机器翻译|NLU|增量学习&#39;</span>
</code></pre></div><br>
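<p>有了关键词之后，「筛选 + 按省份年度汇总」的面板构造过程可以用下面的最小示意勾勒（df 为虚构的几条专利记录，字段名沿用数据集中的 专利名称、申请人省份、申请年份；实际中应对每个省份文件分批执行同样的筛选与汇总）：</p>

```python
import pandas as pd

AI_rela_words = '人工智能|机器学习|AI|NLP|智能问答|神经机器翻译|NLU|增量学习'

# 虚构的几条专利记录，字段名与数据集一致
df = pd.DataFrame({
    '专利名称': ['一种人工智能识别方法', '一种农机装置',
              '基于机器学习的预测系统', '一种NLP文本处理方法'],
    '申请人省份': ['广东省', '广东省', '内蒙古自治区', '广东省'],
    '申请年份': [2020, 2020, 2021, 2021]})

# 专利名称命中任一关键词，即视为 AI 相关专利
ai_df = df[df['专利名称'].str.contains(AI_rela_words, regex=True, na=False)]

# 按 省份、年度 统计专利申请数，得到面板数据
panel = (ai_df.groupby(['申请人省份', '申请年份'])
              .size()
              .reset_index(name='专利申请数'))
print(panel)
```

<p>把同样的逻辑套在每个省份的 csv.gz 上（配合前文的 chunksize 分批读取），再 concat 各省结果，即可得到全国的面板数据。</p>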
<h3 id="32-读取专利数据">3.2 Reading the Patent Data</h3>
<p>Start by reading a single file.
Write code from the small to the large: once an experiment succeeds on one small file, a for loop can extend it to all files. Here we pick <em><strong>内蒙古自治区.csv.gz</strong></em> (Inner Mongolia).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;5112万专利申请全量数据1985-2025年/内蒙古自治区.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
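<p>The provincial files can be large. If memory is tight, pandas can stream a gzip-compressed CSV in chunks instead of loading it whole. A self-contained sketch (the demo file name and its rows are made up for illustration):</p>

```python
import os
import pandas as pd

# Build a tiny gzip CSV so the example runs on its own
demo = pd.DataFrame({'专利名称': ['专利A', '专利B', '专利C'],
                     '申请日': ['2020-01-01', '2021-05-02', '2022-03-03']})
demo.to_csv('demo_province.csv.gz', index=False, compression='gzip')

# chunksize yields DataFrames of at most 2 rows each, bounding memory use
total = 0
for chunk in pd.read_csv('demo_province.csv.gz', compression='gzip', chunksize=2):
    total += len(chunk)
print(total)  # → 3

os.remove('demo_province.csv.gz')  # clean up the demo file
```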
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># columns available in the data</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;专利名称&#39;, &#39;专利类型&#39;, &#39;申请人&#39;, &#39;申请人类型&#39;, &#39;申请人地址&#39;, &#39;申请人国家&#39;, &#39;申请人省份&#39;, &#39;申请人城市&#39;,
       &#39;申请人区县&#39;, &#39;申请号&#39;, &#39;申请日&#39;, &#39;申请年份&#39;, &#39;公开公告号&#39;, &#39;公开公告日&#39;, &#39;公开公告年份&#39;, &#39;授权公告号&#39;,
       &#39;授权公告日&#39;, &#39;授权公告年份&#39;, &#39;IPC分类号&#39;, &#39;IPC主分类号&#39;, &#39;发明人&#39;, &#39;摘要文本&#39;, &#39;主权项内容&#39;, &#39;当前权利人&#39;,
       &#39;当前专利权人地址&#39;, &#39;专利权人类型&#39;, &#39;统一社会信用代码&#39;, &#39;引证次数&#39;, &#39;被引证次数&#39;, &#39;自引次数&#39;, &#39;他引次数&#39;,
       &#39;被自引次数&#39;, &#39;被他引次数&#39;, &#39;家族引证次数&#39;, &#39;家族被引证次数&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><br>
<h3 id="33-筛选专利">3.3 Filtering the Patents</h3>
<p>Use boolean conditions to select the application records whose <em><strong>专利名称</strong></em> (title), <em><strong>摘要文本</strong></em> (abstract), and <em><strong>主权项内容</strong></em> (claims) contain the AI-related keywords. Note that how strict the filter is can be tuned to your needs; here we use the strictest condition: a patent is identified as AI-related only when the keywords appear in the title, the abstract, and the claims at the same time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">AI_rela_words</span> <span class="o">=</span> <span class="s1">&#39;人工智能|机器学习|AI|NLP|智能问答|神经机器翻译|NLU|增量学习&#39;</span>

<span class="n">mask1</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
<span class="n">mask2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
<span class="n">mask3</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;主权项内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>

<span class="c1"># too many columns; show only the fields we need</span>
<span class="n">selected_fields</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">,</span> <span class="s1">&#39;主权项内容&#39;</span><span class="p">,</span> <span class="s1">&#39;申请日&#39;</span><span class="p">,</span> <span class="s1">&#39;IPC分类号&#39;</span><span class="p">]</span>
<span class="c1"># AI-related patents</span>
<span class="n">ai_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">mask1</span> <span class="o">&amp;</span> <span class="n">mask2</span> <span class="o">&amp;</span> <span class="n">mask3</span><span class="p">][</span><span class="n">selected_fields</span><span class="p">]</span>
<span class="n">ai_df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<p>The filtered results are essentially all AI-related patents.</p>
<br>
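<p>The three masks can be combined with <code>&amp;</code> (keywords must appear in all fields, the strict rule used here) or with <code>|</code> (keywords in any field, a looser rule). A toy example with invented rows showing the difference:</p>

```python
import pandas as pd

# Hypothetical sample mimicking 专利名称 / 摘要文本 / 主权项内容
df = pd.DataFrame({
    '专利名称': ['人工智能芯片', '搅拌装置', '图像处理方法'],
    '摘要文本': ['基于机器学习的方法', '无关文本', '采用机器学习的流程'],
    '主权项内容': ['一种AI芯片', None, '一种装置'],
})
pat = '人工智能|机器学习|AI'

m1 = df['专利名称'].fillna('').str.contains(pat)
m2 = df['摘要文本'].fillna('').str.contains(pat)
m3 = df['主权项内容'].fillna('').str.contains(pat)

strict = int((m1 & m2 & m3).sum())  # keywords in all three fields
loose = int((m1 | m2 | m3).sum())   # keywords in at least one field
print(strict, loose)  # → 1 2
```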
<h3 id="34-年度申请量">3.4 Annual Application Counts</h3>
<p>Compute the annual number of AI-related patent applications in Inner Mongolia. First derive a year field from 申请日 (the application date).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ai_df</span> <span class="o">=</span> <span class="n">ai_df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>  <span class="c1"># avoid SettingWithCopyWarning on the filtered slice</span>
<span class="n">ai_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ai_df</span><span class="p">[</span><span class="s2">&#34;申请日&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>

<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">ai_year_df</span> <span class="ow">in</span> <span class="n">ai_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ai_year_df</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2014 2
2016 2
2017 13
2018 25
2019 38
2020 65
2021 83
2022 123
2023 4
2024 75
</code></pre></div><br>
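<p>The print loop above is equivalent to a groupby size, which returns the yearly counts as a single Series. A minimal sketch on an invented year column:</p>

```python
import pandas as pd

# Hypothetical year values derived from 申请日
ai_df = pd.DataFrame({'year': ['2020', '2020', '2021', '2021', '2021']})

year_counts = ai_df.groupby('year').size()  # rows per year
print(year_counts.to_dict())  # → {'2020': 2, '2021': 3}
```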
<h3 id="36-获取年度各种专利类型的数量">3.6 Annual Counts by Patent Type</h3>
<p>Compute the annual application counts of each patent type for AI-related patents in Inner Mongolia.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">ai_year_df</span> <span class="ow">in</span> <span class="n">ai_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">):</span>
    <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;实用新型&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;实用新型&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明公开&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明公开&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;外观设计&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;外观设计&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明授权&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明授权&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;内蒙古自治区&#39;</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;年度&#39;: &#39;2014&#39;, &#39;实用新型&#39;: 0, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 1, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2016&#39;, &#39;实用新型&#39;: 0, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 1, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2017&#39;, &#39;实用新型&#39;: 1, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 3, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2018&#39;, &#39;实用新型&#39;: 1, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 6, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2019&#39;, &#39;实用新型&#39;: 5, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 9, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2020&#39;, &#39;实用新型&#39;: 12, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 14, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2021&#39;, &#39;实用新型&#39;: 7, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 11, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2022&#39;, &#39;实用新型&#39;: 14, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 14, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2023&#39;, &#39;实用新型&#39;: 0, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 0, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
{&#39;年度&#39;: &#39;2024&#39;, &#39;实用新型&#39;: 2, &#39;发明公开&#39;: 0, &#39;外观设计&#39;: 0, &#39;发明授权&#39;: 1, &#39;省份&#39;: &#39;内蒙古自治区&#39;}
</code></pre></div><br>
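<p>The per-type counting in the loop can also be done in one call with pd.crosstab, which builds the year-by-type table directly. A sketch on invented rows:</p>

```python
import pandas as pd

# Hypothetical sample with the same two columns as the real data
ai_df = pd.DataFrame({
    'year': ['2020', '2020', '2021', '2021'],
    '专利类型': ['实用新型', '发明授权', '发明授权', '发明授权'],
})

# rows: year, columns: patent type, cells: counts
table = pd.crosstab(ai_df['year'], ai_df['专利类型'])
print(table.loc['2021', '发明授权'])  # → 2
```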
<h3 id="37-路径列表">3.7 Listing File Paths</h3>
<p>Use the glob library to list the paths of all <em><strong>csv.gz</strong></em> files inside the patent dataset folder.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">glob</span>

<span class="c1"># exclude Hong Kong, Macau, Taiwan and overseas files</span>
<span class="n">not_in</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;5112万专利申请全量数据1985-2025年/台湾省.csv.gz&#39;</span><span class="p">,</span>
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/澳门特别行政区.csv.gz&#39;</span><span class="p">,</span> 
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/香港特别行政区.csv.gz&#39;</span><span class="p">,</span> 
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/其他国家.csv.gz&#39;</span><span class="p">]</span>

<span class="n">files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;5112万专利申请全量数据1985-2025年/*.csv.gz&#39;</span><span class="p">)</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">files</span> <span class="k">if</span> <span class="n">f</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">not_in</span><span class="p">]</span>
<span class="n">files</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;5112万专利申请全量数据1985-2025年/北京市.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/广西壮族自治区.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/河北省.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/海南省.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/天津市.csv.gz&#39;,
 ......
 &#39;5112万专利申请全量数据1985-2025年/新疆维吾尔自治区.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/辽宁省.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/河南省.csv.gz&#39;,
 &#39;5112万专利申请全量数据1985-2025年/宁夏回族自治区.csv.gz&#39;]
</code></pre></div><br>
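<p>The batch code below extracts the province name with file.split('/'), which assumes POSIX path separators; os.path.basename handles both '/' and the Windows backslash. A small sketch:</p>

```python
import os

file = '5112万专利申请全量数据1985-2025年/内蒙古自治区.csv.gz'
# basename drops the directory part regardless of separator style
prov = os.path.basename(file).replace('.csv.gz', '')
print(prov)  # → 内蒙古自治区
```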
<h3 id="38-批量运算">3.8 Batch Processing</h3>
<p>Now run the same steps for every province: the detailed records of the filtered AI patents are appended to <em><strong>AI_details.csv</strong></em>, while the panel data (year, province, patent counts) are aggregated into <em><strong>AI_panel.xlsx</strong></em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="n">AI_rela_words</span> <span class="o">=</span> <span class="s1">&#39;人工智能|机器学习|AI|NLP|智能问答|神经机器翻译|NLU|增量学习&#39;</span>
<span class="n">AI_Relatives_Patents</span> <span class="o">=</span> <span class="p">[]</span>


<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="n">prov</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;.csv.gz&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>

    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> 
                     <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                     <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">,</span> <span class="s1">&#39;主权项内容&#39;</span><span class="p">,</span> <span class="s1">&#39;申请日&#39;</span><span class="p">,</span> <span class="s1">&#39;专利类型&#39;</span><span class="p">]</span>
                    <span class="p">)</span>
    
    <span class="n">mask1</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
    <span class="n">mask2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
    <span class="n">mask3</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;主权项内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>

    <span class="n">ai_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">mask1</span> <span class="o">&amp;</span> <span class="n">mask2</span> <span class="o">&amp;</span> <span class="n">mask3</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>  <span class="c1"># copy to avoid SettingWithCopyWarning</span>
    <span class="n">ai_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ai_df</span><span class="p">[</span><span class="s2">&#34;申请日&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
    
    <span class="c1"># append this province&#39;s AI patent details to the nationwide file</span>
    <span class="n">ai_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;AI_details.csv&#39;</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">ai_year_df</span> <span class="ow">in</span> <span class="n">ai_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">):</span>
        <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;实用新型&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;实用新型&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明公开&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明公开&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;外观设计&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;外观设计&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明授权&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明授权&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">prov</span>
        <span class="n">AI_Relatives_Patents</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>


<span class="n">ai_panel_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">AI_Relatives_Patents</span><span class="p">)</span>
<span class="n">ai_panel_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;AI_panel.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ai_panel_df</span><span class="p">))</span>
<span class="n">ai_panel_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5112万专利申请全量数据1985-2025年/北京市.csv.gz
5112万专利申请全量数据1985-2025年/广西壮族自治区.csv.gz
......
5112万专利申请全量数据1985-2025年/河南省.csv.gz
5112万专利申请全量数据1985-2025年/宁夏回族自治区.csv.gz

记录数: 523
CPU times: user 15min 38s, sys: 46.4 s, total: 16min 24s
Wall time: 16min 27s
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
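<p>One caveat of appending with mode='a': to_csv writes the header row on every call, which is why section 3.9 has to clean the file afterwards. Writing the header only when the file does not yet exist avoids this. A self-contained sketch (the file name is illustrative):</p>

```python
import os
import pandas as pd

path = 'AI_details_demo.csv'
if os.path.exists(path):
    os.remove(path)  # start fresh for the demo

# header is written only on the first append
for ai_df in (pd.DataFrame({'专利名称': ['a']}), pd.DataFrame({'专利名称': ['b']})):
    ai_df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

out = pd.read_csv(path)
print(len(out))  # → 2 data rows, single header

os.remove(path)  # clean up the demo file
```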
<h3 id="39-剔除重复">3.9 Removing Duplicates</h3>
<p>AI_details.csv contains some duplicated content (each append also re-writes the header row), so we drop the duplicates, delete the old file, and export a clean deduplicated file.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>

<span class="n">AI_detail_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;AI_details.csv&#39;</span><span class="p">)</span>
<span class="n">AI_detail_df</span> <span class="o">=</span> <span class="n">AI_detail_df</span><span class="p">[</span><span class="n">AI_detail_df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">!=</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span>  <span class="c1"># drop header rows re-written by each append</span>
<span class="n">AI_detail_df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="n">os</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="s2">&#34;AI_details.csv&#34;</span><span class="p">)</span>
<span class="n">AI_detail_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s2">&#34;AI_details.xlsx&#34;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
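<p>drop_duplicates with no arguments removes a row only when every column matches. If the data has a unique key, deduplicating on that key alone is stricter; 申请号 is assumed to be such a key here, which may not hold for your extract. A sketch with invented rows:</p>

```python
import pandas as pd

# Invented rows: CN1 appears twice with slightly different titles
df = pd.DataFrame({'申请号': ['CN1', 'CN1', 'CN2'],
                   '专利名称': ['标题甲', '标题甲(修改)', '标题乙']})

deduped = df.drop_duplicates(subset=['申请号'])  # keep first row per 申请号
print(len(deduped))  # → 2
```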
<h2 id="四汇总代码">4. Complete Code</h2>
<p>All of the code in this post is collected here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="nn">tqdm</span>

<span class="n">AI_rela_words</span> <span class="o">=</span> <span class="s1">&#39;人工智能|机器学习|AI|NLP|智能问答|神经机器翻译|NLU|增量学习&#39;</span>

<span class="c1"># exclude Hong Kong, Macau, Taiwan and overseas files</span>
<span class="n">not_in</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;5112万专利申请全量数据1985-2025年/台湾省.csv.gz&#39;</span><span class="p">,</span>
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/澳门特别行政区.csv.gz&#39;</span><span class="p">,</span> 
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/香港特别行政区.csv.gz&#39;</span><span class="p">,</span> 
          <span class="s1">&#39;5112万专利申请全量数据1985-2025年/其他国家.csv.gz&#39;</span><span class="p">]</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;5112万专利申请全量数据1985-2025年/*.csv.gz&#39;</span><span class="p">)</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">files</span> <span class="k">if</span> <span class="n">f</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">not_in</span><span class="p">]</span>

<span class="n">AI_Relatives_Patents</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">files</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="n">prov</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;.csv.gz&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
    
    <span class="c1"># filter AI patents</span>
    <span class="n">mask1</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
    <span class="n">mask2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;摘要文本&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
    <span class="n">mask3</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;主权项内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">AI_rela_words</span><span class="p">)</span>
    <span class="n">ai_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">mask1</span> <span class="o">&amp;</span> <span class="n">mask2</span> <span class="o">&amp;</span> <span class="n">mask3</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>  <span class="c1"># copy to avoid SettingWithCopyWarning</span>
    
    <span class="c1"># append this province&#39;s AI patent details to the nationwide file</span>
    <span class="n">ai_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;AI_details.csv&#39;</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
    
    <span class="n">ai_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ai_df</span><span class="p">[</span><span class="s2">&#34;申请日&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
    <span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">ai_year_df</span> <span class="ow">in</span> <span class="n">ai_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">):</span>
        <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;实用新型&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;实用新型&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明公开&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明公开&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;外观设计&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;外观设计&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;发明授权&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">ai_year_df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;发明授权&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">prov</span>
        <span class="n">AI_Relatives_Patents</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    

<span class="n">china_ai_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">AI_Relatives_Patents</span><span class="p">)</span>
<span class="n">china_ai_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;AI_panel.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>


<span class="n">AI_detail_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;AI_details.csv&#39;</span><span class="p">)</span>
<span class="n">AI_detail_df</span> <span class="o">=</span> <span class="n">AI_detail_df</span><span class="p">[</span><span class="n">AI_detail_df</span><span class="p">[</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span><span class="o">!=</span><span class="s1">&#39;专利名称&#39;</span><span class="p">]</span>  <span class="c1"># drop header rows re-written by each append</span>
<span class="n">AI_detail_df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">os</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="s2">&#34;AI_details.csv&#34;</span><span class="p">)</span>
<span class="n">AI_detail_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s2">&#34;AI_details.xlsx&#34;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><br>
<br>
<h2 id="cntext使用声明">cntext Usage Statement</h2>
<p>If you use cntext in research or a project, please describe it in the text and include a citation. For the format, see the <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">recommended cntext citation format</a>.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Dataset | 51.12 Million Chinese Patent Application Records (1985-2025)</title>
      <link>https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/</link>
      <pubDate>Fri, 07 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/</guid>
      <description>51.12 million Chinese patent application records (1985-2025)</description>
      <content:encoded><![CDATA[<h2 id="一数据介绍">1. Dataset Introduction</h2>
<h3 id="11-概况">1.1 Overview</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset name: China Patent Application Dataset
Time span:    1985.1 - 2025.1
Records:      51.12 million (5112万)
Source:       China National Intellectual Property Administration (国家知识产权局)
Size:         roughly 90 GB+ for the whole folder after decompression
Statement:    research use only; for questions, add WeChat 372335839 with the note "Name-University-Major"
</code></pre></div><p><img loading="lazy" src="img/01-screen-datasets.png" alt=""  />
</p>
<br>
<h3 id="12-字段">1.2 Fields</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - 专利名称
 - 专利类型
 - 申请人
 - 申请人类型
 - 申请人地址
 - 申请人国家
 - 申请人省份
 - 申请人城市
 - 申请人区县
 - 申请号
 - 申请日
 - 申请年份
 - 公开公告号
 - 公开公告日
 - 公开公告年份
 - 授权公告号
 - 授权公告日
 - 授权公告年份
 - IPC分类号
 - IPC主分类号
 - 发明人
 - 摘要文本
 - 主权项内容
 - 当前权利人
 - 当前专利权人地址
 - 专利权人类型
 - 统一社会信用代码
 - 引证次数
 - 被引证次数
 - 自引次数
 - 他引次数
 - 被自引次数
 - 被他引次数
 - 家族引证次数
 - 家族被引证次数
</code></pre></div><br>
<h3 id="13-声明">1.3 Statement</h3>
<p>For research use only. For questions, add WeChat 372335839 with the note "Name-University-Major".</p>
<p><br><br></p>
<h2 id="二实验代码">2. Code Walkthrough</h2>
<h3 id="21-读取全库文件">2.1 Reading the Full Database</h3>
<p>The full database <strong>中国专利数据库.csv.gz</strong> is <strong>25.59 GB</strong> compressed and over <em><strong>90 GB</strong></em> uncompressed. I happen to have a server with <em><strong>256 GB</strong></em> of RAM, so it can be loaded in one go.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">mega_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span>  <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="c1">#mega_df = pd.read_feather(&#39;中国专利数据库.feather&#39;)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 56min 29s,
Wall time: 1h 2min 47s
</code></pre></div><br>
<h3 id="22-读取技巧">2.2 Reading Tips</h3>
<p>An ordinary computer, however, has roughly 16-32 GB of RAM, so we need a few pandas tricks to read such a large file; see <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>Code | How to process a huge CSV far beyond memory with Python</strong></a>. Since the goal here is only the two charts below (applications by province and by year), we can read just the required columns to reduce memory usage.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span> 
                 <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                 <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">,</span> <span class="s1">&#39;申请日&#39;</span><span class="p">])</span>

<span class="n">memory_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">/</span><span class="mi">1024</span><span class="o">/</span><span class="mi">1024</span><span class="o">/</span><span class="mi">1024</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;两字段占用内存: </span><span class="si">{</span><span class="n">memory_size</span><span class="si">}</span><span class="s1">G&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 6min 29s, sys: 13 s, total: 6min 42s
Wall time: 6min 44s

两字段占用内存: 6G
</code></pre></div><p><img loading="lazy" src="img/02-alldf.png" alt=""  />
</p>
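<p>For reference, the chunked-reading route from the linked article can be sketched as follows. This is a self-contained toy: it writes a tiny synthetic <code>demo_patents.csv.gz</code> (an assumed stand-in for the real 中国专利数据库.csv.gz, with the same two columns) and aggregates counts chunk by chunk, so memory never holds the full file.</p>

```python
import gzip

import pandas as pd

# Synthetic stand-in for 中国专利数据库.csv.gz (assumption: same two columns)
rows = "申请人省份,申请日\n河北省,2001-03-05\n北京市,2010-07-01\n河北省,2015-01-20\n"
with gzip.open("demo_patents.csv.gz", "wt", encoding="utf-8") as f:
    f.write(rows)

# Read in chunks so only `chunksize` rows are in memory at a time,
# accumulating per-province counts across chunks
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv("demo_patents.csv.gz", compression="gzip", chunksize=2):
    counts = counts.add(chunk["申请人省份"].value_counts(), fill_value=0)

print(counts.astype(int).sort_values(ascending=False))
```

<p>On the real file, a larger <code>chunksize</code> (e.g. a few million rows) trades memory for fewer iterations.</p>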
<br>
<h3 id="23-读取省份文件">2.3 Reading a Province File</h3>
<p>Some of the per-province CSV files are large: <strong>河北省.csv.gz</strong> is 528 MB and decompresses to a 2 GB <strong>河北省.csv</strong>. Reading the <em><strong>.csv.gz</strong></em> directly is recommended, which speeds up loading. Also avoid running other software (Word/PPT/Excel/WPS) during analysis.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;河北省.csv.gz&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="c1">#df = pd.read_csv(&#39;河北省.csv&#39;, encoding=&#39;utf-8&#39;, low_memory=False)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;河北省申请量: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">河北省申请量: 1048890
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="24-覆盖日期">2.4 Date Coverage</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期:  1985-04-01 ~ 2025-01-22
</code></pre></div><br>
<h3 id="25-字段">2.5 Columns</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;专利名称&#39;, &#39;专利类型&#39;, &#39;申请人&#39;, &#39;申请人类型&#39;, &#39;申请人地址&#39;, &#39;申请人国家&#39;, &#39;申请人省份&#39;, &#39;申请人城市&#39;,
       &#39;申请人区县&#39;, &#39;申请号&#39;, &#39;申请日&#39;, &#39;申请年份&#39;, &#39;公开公告号&#39;, &#39;公开公告日&#39;, &#39;公开公告年份&#39;, &#39;授权公告号&#39;,
       &#39;授权公告日&#39;, &#39;授权公告年份&#39;, &#39;IPC分类号&#39;, &#39;IPC主分类号&#39;, &#39;发明人&#39;, &#39;摘要文本&#39;, &#39;主权项内容&#39;, &#39;当前权利人&#39;,
       &#39;当前专利权人地址&#39;, &#39;专利权人类型&#39;, &#39;统一社会信用代码&#39;, &#39;引证次数&#39;, &#39;被引证次数&#39;, &#39;自引次数&#39;, &#39;他引次数&#39;,
       &#39;被自引次数&#39;, &#39;被他引次数&#39;, &#39;家族引证次数&#39;, &#39;家族被引证次数&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利类型&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">专利类型
实用新型    480822
发明申请    319386
外观设计    166560
发明授权     82122
Name: count, dtype: int64
</code></pre></div><p><br><br></p>
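<p>Beyond raw counts, <code>value_counts(normalize=True)</code> returns shares, which makes the patent-type mix easier to compare across provinces of different sizes. A minimal sketch on synthetic data (not the real dataset):</p>

```python
import pandas as pd

# Synthetic stand-in for the 专利类型 column
demo = pd.DataFrame({"专利类型": ["实用新型", "实用新型", "发明申请", "外观设计"]})

# Share of each patent type instead of raw counts
shares = demo["专利类型"].value_counts(normalize=True)
print(shares)  # 实用新型 0.50, 发明申请 0.25, 外观设计 0.25
```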
<h2 id="三字段详情">3. Field Details</h2>
<h3 id="31-字段缺失程度">3.1 Missing Values</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">platform</span>

<span class="c1"># Pick a CJK-capable font based on the operating system</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;font.family&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;SimHei&#39;</span> <span class="k">if</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span> <span class="k">else</span> <span class="s1">&#39;Arial Unicode MS&#39;</span> <span class="k">if</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span> <span class="k">else</span> <span class="s1">&#39;sans-serif&#39;</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;axes.unicode_minus&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">False</span>  <span class="c1"># render the minus sign correctly</span>

<span class="c1"># Draw the missing-value matrix</span>
<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

<span class="c1"># Show the chart</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/04-nan.png" alt=""  />
</p>
<p>Notice the visible striping in the matrix: the more white space a column shows, the larger the share of missing values in that field.</p>
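<p>To put numbers on the missingness, <code>isna().mean()</code> gives the fraction of missing values per column. A toy sketch with made-up gaps, mimicking sparsely filled fields such as 授权公告日:</p>

```python
import pandas as pd

# Toy frame: 授权公告日 is only filled for granted patents
demo = pd.DataFrame({
    "专利名称": ["a", "b", "c", "d"],
    "授权公告日": ["2001-01-01", None, None, "2003-05-06"],
})

# Fraction of missing values per column, worst first
na_ratio = demo.isna().mean().sort_values(ascending=False)
print(na_ratio)  # 授权公告日 0.5, 专利名称 0.0
```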
<br>
<h3 id="32-专利类型">3.2 Patent Type</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[&#39;专利类型&#39;].unique()
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([&#39;发明申请&#39;, &#39;实用新型&#39;, &#39;发明授权&#39;, &#39;外观设计&#39;], dtype=object)
</code></pre></div><br>
<h3 id="32-发明人">3.3 Inventors</h3>
<p>The inventor is usually a natural person, though in rare cases it can be a legal entity. A patent can have multiple inventors, separated by <code>; </code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[&#39;发明人&#39;]
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0                            柴秀芬
1                            秋海滨
2                            葛长虹
3                       孙盛典; 申富德
4                       马建维; 邢建军
                   ...          
1048885             何治东; 刘应梁; 王琴
1048886                 邢维鹏; 王春良
1048887                 张正东; 张宏彬
1048888    张岩鑫; 韩静; 王宏建; 麻俐; 马东泽
1048889        储焰平; 薛志涛; 袁晓峰; 黄博
Name: 发明人, Length: 1048890, dtype: object
</code></pre></div><br>
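<p>Since inventors are joined with <code>; </code>, the number of inventors per patent (a common team-size variable) can be obtained by splitting the string. A minimal sketch on a few synthetic rows:</p>

```python
import pandas as pd

# Synthetic 发明人 values; the real data separates names with '; '
inventors = pd.Series(["柴秀芬", "孙盛典; 申富德", "张岩鑫; 韩静; 王宏建"])

# Team size = number of '; '-separated names
n_inventors = inventors.str.split("; ").str.len()
print(n_inventors.tolist())  # [1, 2, 3]
```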
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># In all of Hebei, only these 2 records list a company as the inventor</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;发明人&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;公司&#39;</span><span class="p">)][[</span><span class="s1">&#39;申请号&#39;</span><span class="p">,</span> <span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;申请人类型&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">,</span> <span class="s1">&#39;发明人&#39;</span><span class="p">,</span> <span class="s1">&#39;申请人&#39;</span><span class="p">]]</span>
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="33-申请人">3.4 Applicants</h3>
<p>Note that the applicant can be a natural person, a legal entity, or several of either, again separated by <code>; </code>. We first look at the <code>申请人</code> column directly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请人&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0                       柴秀芬
1                       秋海滨
2                       葛长虹
3                  孙盛典; 申富德
4                  马建维; 邢建军
                 ...       
1048885    昆明浩淼水利水电工程检测有限公司
1048886      海南紫程众投生物科技有限公司
1048887     泸西县宏达农业发展有限责任公司
1048888       辽宁省德明环境检测有限公司
1048889        中冶南方工程技术有限公司
Name: 申请人, Length: 1048890, dtype: object
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># In Hebei, the 申请人类型 field is mostly 企业 (company)</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请人&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;公司&#39;</span><span class="p">)][[</span><span class="s1">&#39;申请号&#39;</span><span class="p">,</span> <span class="s1">&#39;专利名称&#39;</span><span class="p">,</span> <span class="s1">&#39;申请人类型&#39;</span><span class="p">,</span> <span class="s1">&#39;摘要文本&#39;</span><span class="p">,</span> <span class="s1">&#39;发明人&#39;</span><span class="p">,</span> <span class="s1">&#39;申请人&#39;</span><span class="p">]]</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
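<p>For per-applicant statistics (e.g. counting patents per company), it helps to reshape to one row per applicant with <code>str.split</code> plus <code>explode</code>. A sketch with made-up records:</p>

```python
import pandas as pd

# Made-up records: one joint application, one single-company application
demo = pd.DataFrame({
    "申请号": ["CN001", "CN002"],
    "申请人": ["孙盛典; 申富德", "中冶南方工程技术有限公司"],
})

# One row per applicant; 申请号 is repeated for joint applications
long_df = demo.assign(申请人=demo["申请人"].str.split("; ")).explode("申请人")
print(long_df)
```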
<br>
<h3 id="34-ipc分类号">3.5 IPC Classification</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[[&#39;专利名称&#39;, &#39;主权项内容&#39;, &#39;IPC主分类号&#39;, &#39;IPC分类号&#39;]]
</code></pre></div><p><img loading="lazy" src="img/07-df.png" alt=""  />
</p>
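<p>The leading letter of the IPC主分类号 is the IPC section (A through H), so a coarse technology-field breakdown only needs the first character. A toy sketch with invented codes:</p>

```python
import pandas as pd

# Invented IPC main-classification codes for illustration
ipc = pd.Series(["G06F17/30", "A61K38/00", "G06N3/08"])

# First character = IPC section (A Human necessities ... H Electricity)
sections = ipc.str[0].value_counts()
print(sections)  # G 2, A 1
```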
<p><br><br></p>
<h2 id="四可视化">4. Visualization</h2>
<h3 id="41-省份">4.1 By Province</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">])</span>
<span class="n">prov_volumes_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;申请数量&#39;</span><span class="p">)</span>
<span class="n">prov_volumes_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;申请数量&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>


<span class="c1"># Convert to a categorical to keep the sorted order</span>
<span class="n">prov_volumes_df</span><span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Categorical</span><span class="p">(</span>
    <span class="n">prov_volumes_df</span><span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="n">prov_volumes_df</span><span class="p">[</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span>
    <span class="n">ordered</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">prov_volumes_df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;申请人省份&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;申请数量&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;省份专利申请量(1985-2025)&#39;</span><span class="p">,</span>
         <span class="n">x</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span>
         <span class="n">y</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;申请数量&#39;</span><span class="p">),</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>
               <span class="n">size</span><span class="o">=</span><span class="mf">4.5</span><span class="p">,</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">annotate</span><span class="p">(</span>
        <span class="s1">&#39;text&#39;</span><span class="p">,</span>
        <span class="n">x</span><span class="o">=</span> <span class="s1">&#39;新疆维吾尔自治区&#39;</span><span class="p">,</span>
        <span class="n">y</span><span class="o">=</span> <span class="n">prov_volumes_df</span><span class="p">[</span><span class="s1">&#39;申请数量&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">*</span> <span class="mf">1.05</span><span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s1">&#39;公众号: 大邓和他的Python&#39;</span><span class="p">,</span>
        <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span>
        <span class="n">va</span><span class="o">=</span><span class="s1">&#39;top&#39;</span><span class="p">,</span>
        <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">axis_text_x</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
          <span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 8min 19s,
Wall time: 8min 44s
</code></pre></div><p><img loading="lazy" src="img/08-prov.png" alt=""  />
</p>
<br>
<h3 id="42-年份">4.2 By Year</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>


<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;中国专利数据库.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">])</span>

<span class="n">date_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span>
        <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;申请日&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;.0&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)))</span>
    <span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">)</span>
    <span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;申请日&#39;</span><span class="p">:</span> <span class="s1">&#39;year&#39;</span><span class="p">})</span>
<span class="p">)</span>

<span class="n">date_df</span> <span class="o">=</span> <span class="n">date_df</span><span class="p">[</span><span class="n">date_df</span><span class="o">.</span><span class="n">year</span><span class="o">!=</span><span class="s1">&#39;nan&#39;</span><span class="p">]</span>
<span class="n">date_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1"># Convert to a categorical to keep the sorted order</span>
<span class="n">date_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Categorical</span><span class="p">(</span>
    <span class="n">date_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="n">date_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span>
    <span class="n">ordered</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>



<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">date_df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;专利年度申请量(1985-2025.1)&#39;</span><span class="p">,</span>
         <span class="n">x</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span>
         <span class="n">y</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">),</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>
               <span class="n">size</span><span class="o">=</span><span class="mf">4.1</span><span class="p">,</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">annotate</span><span class="p">(</span>
        <span class="s1">&#39;text&#39;</span><span class="p">,</span>
        <span class="n">x</span><span class="o">=</span> <span class="n">date_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">(),</span>
        <span class="n">y</span><span class="o">=</span> <span class="n">date_df</span><span class="p">[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">*</span> <span class="mf">1.05</span><span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s1">&#39;公众号: 大邓和他的Python&#39;</span><span class="p">,</span>
        <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span>
        <span class="n">va</span><span class="o">=</span><span class="s1">&#39;top&#39;</span><span class="p">,</span>
        <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
           <span class="n">axis_text_x</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="c1"># 显式指定坐标轴顺序（可选但保险）</span>
    <span class="o">+</span> <span class="n">scale_x_discrete</span><span class="p">(</span><span class="n">limits</span><span class="o">=</span><span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1985</span><span class="p">,</span> <span class="mi">2026</span><span class="p">)])</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 7min 44s,
Wall time: 8min 10s
</code></pre></div><p><img loading="lazy" src="img/09-date.png" alt=""  />
</p>
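<p>上图绘制所用的 date_df(含 year、count 两列)可由原始专利记录按申请年份聚合得到。下面是一个最小示意草稿(其中字段名 app_date 为假设, 请以实际数据集字段为准):</p>

```python
import pandas as pd

# 示意数据: 假设每条专利记录含申请日期字段 app_date(字段名为假设)
df = pd.DataFrame({'app_date': ['1985-03-01', '1985-07-15', '1986-01-02']})
df['app_date'] = pd.to_datetime(df['app_date'])

# 按申请年份聚合, 得到绘图所需的 date_df(year, count)
date_df = (df.groupby(df['app_date'].dt.year.rename('year'))
             .size()
             .reset_index(name='count'))
print(date_df)
```

<p>聚合后的 date_df 即可直接传入上文的 ggplot 绘图代码。</p>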
<p><br><br></p>
<h2 id="五相关文献">五、相关文献</h2>
<p>使用专利数据做研究的文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Bellstam, Gustaf, Sanjai Bhagat, and J. Anthony Cookson. &#34;A text-based analysis of corporate innovation.&#34; _Management Science_ 67, no. 7 (2021): 4004-4031.
[2]Arts, Sam, Bruno Cassiman, and Jianan Hou. &#34;Position and Differentiation of Firms in Technology Space.&#34; Management Science (2023).
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 上市公司招聘数据(2014~2023)</title>
      <link>https://textdata.cn/blog/2025-03-06-china-recruitment-dataset-of-listed-companies/</link>
      <pubDate>Thu, 06 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-06-china-recruitment-dataset-of-listed-companies/</guid>
      <description>&lt;h2 id=&#34;一上市公司招聘数据集&#34;&gt;一、上市公司招聘数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-概况&#34;&gt;1.1 概况&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名:  上市公司招聘数据集(2014~2023)
数据来源:  招聘网站(如智联招聘、Boss直聘等)
记录数量:  6933415
覆盖日期:  2014-01-07 ~ 2023-12-31
数据格式:  csv
数据体积:  7.37 G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-字段介绍&#34;&gt;1.2 字段介绍&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- company    企业名称
- listed_rel 与上市公司关系
- stkcd      关联股票代码
- job        招聘岗位
- city       工作城市
- area       工作区域
- min_sal    最低月薪
- max_sal    最高月薪
- desc       职位描述
- edu        学历要求
- exp        经验要求
- hires      招聘人数
- category   招聘类别
- class      招聘分级
- loc        公司地点
- work_loc   工作地点
- post_date  发布招聘日期
- close_date 结束招聘日期
- source     招聘发布的平台
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;13-说明&#34;&gt;1.3 说明&lt;/h3&gt;
&lt;p&gt;科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;上市公司招聘大数据2014-2023年.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#或  解压得到csv再读取&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;上市公司招聘大数据2014-2023年.csv&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;记录条数:&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;记录条数: 6933415
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-覆盖日期&#34;&gt;2.2 覆盖日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;覆盖日期: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;~&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;覆盖日期:  2014-01-07 ~ 2023-12-31
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;23-字段缺失程度&#34;&gt;2.3 字段缺失程度&lt;/h3&gt;
&lt;p&gt;使用 &lt;em&gt;&lt;strong&gt;missingno库&lt;/strong&gt;&lt;/em&gt; 可视化数据集的字段缺失程度，&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;missingno&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ms&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ms&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-missingno.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;可以看到 &lt;em&gt;&lt;strong&gt;class&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;loc&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;work_loc&lt;/strong&gt;&lt;/em&gt;  这几个字段缺失较多， 而其余字段缺失程度很轻。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h3 id=&#34;24-数据源&#34;&gt;2.4 数据源&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;source&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;source&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;98
array([&amp;#39;猎聘网&amp;#39;, &amp;#39;百姓网&amp;#39;, &amp;#39;boss直聘&amp;#39;, &amp;#39;找工易&amp;#39;, &amp;#39;拉勾网&amp;#39;, &amp;#39;首都人才网&amp;#39;, &amp;#39;智联招聘&amp;#39;, &amp;#39;58同城&amp;#39;,
       &amp;#39;0577HR&amp;#39;, &amp;#39;猎聘&amp;#39;, &amp;#39;赶集网&amp;#39;, &amp;#39;看准网&amp;#39;, &amp;#39;BOSS直聘&amp;#39;, &amp;#39;北极星招聘&amp;#39;, &amp;#39;OFweek人才网&amp;#39;,
       &amp;#39;职友集&amp;#39;, &amp;#39;前程无忧&amp;#39;, &amp;#39;照明专业人才网&amp;#39;, &amp;#39;智通人才网&amp;#39;, &amp;#39;一览英才网&amp;#39;, &amp;#39;全才招聘网&amp;#39;, &amp;#39;博才网&amp;#39;, &amp;#39;无忧招聘&amp;#39;,
       &amp;#39;斗米兼职&amp;#39;, &amp;#39;云南招聘网&amp;#39;, &amp;#39;建筑英才网&amp;#39;, &amp;#39;597人才网&amp;#39;, &amp;#39;齐鲁人才网&amp;#39;, &amp;#39;大街网&amp;#39;, &amp;#39;百城招聘网&amp;#39;,
       &amp;#39;仟寻移动招聘&amp;#39;, &amp;#39;脉脉&amp;#39;, &amp;#39;领航印刷人才网&amp;#39;, &amp;#39;中国人才热线&amp;#39;, &amp;#39;香草招聘&amp;#39;, &amp;#39;厦门人才网&amp;#39;, &amp;#39;中华英才网&amp;#39;,
       &amp;#39;招才网&amp;#39;, &amp;#39;中国金融人才网&amp;#39;, &amp;#39;中国船舶人才网&amp;#39;, &amp;#39;中国石油人才网&amp;#39;, &amp;#39;智联卓聘&amp;#39;, &amp;#39;珠江人才网&amp;#39;, &amp;#39;中国食品人才网&amp;#39;,
       &amp;#39;中国汽车人才网&amp;#39;, &amp;#39;九州英才网&amp;#39;, &amp;#39;大众人才网&amp;#39;, &amp;#39;荆楚人才网&amp;#39;, &amp;#39;湖南人事人才网&amp;#39;, &amp;#39;普工招聘网&amp;#39;, &amp;#39;力聘网&amp;#39;,
       &amp;#39;纺织行业人才网&amp;#39;, &amp;#39;汇博人才网&amp;#39;, &amp;#39;斗米&amp;#39;, &amp;#39;51招聘英才网&amp;#39;, &amp;#39;康强医疗人才网&amp;#39;, &amp;#39;中国药业人才网&amp;#39;, &amp;#39;应届生&amp;#39;,
       &amp;#39;华北人才网&amp;#39;, &amp;#39;大上海人才&amp;#39;, &amp;#39;钱江人才网&amp;#39;, &amp;#39;必高环保人才网&amp;#39;, &amp;#39;线缆招聘网&amp;#39;, &amp;#39;数字英才网&amp;#39;,
       &amp;#39;CFW中国服装人才网&amp;#39;, &amp;#39;华西人才网&amp;#39;, &amp;#39;潇湘人才网&amp;#39;, &amp;#39;医疗专业人才网&amp;#39;, &amp;#39;中国服装人才网&amp;#39;, &amp;#39;国际人才网&amp;#39;,
       &amp;#39;通信人才网&amp;#39;, &amp;#39;台州人力网&amp;#39;, &amp;#39;燕赵人才网&amp;#39;, &amp;#39;约才网&amp;#39;, &amp;#39;俊才招聘网&amp;#39;, &amp;#39;建筑专业人才网&amp;#39;, &amp;#39;台州招聘网&amp;#39;,
       &amp;#39;钢结构招聘网&amp;#39;, &amp;#39;闽江人才网&amp;#39;, &amp;#39;智聪人才网&amp;#39;, &amp;#39;广西人才网&amp;#39;, &amp;#39;桂冠人才网&amp;#39;, &amp;#39;南宁招聘网&amp;#39;, &amp;#39;今日招聘&amp;#39;,
       &amp;#39;汽车人招聘网&amp;#39;, &amp;#39;求职直通车网&amp;#39;, &amp;#39;食品人才网&amp;#39;, &amp;#39;最佳东方&amp;#39;, &amp;#39;扬子人才网&amp;#39;, &amp;#39;汽车人招聘&amp;#39;, &amp;#39;联英人才网&amp;#39;,
       &amp;#39;销售人才网&amp;#39;, &amp;#39;天南地北人才网&amp;#39;, &amp;#39;中州人才网&amp;#39;, &amp;#39;江淮人才网&amp;#39;, &amp;#39;中国美容招聘网&amp;#39;, &amp;#39;关中人才网&amp;#39;, &amp;#39;兼职猫&amp;#39;],
      dtype=object)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关文献&#34;&gt;相关文献&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[1]Gao, Janet, Kenneth J. Merkley, Joseph Pacelli, and Joseph H. Schroeder. &amp;#34;Do internal control weaknesses affect firms’ demand for accounting skills? Evidence from US job postings.&amp;#34; The Accounting Review 98, no. 3 (2023): 203-228.
[2]Campello, Murillo, Gaurav Kankanhalli, and Pradeep Muthukrishnan. &amp;#34;Corporate hiring under Covid-19: Financial constraints and the nature of new jobs.&amp;#34; Journal of Financial and Quantitative Analysis 59, no. 4 (2024): 1541-1585.
[3]Cao, Yi, Shijun Cheng, Jennifer Wu Tucker, and Chi Wan. &amp;#34;Technological peer pressure and skill specificity of job postings.&amp;#34; Contemporary Accounting Research 40, no. 3 (2023): 2106-2139.
[4]马双, 肖翰, 李丁, 张鹏. 最低工资与异质性人力资本需求：基于招聘网站数据的研究[J]. 世界经济, 2023, 46 (12): 92-114.
[5]莫怡青, 李力行. 零工经济对创业的影响——以外卖平台的兴起为例[J]. 管理世界, 2022, 38 (02): 31-45+3.
[6]罗楚亮, 刘盼. 公共就业服务机构匹配效率及其地区差异[J]. 管理世界, 2022, 38 (07): 133-147.
[7]刘毓芸, 程宇玮. 重点产业政策与人才需求——来自企业招聘面试的微观证据[J]. 管理世界, 2020, 36 (06): 65-79+245.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext2.x使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/&#34;&gt;教程 | 使用大模型将文本数据转化为结构化数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/&#34;&gt;爬虫代码 | 使用Python采集黑猫投诉网数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/&#34;&gt;&lt;strong&gt;数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-06-chinese-fresh-graduates-recruitment-dataset/&#34;&gt;&lt;strong&gt;数据集 | 应届生招聘数据集(2014~2024.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一上市公司招聘数据集">一、上市公司招聘数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名:  上市公司招聘数据集(2014~2023)
数据来源:  招聘网站(如智联招聘、Boss直聘等)
记录数量:  6933415
覆盖日期:  2014-01-07 ~ 2023-12-31
数据格式:  csv
数据体积:  7.37 G
</code></pre></div><br>
<h3 id="12-字段介绍">1.2 字段介绍</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- company    企业名称
- listed_rel 与上市公司关系
- stkcd      关联股票代码
- job        招聘岗位
- city       工作城市
- area       工作区域
- min_sal    最低月薪
- max_sal    最高月薪
- desc       职位描述
- edu        学历要求
- exp        经验要求
- hires      招聘人数
- category   招聘类别
- class      招聘分级
- loc        公司地点
- work_loc   工作地点
- post_date  发布招聘日期
- close_date 结束招聘日期
- source     招聘发布的平台
</code></pre></div><br>
<h3 id="13-说明">1.3 说明</h3>
<p>科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;上市公司招聘大数据2014-2023年.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;上市公司招聘大数据2014-2023年.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 6933415
</code></pre></div><p><img loading="lazy" src="img/02-df.jpg" alt=""  />
</p>
<br>
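<p>本数据集体积较大(7.37 G), 一次性读入对内存压力较大。若内存有限, 可用 pandas 的 chunksize 参数分块读取、逐块统计或筛选。下面是一个示意写法(文件名以实际数据文件为准):</p>

```python
import pandas as pd

# 分块读取大体积 csv, 逐块累计行数(也可在循环内做筛选后拼接)
def count_rows(path, chunksize=1_000_000):
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)
    return total

# 用法示意(文件名以实际为准):
# count_rows('上市公司招聘大数据2014-2023年.csv.gz')
```

<p>read_csv 会根据 .gz 后缀自动推断压缩格式, 分块读取同样适用于压缩文件。</p>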
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期:  2014-01-07 ~ 2023-12-31
</code></pre></div><p><br><br></p>
<h3 id="23-字段缺失程度">2.3 字段缺失程度</h3>
<p>使用 <em><strong>missingno库</strong></em> 可视化数据集的字段缺失程度，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-missingno.png" alt=""  />
</p>
<p>可以看到 <em><strong>class</strong></em>、 <em><strong>loc</strong></em>、 <em><strong>work_loc</strong></em>  这几个字段缺失较多， 而其余字段缺失程度很轻。</p>
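<p>除可视化外, 也可以直接计算各字段的缺失比例, 便于在数据说明中量化缺失程度。一个最小示意(数据为构造样例):</p>

```python
import pandas as pd
import numpy as np

# 构造样例: class 字段含缺失值, job 字段完整
df = pd.DataFrame({'class': [np.nan, 'A', np.nan],
                   'job': ['工程师', '会计', '销售']})

# isna().mean() 给出每列缺失值占比, 降序排列便于查看
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

<p>对真实数据集直接调用 df.isna().mean() 即可得到与 missingno 图对应的数值结果。</p>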
<br>
<br>
<h3 id="24-数据源">2.4 数据源</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">source</span><span class="o">.</span><span class="n">nunique</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">source</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">98
array([&#39;猎聘网&#39;, &#39;百姓网&#39;, &#39;boss直聘&#39;, &#39;找工易&#39;, &#39;拉勾网&#39;, &#39;首都人才网&#39;, &#39;智联招聘&#39;, &#39;58同城&#39;,
       &#39;0577HR&#39;, &#39;猎聘&#39;, &#39;赶集网&#39;, &#39;看准网&#39;, &#39;BOSS直聘&#39;, &#39;北极星招聘&#39;, &#39;OFweek人才网&#39;,
       &#39;职友集&#39;, &#39;前程无忧&#39;, &#39;照明专业人才网&#39;, &#39;智通人才网&#39;, &#39;一览英才网&#39;, &#39;全才招聘网&#39;, &#39;博才网&#39;, &#39;无忧招聘&#39;,
       &#39;斗米兼职&#39;, &#39;云南招聘网&#39;, &#39;建筑英才网&#39;, &#39;597人才网&#39;, &#39;齐鲁人才网&#39;, &#39;大街网&#39;, &#39;百城招聘网&#39;,
       &#39;仟寻移动招聘&#39;, &#39;脉脉&#39;, &#39;领航印刷人才网&#39;, &#39;中国人才热线&#39;, &#39;香草招聘&#39;, &#39;厦门人才网&#39;, &#39;中华英才网&#39;,
       &#39;招才网&#39;, &#39;中国金融人才网&#39;, &#39;中国船舶人才网&#39;, &#39;中国石油人才网&#39;, &#39;智联卓聘&#39;, &#39;珠江人才网&#39;, &#39;中国食品人才网&#39;,
       &#39;中国汽车人才网&#39;, &#39;九州英才网&#39;, &#39;大众人才网&#39;, &#39;荆楚人才网&#39;, &#39;湖南人事人才网&#39;, &#39;普工招聘网&#39;, &#39;力聘网&#39;,
       &#39;纺织行业人才网&#39;, &#39;汇博人才网&#39;, &#39;斗米&#39;, &#39;51招聘英才网&#39;, &#39;康强医疗人才网&#39;, &#39;中国药业人才网&#39;, &#39;应届生&#39;,
       &#39;华北人才网&#39;, &#39;大上海人才&#39;, &#39;钱江人才网&#39;, &#39;必高环保人才网&#39;, &#39;线缆招聘网&#39;, &#39;数字英才网&#39;,
       &#39;CFW中国服装人才网&#39;, &#39;华西人才网&#39;, &#39;潇湘人才网&#39;, &#39;医疗专业人才网&#39;, &#39;中国服装人才网&#39;, &#39;国际人才网&#39;,
       &#39;通信人才网&#39;, &#39;台州人力网&#39;, &#39;燕赵人才网&#39;, &#39;约才网&#39;, &#39;俊才招聘网&#39;, &#39;建筑专业人才网&#39;, &#39;台州招聘网&#39;,
       &#39;钢结构招聘网&#39;, &#39;闽江人才网&#39;, &#39;智聪人才网&#39;, &#39;广西人才网&#39;, &#39;桂冠人才网&#39;, &#39;南宁招聘网&#39;, &#39;今日招聘&#39;,
       &#39;汽车人招聘网&#39;, &#39;求职直通车网&#39;, &#39;食品人才网&#39;, &#39;最佳东方&#39;, &#39;扬子人才网&#39;, &#39;汽车人招聘&#39;, &#39;联英人才网&#39;,
       &#39;销售人才网&#39;, &#39;天南地北人才网&#39;, &#39;中州人才网&#39;, &#39;江淮人才网&#39;, &#39;中国美容招聘网&#39;, &#39;关中人才网&#39;, &#39;兼职猫&#39;],
      dtype=object)
</code></pre></div><p><br><br></p>
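<p>注意上面 98 个来源中存在同一平台的不同写法(如 &#39;boss直聘&#39; 与 &#39;BOSS直聘&#39;、&#39;猎聘&#39; 与 &#39;猎聘网&#39;), 按平台统计前可先做简单归一化。下面是一个示意(别名映射表需根据实际数据补全):</p>

```python
import pandas as pd

# 样例: 同一平台的不同写法
df = pd.DataFrame({'source': ['boss直聘', 'BOSS直聘', '猎聘', '猎聘网']})

norm = (df['source']
        .str.upper()                  # 统一英文大小写: boss直聘 -> BOSS直聘
        .replace({'猎聘网': '猎聘'}))  # 合并已知别名(映射表为示意, 需补全)
print(norm.value_counts())
```

<p>归一化后再用 value_counts 统计, 各平台的招聘记录数才不会被写法差异拆分。</p>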
<h2 id="相关文献">相关文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Gao, Janet, Kenneth J. Merkley, Joseph Pacelli, and Joseph H. Schroeder. &#34;Do internal control weaknesses affect firms’ demand for accounting skills? Evidence from US job postings.&#34; The Accounting Review 98, no. 3 (2023): 203-228.
[2]Campello, Murillo, Gaurav Kankanhalli, and Pradeep Muthukrishnan. &#34;Corporate hiring under Covid-19: Financial constraints and the nature of new jobs.&#34; Journal of Financial and Quantitative Analysis 59, no. 4 (2024): 1541-1585.
[3]Cao, Yi, Shijun Cheng, Jennifer Wu Tucker, and Chi Wan. &#34;Technological peer pressure and skill specificity of job postings.&#34; Contemporary Accounting Research 40, no. 3 (2023): 2106-2139.
[4]马双, 肖翰, 李丁, 张鹏. 最低工资与异质性人力资本需求：基于招聘网站数据的研究[J]. 世界经济, 2023, 46 (12): 92-114.
[5]莫怡青, 李力行. 零工经济对创业的影响——以外卖平台的兴起为例[J]. 管理世界, 2022, 38 (02): 31-45+3.
[6]罗楚亮, 刘盼. 公共就业服务机构匹配效率及其地区差异[J]. 管理世界, 2022, 38 (07): 133-147.
[7]刘毓芸, 程宇玮. 重点产业政策与人才需求——来自企业招聘面试的微观证据[J]. 管理世界, 2020, 36 (06): 65-79+245.
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本数据转化为结构化数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/">爬虫代码 | 使用Python采集黑猫投诉网数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/"><strong>数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)</strong></a></li>
<li><a href="https://textdata.cn/blog/2025-03-06-chinese-fresh-graduates-recruitment-dataset/"><strong>数据集 | 应届生招聘数据集(2014~2024.12)</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 应届生招聘数据集(2014~2024.12)</title>
      <link>https://textdata.cn/blog/2025-03-06-chinese-fresh-graduates-recruitment-dataset/</link>
      <pubDate>Thu, 06 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-06-chinese-fresh-graduates-recruitment-dataset/</guid>
      <description>&lt;h2 id=&#34;一应届生招聘数据集&#34;&gt;一、应届生招聘数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-概况&#34;&gt;1.1 概况&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名:  应届生招聘数据集
数据来源:  招聘网站(如智联招聘、Boss直聘等)
记录数量:  6961787
覆盖日期:  2014-01-17 ~ 2024-12-16
数据格式:  csv
数据体积:  8 G

本文声明: 科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-字段介绍&#34;&gt;1.2 字段介绍&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- company    企业名称
- job        招聘岗位
- city       工作城市
- area       工作区域
- min_sal    最低月薪
- max_sal    最高月薪
- desc       职位描述
- edu        学历要求
- exp        经验要求
- hires      招聘人数
- category   招聘类别
- class      招聘分级
- loc        公司地点
- work_loc   工作地点
- post_date  发布招聘日期
- close_date 结束招聘日期
- source     招聘发布的平台
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;13-本文声明&#34;&gt;1.3 本文声明&lt;/h3&gt;
&lt;p&gt;科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;应届生招聘数据2014-2024年.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#或  解压得到csv再读取&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;应届生招聘数据2014-2024年.csv&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;记录条数:&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;记录条数: 6961787
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-覆盖日期&#34;&gt;2.2 覆盖日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;覆盖日期: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;~&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;post_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;覆盖日期:  2014-01-17 ~ 2024-12-16
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;23-字段缺失程度&#34;&gt;2.3 字段缺失程度&lt;/h3&gt;
&lt;p&gt;使用 &lt;em&gt;&lt;strong&gt;missingno库&lt;/strong&gt;&lt;/em&gt; 可视化数据集的字段缺失程度。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;missingno&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ms&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ms&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-missingno.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;可以看到 &lt;em&gt;&lt;strong&gt;class&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;loc&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;work_loc&lt;/strong&gt;&lt;/em&gt;  这几个字段缺失较多， 而其余字段缺失程度很轻。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关文献&#34;&gt;相关文献&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[1]Gao, Janet, Kenneth J. Merkley, Joseph Pacelli, and Joseph H. Schroeder. &amp;#34;Do internal control weaknesses affect firms’ demand for accounting skills? Evidence from US job postings.&amp;#34; The Accounting Review 98, no. 3 (2023): 203-228.
[2]Campello, Murillo, Gaurav Kankanhalli, and Pradeep Muthukrishnan. &amp;#34;Corporate hiring under Covid-19: Financial constraints and the nature of new jobs.&amp;#34; Journal of Financial and Quantitative Analysis 59, no. 4 (2024): 1541-1585.
[3]Cao, Yi, Shijun Cheng, Jennifer Wu Tucker, and Chi Wan. &amp;#34;Technological peer pressure and skill specificity of job postings.&amp;#34; Contemporary Accounting Research 40, no. 3 (2023): 2106-2139.
[4]马双, 肖翰, 李丁, 张鹏. 最低工资与异质性人力资本需求：基于招聘网站数据的研究[J]. 世界经济, 2023, 46 (12): 92-114.
[5]莫怡青, 李力行. 零工经济对创业的影响——以外卖平台的兴起为例[J]. 管理世界, 2022, 38 (02): 31-45+3.
[6]罗楚亮, 刘盼. 公共就业服务机构匹配效率及其地区差异[J]. 管理世界, 2022, 38 (07): 133-147.
[7]刘毓芸, 程宇玮. 重点产业政策与人才需求——来自企业招聘面试的微观证据[J]. 管理世界, 2020, 36 (06): 65-79+245.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext2.x使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/&#34;&gt;教程 | 使用大模型将文本数据转化为结构化数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/&#34;&gt;爬虫代码 | 使用Python采集黑猫投诉网数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/&#34;&gt;&lt;strong&gt;数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-06-china-recruitment-dataset-of-listed-companies/&#34;&gt;&lt;strong&gt;数据集 | 上市公司招聘数据(2014~2023)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一应届生招聘数据集">一、应届生招聘数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名:  应届生招聘数据集
数据来源:  招聘网站(如智联招聘、Boss直聘等)
记录数量:  6961787
覆盖日期:  2014-01-17 ~ 2024-12-16
数据格式:  csv
数据体积:  8 G

本文声明: 科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><br>
<h3 id="12-字段介绍">1.2 字段介绍</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- company    企业名称
- job        招聘岗位
- city       工作城市
- area       工作区域
- min_sal    最低月薪
- max_sal    最高月薪
- desc       职位描述
- edu        学历要求
- exp        经验要求
- hires      招聘人数
- category   招聘类别
- class      招聘分级
- loc        公司地点
- work_loc   工作地点
- post_date  发布招聘日期
- close_date 结束招聘日期
- source     招聘发布的平台
</code></pre></div><br>
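<p>基于上表中的 <em><strong>min_sal</strong></em>、<em><strong>max_sal</strong></em> 可派生岗位平均月薪，再按城市与年份聚合，即可得到城市-年度薪资面板。下面是一个最小示意（用小样例代替真实数据，字段名以上表为准，真实使用时替换为读入的 df 即可）：</p>

```python
import pandas as pd

# 小样例代替真实的招聘数据 df(仅作演示)
sample = pd.DataFrame({
    'city':      ['北京', '北京', '上海'],
    'min_sal':   [8000, 10000, 6000],
    'max_sal':   [12000, 14000, 10000],
    'post_date': ['2023-05-01', '2024-06-01', '2023-07-01'],
})
sample['post_date'] = pd.to_datetime(sample['post_date'])

# 岗位平均月薪 = (最低月薪 + 最高月薪) / 2
sample['avg_sal'] = (sample['min_sal'] + sample['max_sal']) / 2
sample['year'] = sample['post_date'].dt.year

# 城市-年度平均薪资面板
panel = sample.groupby(['city', 'year'])['avg_sal'].mean()
print(panel)
```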
<h3 id="13-本文声明">1.3 本文声明</h3>
<p>科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;应届生招聘数据2014-2024年.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;应届生招聘数据2014-2024年.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 6961787
</code></pre></div><p><img loading="lazy" src="img/02-df.jpg" alt=""  />
</p>
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;post_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期:  2014-01-17 ~ 2024-12-16
</code></pre></div><p><br><br></p>
<h3 id="23-字段缺失程度">2.3 字段缺失程度</h3>
<p>使用 <em><strong>missingno库</strong></em> 可视化数据集的字段缺失程度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-missingno.png" alt=""  />
</p>
<p>可以看到 <em><strong>class</strong></em>、 <em><strong>loc</strong></em>、 <em><strong>work_loc</strong></em>  这几个字段缺失较多， 而其余字段缺失程度很轻。</p>
<p><br><br></p>
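<p>除了 missingno 的可视化，也可以直接用 pandas 计算各字段的精确缺失比例。下面是一个最小示意（用小样例代替真实数据，真实使用时把 sample 换成上文读入的 df 即可）：</p>

```python
import pandas as pd

# 小样例代替真实的招聘数据 df(仅作演示)
sample = pd.DataFrame({
    'company': ['A公司', 'B公司', 'C公司', 'D公司'],
    'class':   [None, '校招', None, None],
    'loc':     ['北京', None, None, '上海'],
    'min_sal': [8000, 6000, 7000, 9000],
})

# 各字段缺失比例(0~1), 按缺失程度降序排列
miss_ratio = sample.isnull().mean().sort_values(ascending=False)
print(miss_ratio)
```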
<h2 id="相关文献">相关文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Gao, Janet, Kenneth J. Merkley, Joseph Pacelli, and Joseph H. Schroeder. &#34;Do internal control weaknesses affect firms’ demand for accounting skills? Evidence from US job postings.&#34; The Accounting Review 98, no. 3 (2023): 203-228.
[2]Campello, Murillo, Gaurav Kankanhalli, and Pradeep Muthukrishnan. &#34;Corporate hiring under Covid-19: Financial constraints and the nature of new jobs.&#34; Journal of Financial and Quantitative Analysis 59, no. 4 (2024): 1541-1585.
[3]Cao, Yi, Shijun Cheng, Jennifer Wu Tucker, and Chi Wan. &#34;Technological peer pressure and skill specificity of job postings.&#34; Contemporary Accounting Research 40, no. 3 (2023): 2106-2139.
[4]马双, 肖翰, 李丁, 张鹏. 最低工资与异质性人力资本需求：基于招聘网站数据的研究[J]. 世界经济, 2023, 46 (12): 92-114.
[5]莫怡青, 李力行. 零工经济对创业的影响——以外卖平台的兴起为例[J]. 管理世界, 2022, 38 (02): 31-45+3.
[6]罗楚亮, 刘盼. 公共就业服务机构匹配效率及其地区差异[J]. 管理世界, 2022, 38 (07): 133-147.
[7]刘毓芸, 程宇玮. 重点产业政策与人才需求——来自企业招聘面试的微观证据[J]. 管理世界, 2020, 36 (06): 65-79+245.
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本数据转化为结构化数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/">爬虫代码 | 使用Python采集黑猫投诉网数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/"><strong>数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)</strong></a></li>
<li><a href="https://textdata.cn/blog/2025-03-06-china-recruitment-dataset-of-listed-companies/"><strong>数据集 | 上市公司招聘数据(2014~2023)</strong></a></li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 消费者金融投诉数据集(2011 ~ 2025.3)</title>
      <link>https://textdata.cn/blog/2025-03-06-consumer-finance-complaints-dataset/</link>
      <pubDate>Thu, 06 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-06-consumer-finance-complaints-dataset/</guid>
      <description>消费者投诉数据集作为一种典型的**另类数据**(如非结构文本数据)，具有多方面重要科研价值，为多学科研究和企业实践提供了新视角与有力支持：1. **丰富另类数据研究**：该数据集为另类数据研究注入新活力。其数据**体量庞大、时效性好、真实性强且颗粒度细**，克服了传统研究依赖小样本数据的局限。通过对消费者投诉数据信息含量和投资价值的挖掘，能从数据类型和应用场景等多维度丰富相关研究文献，推动另类数据在学术领域的深入发展。2. **补充基本面预测研究**：在金融领域，寻找预测基本面的有效指标意义重大。消费者投诉数据集为该研究提供了新方向。以往研究发现消费者投诉对基本面预测有影响，本数据集利用中国数据和更广泛的消费类公司数据进行拓展，并探讨异质性影响，进一步补充了基本面预测影响因素的研究文献。3. **拓展企业口碑研究**：消费者投诉在很大程度上影响企业口碑。以往企业口碑研究多采用小样本实验或问卷调研，缺乏真实世界大数据支持。基于 “黑猫投诉” 平台的千万级别真实数据构建的数据集，能更准确地分析消费者投诉行为，为企业口碑相关研究提供丰富且可靠的数据支撑，拓展该领域研究深度与广度。4. **助力多主体决策研究**：对监管机构而言，可通过分析投诉数据，实现官方与非官方投诉渠道联动，确定监管重点领域，提升监管效能；对金融监管部门，鉴于投诉数据对公司基本面前瞻性预测能力，纳入监测体系有助于防范金融风险，维护金融市场稳定；对上市公司，利用投诉数据能发现经营问题，改进产品和服务，提高消费者满意度与管理水平；对专业投资者，投诉数据可作为投资决策参考，辅助构建投资组合，获取更高收益。这些应用场景为研究不同主体如何利用投诉数据进行科学决策提供了实践依据</description>
      <content:encoded><![CDATA[<h2 id="一消费者金融投诉数据集">一、消费者金融投诉数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集:  消费者金融投诉数据集
数据来源:  https://cfpb.github.io/api/ccdb/
记录数量:  7978798
覆盖日期:  2011-12-01 ~ 2025-03-03
数据格式:  csv
数据体积:  5 G
所含字段:  标题、投诉时间、投诉问题、投诉对象、消费者地址、公司回应等
</code></pre></div><p><img loading="lazy" src="img/01-consumer-complaint-database.png" alt=""  />
</p>
<h3 id="12-字段介绍">1.2 字段介绍</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Date_Received       收到投诉的日期
- Product             投诉的金融产品或服务类型(信用报告、债务催收、抵押贷款等)
- Sub_Product         投诉的子产品(更具体的类别)
- Issue               投诉问题或原因
- Sub_Issue           投诉子问题(进一步详细说明问题)
- Complaint_Narrative 投诉内容(自由格式文本)
- Comp_Public_Resp    公司针对消费者投诉提供的公开回应
- Company             投诉公司名称
- State               消费者居住地
- Zip                 消费者的所在地邮政编码
- Tags                与投诉相关的额外标签或分类
- Consent_Provided    消费者是否同意其投诉信息被收集、处理或公开
- Submitted_Via       投诉渠道(如网络、转介)
- Date_Sent_to_Comp   投诉转交给公司的日期
- Comp_Resp_to_Cons   公司对消费者投诉的回应
- Timely_Resp         公司是否及时回应了投诉
- Disputed            消费者是否对公司的回应提出异议
- Complaint_ID        投诉ID
</code></pre></div><br>
<h3 id="13-获取数据">1.3 获取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">-  https://files.consumerfinance.gov/ccdb/complaints.csv.zip
-  备用链接: https://pan.baidu.com/s/1uvhi-waLwAM8yOPzktBnzQ?pwd=kwng 提取码: kwng
</code></pre></div><p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;consumer_finance_complaints.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;consumer_finance_complaints.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 7978798
</code></pre></div><p><img loading="lazy" src="img/02-df.jpg" alt=""  />
</p>
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Date_Received&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Date_Received&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期:&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;Date_Received&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;Date_Received&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期: 2011-12-01 ~ 2025-03-03
</code></pre></div><p><br><br></p>
<h3 id="23-字段缺失程度">2.3 字段缺失程度</h3>
<p>使用 <em><strong>missingno库</strong></em> 可视化数据集的字段缺失程度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>
<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-missingno.png" alt=""  />
</p>
<p>可以看到 <em><strong>Complaint_Narrative</strong></em>、 <em><strong>Comp_Public_Resp</strong></em>、 <em><strong>Tags</strong></em>、 <em><strong>Consent_Provided</strong></em>、 <em><strong>Disputed</strong></em>  这几个字段缺失较多， 而其余字段缺失程度很轻甚至没有。</p>
<p><br><br></p>
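<p>除缺失可视化外，还可以基于 <em><strong>Timely_Resp</strong></em> 等字段做简单统计，例如计算公司及时回应投诉的比例。下面是一个最小示意（小样例代替真实 df；字段取值假设为 Yes/No，实际取值请以数据集为准）：</p>

```python
import pandas as pd

# 小样例代替真实的 df(仅作演示); Timely_Resp 取值假设为 Yes/No
sample = pd.DataFrame({
    'Company':     ['BankA', 'BankA', 'BankB', 'BankB'],
    'Timely_Resp': ['Yes', 'Yes', 'No', 'Yes'],
})

# 整体及时回应率
overall = (sample['Timely_Resp'] == 'Yes').mean()

# 分公司及时回应率
by_company = (sample['Timely_Resp'] == 'Yes').groupby(sample['Company']).mean()

print(overall)
print(by_company)
```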
<h2 id="三获取数据">三、获取数据</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">-  https://files.consumerfinance.gov/ccdb/complaints.csv.zip
-  备用链接: https://pan.baidu.com/s/1uvhi-waLwAM8yOPzktBnzQ?pwd=kwng 提取码: kwng
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本数据转化为结构化数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/">爬虫代码 | 使用Python采集黑猫投诉网数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/"><strong>数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 纽约时报新闻数据集(2000~2025.3.1)</title>
      <link>https://textdata.cn/blog/2025-03-05-nytimes-news-dataset-from-2000-to-2025/</link>
      <pubDate>Thu, 06 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-05-nytimes-news-dataset-from-2000-to-2025/</guid>
      <description>媒体数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<p>今日分享一个数据集 <a href="https://www.nytimes.com/"><em><strong>纽约时报 nytimes.com</strong></em></a>，该网站在墙内可正常访问。</p>
<p><img loading="lazy" src="img/01-nytimes-cover2025.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="一纽约时报新闻数据集">一、纽约时报新闻数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集:  纽约时报新闻数据集(2000~2025.3.1)
数据来源:  https://www.nytimes.com/
采集方式:  API（https://developer.nytimes.com/apis)
使用语言:  英文
记录数量:  293326
覆盖日期:  2000-01-01 ~ 2025-03-01
数据格式: csv
数据体积: 1.75 G
所含字段: title, pub_date, section, subsection, author, abstract, 
         lead_paragraph, keywords, img_url, web_url

本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><br>
<h3 id="12-数据用途">1.2 数据用途</h3>
<p>可提取丰富的指标，包括但不限于 <strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、 <strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>。此外，可训练词向量，开发新的概念词典。数据带有时间信息，参照上述指标，按主体、日期进行计算即可构造面板数据，构建新的指标指数。因此在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;纽约时报新闻数据集.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;纽约时报新闻数据集.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 2191515
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="22-所含字段">2.2 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - title               标题
 - pub_date            文章发布日期
 - section             栏目(如运动、观点、纽约、世界、美国等)
 - subsection          二级栏目(如运动、观点、纽约、世界、美国等)
 - author              作者 
 - abstract            摘要
 - lead_paragraph      文章导语
 - keywords            关键词
 - img_url             图片链接
 - web_url             文章原文链接
</code></pre></div><br>
<h3 id="23-覆盖日期">2.3 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期:&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span> <span class="p">,</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期: 2000-01-01 05:00:00+00:00 ~ 2025-03-01 00:39:55+00:00
</code></pre></div><p><br><br></p>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-字段缺失情况">3.1 字段缺失情况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-nan.png" alt=""  />
</p>
<p>从上图可以看出 <em><strong>subsection</strong></em>、  <em><strong>img_url</strong></em> 这两个字段存在较为严重的缺失， <em><strong>author</strong></em> 、 <em><strong>keywords</strong></em>、 <em><strong>abstract</strong></em>、<em><strong>lead_paragraph</strong></em> 存在轻微的缺失情况。</p>
<br>
<h3 id="32-按年度统计发文量">3.2 按年度统计发文量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">
<span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 


<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">])</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;pub_date&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2026</span><span class="p">),</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;纽约时报nytimes新闻年度发文量(2000-2025.3.1)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量(条)&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2026</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-vis.png" alt=""  />
</p>
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集(中英) | ChinaDaily新闻数据集(2008 ~ 2024)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家Entrepreneur杂志数据集(1996 ~ 2024)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
</ul>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 1500w&#43;消费者投诉数据集(2018 ~ 2024.8)</title>
      <link>https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/</link>
      <pubDate>Wed, 05 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-05-consumer-complaint-dataset/</guid>
      <description>消费者投诉数据集作为一种典型的另类数据(如非结构文本数据)，具有多方面重要科研价值，为多学科研究和企业实践提供了新视角与有力支持：1. 丰富另类数据研究：该数据集为另类数据研究注入新活力。其数据体量庞大、时效性好、真实性强且颗粒度细，克服了传统研究依赖小样本数据的局限。通过对消费者投诉数据信息含量和投资价值的挖掘，能从数据类型和应用场景等多维度丰富相关研究文献，推动另类数据在学术领域的深入发展。2. 补充基本面预测研究：在金融领域，寻找预测基本面的有效指标意义重大。消费者投诉数据集为该研究提供了新方向。以往研究发现消费者投诉对基本面预测有影响，本数据集利用中国数据和更广泛的消费类公司数据进行拓展，并探讨异质性影响，进一步补充了基本面预测影响因素的研究文献。3. 拓展企业口碑研究：消费者投诉在很大程度上影响企业口碑。以往企业口碑研究多采用小样本实验或问卷调研，缺乏真实世界大数据支持。基于 “黑猫投诉” 平台的千万级别真实数据构建的数据集，能更准确地分析消费者投诉行为，为企业口碑相关研究提供丰富且可靠的数据支撑，拓展该领域研究深度与广度。4. 助力多主体决策研究：对监管机构而言，可通过分析投诉数据，实现官方与非官方投诉渠道联动，确定监管重点领域，提升监管效能；对金融监管部门，鉴于投诉数据对公司基本面前瞻性预测能力，纳入监测体系有助于防范金融风险，维护金融市场稳定；对上市公司，利用投诉数据能发现经营问题，改进产品和服务，提高消费者满意度与管理水平；对专业投资者，投诉数据可作为投资决策参考，辅助构建投资组合，获取更高收益。这些应用场景为研究不同主体如何利用投诉数据进行科学决策提供了实践依据</description>
      <content:encoded><![CDATA[<p><img loading="lazy" src="img/01-tousu.jpg" alt=""  />
</p>
<h2 id="一消费者投诉数据集">一、消费者投诉数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集:  消费者投诉数据集
数据来源:  https://tousu.sina.com.cn/
记录数量:  15314398
覆盖日期:  2018-01-25 ~ 2024-08-26
数据格式:  csv
数据体积:  3.6 G
所含字段:  标题、投诉时间、投诉问题、投诉要求、涉诉金额、投诉对象、使用服务、链接、投诉进度、进度时间
本文声明: 科研用途； 如有问题， 加微信 372335839， 备注「姓名-学校-专业-消费者投诉」
</code></pre></div><br>
<h3 id="12-使用价值">1.2 使用价值</h3>
<p>蔡卫星, 蒲雨琦, 李浩民. 顾客至上：消费者在线投诉的基本面预测能力研究[J]. 管理世界, 2024, 40 (05): 139-154.</p>
<p>消费者投诉数据集作为一种典型的<strong>另类数据</strong>(如非结构文本数据)，具有多方面重要科研价值，为多学科研究和企业实践提供了新视角与有力支持：</p>
<ol>
<li><strong>丰富另类数据研究</strong>：该数据集为另类数据研究注入新活力。其数据<strong>体量庞大、时效性好、真实性强且颗粒度细</strong>，克服了传统研究依赖小样本数据的局限。通过对消费者投诉数据信息含量和投资价值的挖掘，能从数据类型和应用场景等多维度丰富相关研究文献，推动另类数据在学术领域的深入发展。</li>
<li><strong>补充基本面预测研究</strong>：在金融领域，寻找预测基本面的有效指标意义重大。消费者投诉数据集为该研究提供了新方向。以往研究发现消费者投诉对基本面预测有影响，本数据集利用中国数据和更广泛的消费类公司数据进行拓展，并探讨异质性影响，进一步补充了基本面预测影响因素的研究文献。</li>
<li><strong>拓展企业口碑研究</strong>：消费者投诉在很大程度上影响企业口碑。以往企业口碑研究多采用小样本实验或问卷调研，缺乏真实世界大数据支持。基于 “黑猫投诉” 平台的千万级别真实数据构建的数据集，能更准确地分析消费者投诉行为，为企业口碑相关研究提供丰富且可靠的数据支撑，拓展该领域研究深度与广度。</li>
<li><strong>助力多主体决策研究</strong>：对监管机构而言，可通过分析投诉数据，实现官方与非官方投诉渠道联动，确定监管重点领域，提升监管效能；对金融监管部门，鉴于投诉数据对公司基本面前瞻性预测能力，纳入监测体系有助于防范金融风险，维护金融市场稳定；对上市公司，利用投诉数据能发现经营问题，改进产品和服务，提高消费者满意度与管理水平；对专业投资者，投诉数据可作为投资决策参考，辅助构建投资组合，获取更高收益。这些应用场景为研究不同主体如何利用投诉数据进行科学决策提供了实践依据</li>
</ol>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;消费者黑猫投诉数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;消费者黑猫投诉数据.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 15314398
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
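<p>数据体积约 3.6 G，若电脑内存有限，可用 pandas 的 <code>chunksize</code> 参数分块读取、逐块统计。下面用一个内存中的小样例演示这一思路（样例数据与统计口径仅作示意，真实场景把 <code>StringIO</code> 换成实际的 csv 文件路径即可）：</p>

```python
import io
import pandas as pd

# 构造一个内存中的小样例, 仅演示分块读取的思路
raw = io.StringIO(
    '投诉对象,涉诉金额\n'
    '某电商,100\n'
    '某银行,2000\n'
    '某电商,50\n'
    '某航司,888\n'
)

# 真实数据可替换为: pd.read_csv('消费者黑猫投诉数据.csv', chunksize=100_000)
counts = {}
for chunk in pd.read_csv(raw, chunksize=2):
    # 逐块累加各「投诉对象」的投诉量, 避免一次性载入全表
    for company, n in chunk['投诉对象'].value_counts().items():
        counts[company] = counts.get(company, 0) + int(n)

print(counts)
```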
<br>
<h3 id="22-所含字段">2.2 所含字段</h3>
<p><img loading="lazy" src="img/03-field.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - 标题
 - 投诉时间
 - 投诉问题
 - 投诉要求
 - 涉诉金额
 - 投诉对象
 - 使用服务
 - 链接
 - 投诉进度
 - 进度时间(当前投诉进度时对应的时间)
</code></pre></div><br>
<h3 id="23-覆盖日期">2.3 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;进度时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;进度时间&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;投诉时间: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span>  <span class="s1">&#39;~ &#39;</span><span class="p">,</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;进度时间: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;进度时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span>  <span class="s1">&#39;~ &#39;</span><span class="p">,</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;进度时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">投诉时间:  2018-01-25 09:18:00 ~  2024-08-26 09:54:00
进度时间:  2018-01-26 09:18:33 ~  2024-11-12 13:00:21
</code></pre></div><br>
<h3 id="24-筛选企业">2.4 筛选企业</h3>
<p>假设你对「拼多多」感兴趣，想筛选出涉及「拼多多」的所有投诉信息。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">selected_company</span> <span class="o">=</span> <span class="s1">&#39;拼多多&#39;</span>
<span class="n">mask1</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉对象&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">selected_company</span><span class="p">)</span>
<span class="n">mask2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;使用服务&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">selected_company</span><span class="p">)</span>

<span class="n">selected_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">mask1</span> <span class="o">|</span> <span class="n">mask2</span><span class="p">]</span>
<span class="n">selected_df</span>
</code></pre></div><p><img loading="lazy" src="img/04-pinduoduo.jpg" alt=""  />
</p>
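<p>补充一点：<code>str.contains</code> 默认把传入的字符串当作正则表达式处理，若公司名中含有 <code>(</code>、<code>+</code> 等正则特殊字符，可能产生误匹配；做字面匹配时可加 <code>regex=False</code>。下面是一个最小示例（数据为虚构）：</p>

```python
import pandas as pd

df = pd.DataFrame({'投诉对象': ['拼多多旗舰店', '京东(自营)', None]})

selected_company = '京东(自营)'   # 含正则特殊字符 ( )
# regex=False 表示按字面匹配, 避免括号被当作正则分组
mask = df['投诉对象'].fillna('').str.contains(selected_company, regex=False)
print(df[mask])
```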
<br>
<br>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-字段缺失率">3.1 字段缺失率</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="c1"># 根据操作系统设置字体</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;font.family&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;SimHei&#39;</span> <span class="k">if</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span> <span class="k">else</span> <span class="s1">&#39;Arial Unicode MS&#39;</span> <span class="k">if</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span> <span class="k">else</span> <span class="s1">&#39;sans-serif&#39;</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;axes.unicode_minus&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">False</span>  <span class="c1"># 解决负号显示问题</span>

<span class="c1"># 绘制缺失值矩阵图</span>
<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

<span class="c1"># 显示图表</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/06-nan.png" alt=""  />
</p>
<p>只有字段 <em><strong>使用服务</strong></em> 有较为严重的缺失情况，其余字段无明显缺失情况。</p>
<br>
<h3 id="32-年度投诉量">3.2 年度投诉量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 


<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">])</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;投诉时间&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2018</span><span class="p">,</span> <span class="mi">2025</span><span class="p">),</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;黑猫投诉年度投诉量(2018-2024)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;投诉量(条)&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2018</span><span class="p">,</span> <span class="mi">2025</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-vis.png" alt=""  />
</p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本数据转化为结构化数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/">爬虫代码 | 使用Python采集黑猫投诉网数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2025-03-06-consumer-finance-complaints-dataset/"><strong>数据集| 消费者金融投诉数据集(2011 ~ 2025.3)</strong></a></p>
</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集| NOS.nl荷兰新闻数据集(2015~2025.2.28)</title>
      <link>https://textdata.cn/blog/2025-03-05-netherlands-daily-news-dataset-from-2015-to-2025/</link>
      <pubDate>Wed, 05 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-05-netherlands-daily-news-dataset-from-2015-to-2025/</guid>
      <description>媒体数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<p>今日分享一个数据集「<a href="https://nos.nl/">NOS.nl</a>」，该网站需通过科学的方式联网访问。</p>
<p><em><strong>NOS.nl</strong></em> 是荷兰公共广播组织（Nederlandse Omroep Stichting）的官方新闻网站，提供涵盖荷兰本土及全球的综合性新闻报道。该网站以新闻时效性和深度分析为核心，内容涉及政治、经济、体育、文化等多个领域，并通过文字报道、实时更新及多媒体内容满足用户需求。</p>
<p><br><br></p>
<h2 id="一nosnl新闻数据集">一、NOS.nl新闻数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集:  NOS.nl新闻数据集
数据来源:  https://nos.nl/
使用语言: 荷兰文
记录数量:  293326
覆盖日期:  2015-01-01 ~ 2025-02-28
数据格式: csv
数据体积: 886 M
所含字段: channel, url, type, title, keywords, section, description,
       published_time, modified_time, image, content

本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><br>
<h3 id="12-数据用途">1.2 数据用途</h3>
<p>可提取丰富的指标，包括但不限于<strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、<strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>。此外，可训练词向量，开发新的概念词典。数据带时间字段，参照前述指标，按主体、日期进行计算，可构造面板数据，构建新的指标指数。因此在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><p><br><br></p>
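<p>以「媒体关注度」类指标为例，常见做法是统计标题或正文中概念词的提及量，再按月份汇总成时间序列（进而可按主体扩展为面板数据）。下面用几条虚构的荷兰语标题演示这一思路（关键词与数据均为假设）：</p>

```python
import pandas as pd

# 虚构的几条新闻标题, 仅演示思路
news = pd.DataFrame({
    'published_time': ['2024-01-05', '2024-01-20', '2024-02-03'],
    'title': ['economie groeit', 'sport nieuws', 'economie krimpt'],
})
news['published_time'] = pd.to_datetime(news['published_time'])

# 统计标题中是否提及概念词, 按月汇总提及量
keyword = 'economie'
news['mention'] = news['title'].str.contains(keyword).astype(int)
monthly = news.groupby(news['published_time'].dt.to_period('M'))['mention'].sum()
print(monthly)
```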
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;NOL荷兰新闻数据集.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或  解压得到csv再读取</span>
<span class="c1">#df = pd.read_csv(&#39;NOL荷兰新闻数据集.csv&#39;)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录条数:&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录条数: 293326
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="22-所含字段">2.2 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- channel        频道 [两个不同的频道：nos, nieuwsuur]
- url            文章链接[NOS.nl 网站]
- type           文章类型 [2 种类型： article, liveblog]
- title          文章的标题
- keywords       关键词 [例如：moord谋杀，liquidatie暗杀，afrekening清算]
- section        例如：体育sports, 经济economie
- description    描述 [文章内容的简要概述]
- published_time 发布日期 [格式: 2024-10-31 23:00:42]
- modified_time  修改日期 [格式: 2024-10-31 23:00:42]
- image          图片链接
- content        原文html内容
</code></pre></div><br>
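<p><em><strong>content</strong></em> 字段保存的是原文 html 内容，做文本分析前通常需要先去除标签。下面用 Python 标准库 html.parser 写一个最小的去标签函数作演示（实际项目中可换用 BeautifulSoup 等更健壮的工具）：</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """收集 HTML 中的纯文本片段"""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    """去除 HTML 标签, 返回纯文本"""
    parser = TextExtractor()
    parser.feed(html)
    return ''.join(parser.parts).strip()

# 虚构的一段 content 字段内容
content = '<p>Het kabinet <strong>presenteert</strong> de begroting.</p>'
print(html_to_text(content))
# Het kabinet presenteert de begroting.
```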
<h3 id="23-覆盖日期">2.3 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;覆盖日期:&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span> <span class="p">,</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">覆盖日期: 2015-01-01 00:32:52 ~ 2025-02-28 23:34:07
</code></pre></div><br>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-字段缺失情况">3.1 字段缺失情况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-nan.png" alt=""  />
</p>
<p>该数据集只有 <em><strong>keywords</strong></em>、 <em><strong>section</strong></em>、<em><strong>image</strong></em> 存在轻微的字段缺失情况。</p>
<br>
<h3 id="32-按年度统计发文量">3.2 按年度统计发文量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">
<span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 


<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;published_time&#39;</span><span class="p">])</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;published_time&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">2026</span><span class="p">),</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;荷兰NOS.nl新闻年度发文量(2015-2025.2.28)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量(条)&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">2026</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-vis.png" alt=""  />
</p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集(中英) | ChinaDaily新闻数据集(2008 ~ 2024)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家Entrepreneur杂志数据集(1996 ~ 2024)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>爬虫代码 | 使用Python采集黑猫投诉数据</title>
      <link>https://textdata.cn/blog/2025-03-05-scrape-consumer-complaint-data-with-python/</link>
      <pubDate>Tue, 04 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-05-scrape-consumer-complaint-data-with-python/</guid>
      <description>python爬虫, 黑猫投诉tousu.sina.com.cn</description>
<content:encoded><![CDATA[<p>采集到的黑猫投诉数据如下图所示，这样的爬虫代码该如何写？</p>
<p><img loading="lazy" src="img/06-df.jpg" alt=""  />
</p>
<br>
<h2 id="撰写写爬虫的步骤">撰写写爬虫的步骤</h2>
<ol>
<li>寻找网址规律<em><strong>urls</strong></em> ；</li>
<li>任选一个<em><strong>url</strong></em>， 对其发起访问<em><strong>requests</strong></em>，得到响应数据<em><strong>response</strong></em></li>
<li>从响应数据<em><strong>response</strong></em>中提取自己感兴趣的字段<em><strong>field</strong></em>， 将数据字段存入数据文件(如csv、xlsx等)</li>
<li>使用 <em><strong>for循环</strong></em> 重复 <strong>2~3</strong>， 将所有的网址 <em><strong>urls</strong></em>  均依次进行访问、提取、存储，爬虫结束</li>
</ol>
<p>四个步骤中 <em><strong>1. 寻找网址规律</strong></em> 是最难的一步，搞定这一步，后面的就都很简单。本文将详细分享第一步的操作细节，其余步骤一笔带过。</p>
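<p>上述四个步骤的整体流程，可以先用一个极简骨架感受一下(其中 build_url、parse_fields 以及示例网址均为示意用的假想内容，黑猫投诉真实的网址规律与字段见下文)：</p>

```python
def build_url(page):
    # 示意：按网址规律拼出第 page 页的 url(假想地址，并非真实接口)
    return f'https://example.com/api/feed?page={page}'

def parse_fields(record):
    # 示意：从一条响应记录中抽取感兴趣的字段
    return {'title': record.get('title'), 'summary': record.get('summary')}

for page in range(1, 4):          # 步骤4：for循环依次处理每个 url
    url = build_url(page)         # 步骤1：网址规律
    # resp = requests.get(url)    # 步骤2：发起访问(此处仅示意，不真正请求)
    # rows = [parse_fields(r) for r in resp.json()['data']]  # 步骤3：提取字段并存储
    print(page, url)
```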
<p><br><br></p>
<h2 id="一寻找网址规律">一、寻找网址规律</h2>
<h3 id="11-初步分析">1.1 初步分析</h3>
<p>网站的网址规律分为静态和动态两种类型。今天分享的 <strong>黑猫投诉网</strong> 是动态型，即在最热投诉、最新投诉、已回复、已完成等页面间切换时，地址栏的网址始终不变 <a href="https://tousu.sina.com.cn/">https://tousu.sina.com.cn/</a></p>
<p><img loading="lazy" src="img/01-tousu.jpg" alt=""  />
</p>
<br>
<h3 id="12-开发者分析">1.2 开发者分析</h3>
<p>确定 <strong>黑猫投诉网</strong> 的网址规律是动态型后，以<strong>最新投诉</strong>为例，我们需要依次进行如下操作</p>
<ol>
<li>打开 <strong>浏览器开发者工具</strong>，</li>
<li>选择 <strong>Network面板</strong> ，刷新网页或向下滚动页面</li>
<li>注意 <strong>Network面板</strong> 中出现很多网址， 依次查看每个链接， 寻找网址规律的线索</li>
</ol>
<p>下面是大邓的操作截图，最终找到 <strong>黑猫投诉·最新投诉</strong> 对应的网址规律模板 template，template 中的可变参数有：</p>
<ul>
<li><em><strong>page(页面数)</strong></em></li>
<li><em><strong>ts(未知)</strong></em>、 <em><strong>rs(未知)</strong></em>、<em><strong>signature(未知)</strong></em> ；</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">template = f&#39;https://tousu.sina.cn/api/index/feed?ts={ts}&amp;rs={rs}&amp;signature={signature}&amp;type=2&amp;page_size=10&amp;page={page}&#39;
</code></pre></div><p><img loading="lazy" src="img/02-url-pattern.jpg" alt=""  />
</p>
<br>
<h3 id="13-借助网络力量">1.3 借助网络力量</h3>
<p>ts、rs、signature 这三个参数构造难度太大，一般情况下大邓会放弃自己写爬虫，花钱找技术达人接手数据采集任务。</p>
<p>现在做这一步，是希望网络上已有高人(前人)解决了这三个参数的构造，这样咱们就能自己爬到数据(还能省点钱)。</p>
<p>念咒语(嘛哩嘛哩哄)， 借助网络的力量:</p>
<ol>
<li>打开github， 搜 <strong>黑猫投诉</strong>， 语言选择 <strong>Python</strong></li>
<li>依次查看搜索结果，把黑猫投诉相关仓库代码都浏览一遍， 寻找代码中是否出现 <em><strong>ts</strong></em>、 <em><strong>rs</strong></em>、<em><strong>signature</strong></em> 等字眼。</li>
<li>最终在 <code>https://github.com/yeyeye777/toususina/blob/main/demo.py</code> 中找到三参数构造法，且实验后发现算法依然可行。</li>
</ol>
<p><img loading="lazy" src="img/03-github.png" alt=""  />
</p>
<p><img loading="lazy" src="img/04-github-method.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">gen_rs_ts_signature</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">time</span>
    <span class="kn">import</span> <span class="nn">random</span>
    <span class="kn">import</span> <span class="nn">hashlib</span>
    <span class="kn">import</span> <span class="nn">json</span>
    <span class="n">sha256</span><span class="o">=</span><span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">()</span>
    <span class="n">c</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">))</span>   <span class="c1">#13位时间戳</span>
    <span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;0&#34;</span><span class="p">,</span> <span class="s2">&#34;1&#34;</span><span class="p">,</span> <span class="s2">&#34;2&#34;</span><span class="p">,</span> <span class="s2">&#34;3&#34;</span><span class="p">,</span> <span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="s2">&#34;5&#34;</span><span class="p">,</span> <span class="s2">&#34;6&#34;</span><span class="p">,</span> <span class="s2">&#34;7&#34;</span><span class="p">,</span> <span class="s2">&#34;8&#34;</span><span class="p">,</span> <span class="s2">&#34;9&#34;</span><span class="p">,</span> <span class="s2">&#34;a&#34;</span><span class="p">,</span> <span class="s2">&#34;b&#34;</span><span class="p">,</span> <span class="s2">&#34;c&#34;</span><span class="p">,</span> <span class="s2">&#34;d&#34;</span><span class="p">,</span> <span class="s2">&#34;e&#34;</span><span class="p">,</span> <span class="s2">&#34;f&#34;</span><span class="p">,</span> <span class="s2">&#34;g&#34;</span><span class="p">,</span> <span class="s2">&#34;h&#34;</span><span class="p">,</span> <span class="s2">&#34;i&#34;</span><span class="p">,</span> <span class="s2">&#34;j&#34;</span><span class="p">,</span> <span class="s2">&#34;k&#34;</span><span class="p">,</span> <span class="s2">&#34;l&#34;</span><span class="p">,</span> <span class="s2">&#34;m&#34;</span><span class="p">,</span> <span class="s2">&#34;n&#34;</span><span class="p">,</span> <span class="s2">&#34;o&#34;</span><span class="p">,</span> <span class="s2">&#34;p&#34;</span><span class="p">,</span> <span class="s2">&#34;q&#34;</span><span class="p">,</span> <span class="s2">&#34;r&#34;</span><span class="p">,</span> <span class="s2">&#34;s&#34;</span><span class="p">,</span> <span class="s2">&#34;t&#34;</span><span class="p">,</span> <span class="s2">&#34;u&#34;</span><span class="p">,</span> <span class="s2">&#34;v&#34;</span><span class="p">,</span> 
<span class="s2">&#34;w&#34;</span><span class="p">,</span> <span class="s2">&#34;x&#34;</span><span class="p">,</span> <span class="s2">&#34;y&#34;</span><span class="p">,</span> <span class="s2">&#34;z&#34;</span><span class="p">,</span> <span class="s2">&#34;A&#34;</span><span class="p">,</span> <span class="s2">&#34;B&#34;</span><span class="p">,</span> <span class="s2">&#34;C&#34;</span><span class="p">,</span> <span class="s2">&#34;D&#34;</span><span class="p">,</span> <span class="s2">&#34;E&#34;</span><span class="p">,</span> <span class="s2">&#34;F&#34;</span><span class="p">,</span> <span class="s2">&#34;G&#34;</span><span class="p">,</span> <span class="s2">&#34;H&#34;</span><span class="p">,</span> <span class="s2">&#34;I&#34;</span><span class="p">,</span> <span class="s2">&#34;J&#34;</span><span class="p">,</span> <span class="s2">&#34;K&#34;</span><span class="p">,</span> <span class="s2">&#34;L&#34;</span><span class="p">,</span> <span class="s2">&#34;M&#34;</span><span class="p">,</span> <span class="s2">&#34;N&#34;</span><span class="p">,</span> <span class="s2">&#34;O&#34;</span><span class="p">,</span> <span class="s2">&#34;P&#34;</span><span class="p">,</span> <span class="s2">&#34;Q&#34;</span><span class="p">,</span> <span class="s2">&#34;R&#34;</span><span class="p">,</span> <span class="s2">&#34;S&#34;</span><span class="p">,</span> <span class="s2">&#34;T&#34;</span><span class="p">,</span> <span class="s2">&#34;U&#34;</span><span class="p">,</span> <span class="s2">&#34;V&#34;</span><span class="p">,</span> <span class="s2">&#34;W&#34;</span><span class="p">,</span> <span class="s2">&#34;X&#34;</span><span class="p">,</span> <span class="s2">&#34;Y&#34;</span><span class="p">,</span> <span class="s2">&#34;Z&#34;</span><span class="p">]</span>
    <span class="n">h</span> <span class="o">=</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">16</span><span class="p">))</span>   <span class="c1">#随机16个字符</span>
    <span class="n">d</span> <span class="o">=</span><span class="s1">&#39;$d6eb7ff91ee257475%&#39;</span>   <span class="c1">#默认值</span>
    <span class="n">e</span> <span class="o">=</span> <span class="s1">&#39;2&#39;</span>       <span class="c1">#最新信息为2</span>
    <span class="n">u</span> <span class="o">=</span><span class="s1">&#39;10&#39;</span>      <span class="c1">#每页数量</span>
    <span class="n">page</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>   <span class="c1">#页码</span>
    <span class="n">ts</span> <span class="o">=</span> <span class="n">c</span>
    <span class="n">rs</span> <span class="o">=</span> <span class="n">h</span>
    <span class="n">bb</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">,</span><span class="n">u</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">e</span><span class="p">,</span><span class="n">page</span><span class="p">,</span><span class="n">h</span><span class="p">]</span>
    <span class="n">bb</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">signature</span><span class="o">=</span><span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">((</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">bb</span><span class="p">))</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">ts</span><span class="p">,</span><span class="n">rs</span><span class="p">,</span><span class="n">signature</span>



<span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">):</span>
    <span class="n">ts</span><span class="p">,</span><span class="n">rs</span><span class="p">,</span><span class="n">signature</span> <span class="o">=</span> <span class="n">gen_rs_ts_signature</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="n">page</span><span class="p">)</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">&#39;https://tousu.sina.cn/api/index/feed?ts=</span><span class="si">{</span><span class="n">ts</span><span class="si">}</span><span class="s1">&amp;rs=</span><span class="si">{</span><span class="n">rs</span><span class="si">}</span><span class="s1">&amp;signature=</span><span class="si">{</span><span class="n">signature</span><span class="si">}</span><span class="s1">&amp;type=2&amp;page_size=10&amp;page=</span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s1">&#39;</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1 https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=mqMovDIpI7FxhiiW&amp;signature=007812aaec720c655748db472bb15fbaa7b026546ddfa9b520edaffe370d507d&amp;type=2&amp;page_size=10&amp;page=1
2 https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=XjOZVfZiJ9o6NEHH&amp;signature=282f5a6c30d4720b3d653dfa13bf1d6c31a4bb5b7baccc5c0e3b5445d67821bc&amp;type=2&amp;page_size=10&amp;page=2
3 https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=N1eBlKMeQTnLQ26y&amp;signature=b2cad655fb7beac6ca5ae1b6abb72debd0309de6c1e79f21e68b6ae4c6663e8a&amp;type=2&amp;page_size=10&amp;page=3
4 https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=LVKITB2UrpLdIZIH&amp;signature=5de4b51e8fdec631e0de47a2a34e797a99ea77beb4ca21abe91ec8955523dbd8&amp;type=2&amp;page_size=10&amp;page=4
5 https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=FkUmQOXIFOzWrpfu&amp;signature=f571a16f3ac90dae121fa444dde637294358abf34cf97ac56ea0001d48312f19&amp;type=2&amp;page_size=10&amp;page=5
</code></pre></div><p><br><br></p>
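<p>从上面的函数可以看出，signature 的本质是把 d、u、c(ts)、e、page、h(rs) 六个字符串排序后拼接，再取 sha256。下面用固定的 ts/rs 单独演示这一计算过程(与上文函数同一算法，输入取自上文 page=1 的输出，读者可自行对照)：</p>

```python
import hashlib

# 固定输入，复现 signature 的计算过程
d, u, e = '$d6eb7ff91ee257475%', '10', '2'
c = '1741183219941'          # ts(13位毫秒时间戳)
h = 'mqMovDIpI7FxhiiW'       # rs(随机16字符)
page = '1'

# 排序 → 拼接 → sha256 → 十六进制摘要
bb = sorted([d, u, c, e, page, h])
signature = hashlib.sha256(''.join(bb).encode('utf-8')).hexdigest()
print(signature)
```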
<h2 id="二发起访问">二、发起访问</h2>
<p>任选一个<em><strong>url</strong></em>（page=1)， 对其发起访问<em><strong>requests</strong></em>，得到响应数据<em><strong>response</strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>

<span class="c1">#避免网站反爬，使用伪装头</span>
<span class="n">header</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;user-agent&#34;</span><span class="p">:</span> <span class="s2">&#34;Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36&#34;</span><span class="p">}</span>

<span class="c1">#page=1的url</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">&#39;https://tousu.sina.cn/api/index/feed?ts=1741183219941&amp;rs=mqMovDIpI7FxhiiW&amp;signature=007812aaec720c655748db472bb15fbaa7b026546ddfa9b520edaffe370d507d&amp;type=2&amp;page_size=10&amp;page=1&#39;</span>

<span class="c1">#发起访问</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="n">headers</span> <span class="o">=</span> <span class="n">header</span><span class="p">)</span>
<span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;result&#39;: {&#39;status&#39;: {&#39;code&#39;: 0, &#39;msg&#39;: &#39;ok&#39;},
  &#39;timestamp&#39;: &#39;Wed Mar 05 22:03:08 +0800 2025&#39;,
  
  &#39;data&#39;: {&#39;lists&#39;: [
      {&#39;main&#39;: {&#39;id&#39;: &#39;33181177&#39;,
      &#39;sn&#39;: &#39;17380385543&#39;,
      &#39;title&#39;: &#39;中国移动流量异常使用消耗&#39;,
      &#39;couid&#39;: &#39;1991428685&#39;,
      &#39;cotitle&#39;: &#39;中国移动10086&#39;,
      &#39;appeal&#39;: &#39;改善服务,赔偿解释&#39;,
      &#39;issue&#39;: &#39;流量消耗问题&#39;,
      &#39;comment_id&#39;: &#39;tousu_complaint_33181177&#39;,
      &#39;timestamp&#39;: &#39;1740577724&#39;,
      &#39;status&#39;: 6,
      &#39;upvote_amount&#39;: 0,
      &#39;share_amount&#39;: 0,
      &#39;summary&#39;: &#39;我于2024年12月20日在淘宝平台上购买了一张中国移动的流量电话卡，于今日遇到流量消耗异常的问题。我的手机上都有流量消耗的记录，快手是30g的流量消耗。可是中国移动中的套餐定向流量包含着快手，但是只消耗了15个g，剩余的15个g被通用流量消耗。询问客服，客服说他们的定向流量没有问题，然后让他们给出我的通用流量的...&#39;,
      &#39;url&#39;: &#39;//tousu.sina.cn/complaint/view/17380385543/?sld=ec54e136b3aaaae3b5e6ac7a929271ec&#39;,
      &#39;evaluate_u&#39;: None,
      &#39;ext_src&#39;: &#39;0&#39;,
      &#39;field&#39;: &#39;55&#39;,
      &#39;cost&#39;: &#39;50&#39;,
      &#39;tpl&#39;: &#39;0&#39;,
      &#39;comment_amount&#39;: 0,
      &#39;has_jury&#39;: False,
      &#39;is_upvote&#39;: False},
     &#39;author&#39;: {&#39;title&#39;: &#39;机灵喵&#39;,
      &#39;avatar&#39;: &#39;//n.sinaimg.cn/finance/235fa465/20230314/3.png&#39;}},
      ...
      ...
      {&#39;main&#39;: {&#39;id&#39;: &#39;33396892&#39;,
      &#39;sn&#39;: &#39;17380601258&#39;,
      &#39;title&#39;: &#39;驰诚农机专营店微耕机已退货，不退款&#39;,
      &#39;couid&#39;: &#39;6244211375&#39;,
      &#39;cotitle&#39;: &#39;拼多多客户服务&#39;,
      &#39;appeal&#39;: &#39;退货退款,作出处罚,道歉&#39;,
      &#39;issue&#39;: &#39;退货不退款,机器无法正常运行工作,响应时间长,客服态度差&#39;,
      &#39;comment_id&#39;: &#39;tousu_complaint_33396892&#39;,
      &#39;timestamp&#39;: &#39;1741182983&#39;,
      &#39;status&#39;: 4,
      &#39;upvote_amount&#39;: 0,
      &#39;share_amount&#39;: 0,
      &#39;summary&#39;: &#39;我于2月20日在拼多多驰诚农机专营店买了一台微耕机，总价值为2150元，收到货后拼装组装好无法使用后退货退款，根据协商，拼多多开通退货退款窗口，商家收到货后一直不退款，平台客服也不处理，一直让等待，也没有处理方案，推皮球，要求尽快处理&#39;,
      &#39;url&#39;: &#39;//tousu.sina.cn/complaint/view/17380601258/?sld=f0b469a54216a2335b2dc544cf97924f&#39;,
      &#39;evaluate_u&#39;: None,
      &#39;ext_src&#39;: &#39;0&#39;,
      &#39;field&#39;: &#39;6&#39;,
      &#39;cost&#39;: &#39;2150&#39;,
      &#39;tpl&#39;: &#39;3&#39;,
      &#39;comment_amount&#39;: 0,
      &#39;has_jury&#39;: False,
      &#39;is_upvote&#39;: False},
     &#39;author&#39;: {&#39;title&#39;: &#39;机灵喵&#39;,
      &#39;avatar&#39;: &#39;//n.sinaimg.cn/finance/235fa465/20230314/3.png&#39;}}],
   
   
   &#39;pager&#39;: {&#39;current&#39;: 1,
    &#39;next&#39;: 2,
    &#39;page_amount&#39;: 11974,
    &#39;page_size&#39;: 10,
    &#39;item_count&#39;: 119734}}}}
</code></pre></div><p><br><br></p>
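<p>顺带一提，响应中的 <em><strong>timestamp</strong></em> 字段是 10 位 Unix 秒级时间戳(字符串)，若想转成可读日期，可以用 pandas 转换(下面的取值来自上文响应示例)：</p>

```python
import pandas as pd

# 'timestamp' 字段形如 '1741182983'，单位为秒
dt = pd.to_datetime(int('1741182983'), unit='s')
print(dt)  # 2025-03-05 13:56:23
```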
<h2 id="三提取字段存储到csv">三、提取字段&amp;存储到csv</h2>
<p><strong>黑猫投诉</strong>网站返回的响应数据是 <em><strong>json</strong></em> 格式，格式整洁，非常容易进行字段筛选和提取。每个响应会返回一个含 10 条投诉的列表。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;result&#39;</span><span class="p">][</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;lists&#39;</span><span class="p">]))</span>

<span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;result&#39;</span><span class="p">][</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;lists&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">10

[{&#39;main&#39;: {&#39;id&#39;: &#39;33172281&#39;,
   &#39;sn&#39;: &#39;17380376647&#39;,
   &#39;title&#39;: &#39;卖假鞋强买强卖&#39;,
   &#39;couid&#39;: &#39;6020086612&#39;,
   &#39;cotitle&#39;: &#39;抖音&#39;,
   &#39;appeal&#39;: &#39;退货退款,下架产品,道歉&#39;,
   &#39;issue&#39;: &#39;退货不退款,卖假货,客服态度差,强买强卖污蔑消费者&#39;,
   &#39;comment_id&#39;: &#39;tousu_complaint_33172281&#39;,
   &#39;timestamp&#39;: &#39;1740558656&#39;,
   &#39;status&#39;: 6,
   &#39;upvote_amount&#39;: 0,
   &#39;share_amount&#39;: 0,
   &#39;summary&#39;: &#39;我在2025年2月14日在抖音阿宝家小铺购买一双白色运动鞋，2月18日收到货后发现其是抄袭品牌smfk的鞋子，于是在2月18日里面申请了退货退款，但是其收到货后一口咬定我调换了他们家鞋子，并且无法提供完整的收发货视频。&#39;,
   &#39;url&#39;: &#39;//tousu.sina.com.cn/complaint/view/17380376647/?sld=b965241622ff70025f17ad3fed955f76&#39;,
   &#39;evaluate_u&#39;: None,
   &#39;ext_src&#39;: &#39;0&#39;,
   &#39;field&#39;: &#39;37&#39;,
   &#39;cost&#39;: &#39;499&#39;,
   &#39;tpl&#39;: &#39;3&#39;,
   &#39;comment_amount&#39;: 0,
   &#39;has_jury&#39;: False,
   &#39;is_upvote&#39;: False},
  &#39;author&#39;: {&#39;title&#39;: &#39;机灵喵&#39;,
   &#39;avatar&#39;: &#39;//n.sinaimg.cn/finance/235fa465/20230314/3.png&#39;}},
   ......
   ......
   {&#39;main&#39;: {&#39;id&#39;: &#39;33172911&#39;,
   &#39;sn&#39;: &#39;17380377277&#39;,
   &#39;title&#39;: &#39;一天八百个电话太吓人&#39;,
   &#39;couid&#39;: &#39;7894766771&#39;,
   &#39;cotitle&#39;: &#39;众利数字科技&#39;,
   &#39;appeal&#39;: &#39;解释,作出处罚&#39;,
   &#39;issue&#39;: &#39;暴力催收&#39;,
   &#39;comment_id&#39;: &#39;tousu_complaint_33172911&#39;,
   &#39;timestamp&#39;: &#39;1740559661&#39;,
   &#39;status&#39;: 6,
   &#39;upvote_amount&#39;: 0,
   &#39;share_amount&#39;: 0,
   &#39;summary&#39;: &#39;一天打八百个电话，每天每天打， 希望降低催收频率， 合规催收&#39;,
   &#39;url&#39;: &#39;//tousu.sina.com.cn/complaint/view/17380377277/?sld=2a0c1176d0ce70a8654b4a716d05b1ff&#39;,
   &#39;evaluate_u&#39;: None,
   &#39;ext_src&#39;: &#39;0&#39;,
   &#39;field&#39;: &#39;37&#39;,
   &#39;cost&#39;: &#39;8400&#39;,
   &#39;tpl&#39;: &#39;0&#39;,
   &#39;comment_amount&#39;: 0,
   &#39;has_jury&#39;: False,
   &#39;is_upvote&#39;: False},
  &#39;author&#39;: {&#39;title&#39;: &#39;洞察喵&#39;,
   &#39;avatar&#39;: &#39;//n.sinaimg.cn/finance/235fa465/20230314/4.png&#39;}}]
</code></pre></div><br>
<p>假设我们确定要提取的字段为 <em><strong>title</strong></em>、<em><strong>timestamp</strong></em>、<em><strong>summary</strong></em>、<em><strong>cotitle</strong></em>、<em><strong>appeal</strong></em>、<em><strong>issue</strong></em>、<em><strong>url</strong></em>。这里使用 <strong>for循环</strong>，依次将数据保存到 csv 中。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># 确定CSV文件路径</span>
<span class="n">csv_file_path</span> <span class="o">=</span> <span class="s1">&#39;黑猫投诉.csv&#39;</span>

<span class="c1"># 创建一个标志变量，用于判断是否是第一次写入</span>
<span class="n">first_write</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">)</span>

<span class="k">for</span> <span class="n">complaint_card</span> <span class="ow">in</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;result&#39;</span><span class="p">][</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;lists&#39;</span><span class="p">]:</span>
    <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;title&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;title&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;timestamp&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;timestamp&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;summary&#39;</span><span class="p">]</span> <span class="o">=</span><span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;summary&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;cotitle&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;cotitle&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;appeal&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;appeal&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;issue&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;issue&#39;</span><span class="p">]</span>
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;url&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;url&#39;</span><span class="p">]</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">data</span><span class="p">])</span>
    
    <span class="c1"># 写入DataFrame到CSV文件中</span>
    <span class="k">if</span> <span class="n">first_write</span><span class="p">:</span>
        <span class="c1"># 第一次写入时包含表头</span>
        <span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8-sig&#39;</span><span class="p">)</span>
        <span class="n">first_write</span> <span class="o">=</span> <span class="kc">False</span>  <span class="c1"># 修改标志变量为False，以便后续不再写入表头</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># 后续追加数据时不包含表头</span>
        <span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8-sig&#39;</span><span class="p">)</span>
</code></pre></div><br>
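<p>上面的写法每条投诉都会打开一次文件；若只处理一页数据，也可以先把字典收集进列表，最后一次性写入。下面是等价思路的示意(records 用两条模拟数据代替真实响应中的 lists)：</p>

```python
import pandas as pd

# 假设 records 即 response.json()['result']['data']['lists']，这里用两条模拟数据代替
records = [
    {'main': {'title': '投诉A', 'timestamp': '1740558656', 'summary': '…',
              'cotitle': '商家A', 'appeal': '退款', 'issue': '假货', 'url': '//tousu.sina.com.cn/a'}},
    {'main': {'title': '投诉B', 'timestamp': '1740559661', 'summary': '…',
              'cotitle': '商家B', 'appeal': '解释', 'issue': '催收', 'url': '//tousu.sina.com.cn/b'}},
]

fields = ['title', 'timestamp', 'summary', 'cotitle', 'appeal', 'issue', 'url']
rows = [{f: card['main'][f] for f in fields} for card in records]

df = pd.DataFrame(rows)
df.to_csv('黑猫投诉.csv', index=False, encoding='utf-8-sig')  # 一次写入，自带表头
```

<p>若一次要爬很多页，仍建议像正文那样边爬边追加写入，避免中途报错丢失全部数据。</p>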
<p>查看新生成的 <em><strong>黑猫投诉.csv</strong></em>，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pd.read_csv(&#39;黑猫投诉.csv&#39;)
</code></pre></div><p><img loading="lazy" src="img/05-df.jpg" alt=""  />
</p>
<br>
<h2 id="四最终完整代码">四、最终完整代码</h2>
<p>黑猫投诉最多能翻看 50 页，即点击「向下翻看更多」最多 50 次。结果存储于 <em><strong>黑猫投诉.csv</strong></em>。需要设置的参数</p>
<ul>
<li><em><strong>max_page=50</strong></em></li>
<li><em><strong>csv_file_path = &lsquo;黑猫投诉.csv&rsquo;</strong></em></li>
</ul>
<p>本章节汇总步骤一(寻找网址规律)、步骤二(发起访问)、步骤三(提取字段&amp;存储)，得到实践中完整的爬虫代码(<em><strong>tousu-spider.py</strong></em>)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="c1">#避免网站反爬，使用伪装头</span>
<span class="n">header</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;user-agent&#34;</span><span class="p">:</span> <span class="s2">&#34;Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36&#34;</span><span class="p">}</span>

<span class="k">def</span> <span class="nf">gen_rs_ts_signature</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">random</span>
    <span class="kn">import</span> <span class="nn">time</span>
    <span class="kn">import</span> <span class="nn">hashlib</span>
    <span class="kn">import</span> <span class="nn">json</span>
    <span class="n">sha256</span><span class="o">=</span><span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">()</span>
    <span class="n">c</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">))</span>   <span class="c1">#13位时间戳</span>
    <span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;0&#34;</span><span class="p">,</span> <span class="s2">&#34;1&#34;</span><span class="p">,</span> <span class="s2">&#34;2&#34;</span><span class="p">,</span> <span class="s2">&#34;3&#34;</span><span class="p">,</span> <span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="s2">&#34;5&#34;</span><span class="p">,</span> <span class="s2">&#34;6&#34;</span><span class="p">,</span> <span class="s2">&#34;7&#34;</span><span class="p">,</span> <span class="s2">&#34;8&#34;</span><span class="p">,</span> <span class="s2">&#34;9&#34;</span><span class="p">,</span> <span class="s2">&#34;a&#34;</span><span class="p">,</span> <span class="s2">&#34;b&#34;</span><span class="p">,</span> <span class="s2">&#34;c&#34;</span><span class="p">,</span> <span class="s2">&#34;d&#34;</span><span class="p">,</span> <span class="s2">&#34;e&#34;</span><span class="p">,</span> <span class="s2">&#34;f&#34;</span><span class="p">,</span> <span class="s2">&#34;g&#34;</span><span class="p">,</span> <span class="s2">&#34;h&#34;</span><span class="p">,</span> <span class="s2">&#34;i&#34;</span><span class="p">,</span> <span class="s2">&#34;j&#34;</span><span class="p">,</span> <span class="s2">&#34;k&#34;</span><span class="p">,</span> <span class="s2">&#34;l&#34;</span><span class="p">,</span> <span class="s2">&#34;m&#34;</span><span class="p">,</span> <span class="s2">&#34;n&#34;</span><span class="p">,</span> <span class="s2">&#34;o&#34;</span><span class="p">,</span> <span class="s2">&#34;p&#34;</span><span class="p">,</span> <span class="s2">&#34;q&#34;</span><span class="p">,</span> <span class="s2">&#34;r&#34;</span><span class="p">,</span> <span class="s2">&#34;s&#34;</span><span class="p">,</span> <span class="s2">&#34;t&#34;</span><span class="p">,</span> <span class="s2">&#34;u&#34;</span><span class="p">,</span> <span class="s2">&#34;v&#34;</span><span class="p">,</span> 
<span class="s2">&#34;w&#34;</span><span class="p">,</span> <span class="s2">&#34;x&#34;</span><span class="p">,</span> <span class="s2">&#34;y&#34;</span><span class="p">,</span> <span class="s2">&#34;z&#34;</span><span class="p">,</span> <span class="s2">&#34;A&#34;</span><span class="p">,</span> <span class="s2">&#34;B&#34;</span><span class="p">,</span> <span class="s2">&#34;C&#34;</span><span class="p">,</span> <span class="s2">&#34;D&#34;</span><span class="p">,</span> <span class="s2">&#34;E&#34;</span><span class="p">,</span> <span class="s2">&#34;F&#34;</span><span class="p">,</span> <span class="s2">&#34;G&#34;</span><span class="p">,</span> <span class="s2">&#34;H&#34;</span><span class="p">,</span> <span class="s2">&#34;I&#34;</span><span class="p">,</span> <span class="s2">&#34;J&#34;</span><span class="p">,</span> <span class="s2">&#34;K&#34;</span><span class="p">,</span> <span class="s2">&#34;L&#34;</span><span class="p">,</span> <span class="s2">&#34;M&#34;</span><span class="p">,</span> <span class="s2">&#34;N&#34;</span><span class="p">,</span> <span class="s2">&#34;O&#34;</span><span class="p">,</span> <span class="s2">&#34;P&#34;</span><span class="p">,</span> <span class="s2">&#34;Q&#34;</span><span class="p">,</span> <span class="s2">&#34;R&#34;</span><span class="p">,</span> <span class="s2">&#34;S&#34;</span><span class="p">,</span> <span class="s2">&#34;T&#34;</span><span class="p">,</span> <span class="s2">&#34;U&#34;</span><span class="p">,</span> <span class="s2">&#34;V&#34;</span><span class="p">,</span> <span class="s2">&#34;W&#34;</span><span class="p">,</span> <span class="s2">&#34;X&#34;</span><span class="p">,</span> <span class="s2">&#34;Y&#34;</span><span class="p">,</span> <span class="s2">&#34;Z&#34;</span><span class="p">]</span>
    <span class="n">h</span> <span class="o">=</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">16</span><span class="p">))</span>   <span class="c1">#随机16个字符</span>
    <span class="n">d</span> <span class="o">=</span><span class="s1">&#39;$d6eb7ff91ee257475%&#39;</span>   <span class="c1">#默认值</span>
    <span class="n">e</span> <span class="o">=</span> <span class="s1">&#39;2&#39;</span>       <span class="c1">#最新信息为2</span>
    <span class="n">u</span> <span class="o">=</span><span class="s1">&#39;10&#39;</span>      <span class="c1">#每页数量</span>
    <span class="n">page</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>   <span class="c1">#页码</span>
    <span class="n">ts</span> <span class="o">=</span> <span class="n">c</span>
    <span class="n">rs</span> <span class="o">=</span> <span class="n">h</span>
    <span class="n">bb</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">,</span><span class="n">u</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">e</span><span class="p">,</span><span class="n">page</span><span class="p">,</span><span class="n">h</span><span class="p">]</span>
    <span class="n">bb</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">signature</span><span class="o">=</span><span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">((</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">bb</span><span class="p">))</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">ts</span><span class="p">,</span><span class="n">rs</span><span class="p">,</span><span class="n">signature</span>




<span class="c1"># 确定CSV文件路径</span>
<span class="n">csv_file_path</span> <span class="o">=</span> <span class="s1">&#39;黑猫投诉.csv&#39;</span>

<span class="c1">#假设采集1-50页</span>
<span class="n">max_page</span> <span class="o">=</span> <span class="mi">50</span>

<span class="c1">#for循环， 遍历每个url，均进行访问、提取、保存</span>
<span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_page</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="s1">&#39;采集进度&#39;</span><span class="p">):</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">ts</span><span class="p">,</span><span class="n">rs</span><span class="p">,</span><span class="n">signature</span> <span class="o">=</span> <span class="n">gen_rs_ts_signature</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="n">page</span><span class="p">)</span>
    <span class="n">page_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">&#39;https://tousu.sina.cn/api/index/feed?ts=</span><span class="si">{</span><span class="n">ts</span><span class="si">}</span><span class="s1">&amp;rs=</span><span class="si">{</span><span class="n">rs</span><span class="si">}</span><span class="s1">&amp;signature=</span><span class="si">{</span><span class="n">signature</span><span class="si">}</span><span class="s1">&amp;type=2&amp;page_size=10&amp;page=</span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s1">&#39;</span>
    <span class="n">page_resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">page_url</span><span class="p">,</span><span class="n">headers</span> <span class="o">=</span> <span class="n">header</span><span class="p">)</span>
    

    <span class="c1"># 创建一个标志变量，用于判断是否是第一次写入</span>
    <span class="n">first_write</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">complaint_card</span> <span class="ow">in</span> <span class="n">page_resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;result&#39;</span><span class="p">][</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;lists&#39;</span><span class="p">]:</span>
        <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;title&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;title&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;timestamp&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;timestamp&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;summary&#39;</span><span class="p">]</span> <span class="o">=</span><span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;summary&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;cotitle&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;cotitle&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;appeal&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;appeal&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;issue&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;issue&#39;</span><span class="p">]</span>
        <span class="n">data</span><span class="p">[</span><span class="s1">&#39;url&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">complaint_card</span><span class="p">[</span><span class="s1">&#39;main&#39;</span><span class="p">][</span><span class="s1">&#39;url&#39;</span><span class="p">]</span>
        <span class="n">df_</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">data</span><span class="p">])</span>
        
        <span class="c1"># 写入DataFrame到CSV文件中</span>
        <span class="k">if</span> <span class="n">first_write</span><span class="p">:</span>
            <span class="c1"># 第一次写入时包含表头</span>
            <span class="n">df_</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
            <span class="n">first_write</span> <span class="o">=</span> <span class="kc">False</span>  <span class="c1"># 修改标志变量为False，以便后续不再写入表头</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># 后续追加数据时不包含表头</span>
            <span class="n">df_</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">csv_file_path</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
</code></pre></div><br>
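<p>上面的签名逻辑可以抽出来单独验证：把若干参数排序拼接后整体做一次 sha256。下面是一个最小示意（其中 c 假设为秒级时间戳，正文在前文定义；字符表改用 string 模块等价生成）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import time
import random
import string
import hashlib

def gen_rs_ts_signature(page):
    c = str(int(time.time()))                              # ts: 秒级时间戳（假设）
    chars = string.digits + string.ascii_letters           # 等价于正文中的字符列表 a
    h = ''.join(random.choice(chars) for i in range(16))   # rs: 随机16个字符
    d = '$d6eb7ff91ee257475%'                              # 默认值
    bb = [d, '10', c, '2', str(page), h]                   # 参与签名的全部参数
    bb.sort()                                              # 排序后拼接
    signature = hashlib.sha256(''.join(bb).encode('utf-8')).hexdigest()
    return c, h, signature

ts, rs, signature = gen_rs_ts_signature(page=1)
print(len(rs), len(signature))   # sha256 十六进制摘要长度恒为 64
</code></pre></div>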
<p>代码运行结束后，我们查看下最终数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">final_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;黑猫投诉.csv&#39;</span><span class="p">)</span>
<span class="n">final_df</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.jpg" alt=""  />
</p>
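<p>由于脚本以追加模式写入 CSV，重复运行会产生重复行。下面是一个按 url 去重的最小示意（假设 url 可唯一标识一条投诉；文件不存在时用虚构样例数据演示）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import os
import pandas as pd

csv_file_path = '黑猫投诉.csv'
if os.path.exists(csv_file_path):
    final_df = pd.read_csv(csv_file_path)
else:
    # 演示用的虚构样例（字段与正文一致，u1 重复出现一次）
    final_df = pd.DataFrame({'title': ['a', 'b', 'a'],
                             'timestamp': [3, 2, 1],
                             'url': ['u1', 'u2', 'u1']})

# 按 url 去重（保留首次出现的记录），再按时间戳排序
dedup_df = final_df.drop_duplicates(subset='url').sort_values('timestamp')
print(len(dedup_df))
</code></pre></div>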
<br>
<h2 id="下载代码">下载代码</h2>
<p><a href="tousu-spider.py"><strong>点击下载tousu-spider.py</strong></a></p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/">教程 | 使用大模型将文本数据转化为结构化数据</a></li>
<li><a href="https://textdata.cn/blog/2025-03-05-consumer-complaint-dataset/">数据集| 1500w+消费者投诉数据集(2018 ~ 2024.8)</a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 4877w 条全球手机蜂窝基站数据(2006~2024.5)</title>
      <link>https://textdata.cn/blog/2025-03-03-global-cell-towers-dataset/</link>
      <pubDate>Mon, 03 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-03-global-cell-towers-dataset/</guid>
      <description>&lt;h2 id=&#34;一数据集介绍&#34;&gt;一、数据集介绍&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据名称: 全球蜂窝基站数据集
数据格式: csv
记录数量: 48775421
覆盖年度: 2006 ~ 2024
使用方法: 可探索全球基站位置的分布与特征，分析网络覆盖模式，并调查基站更新的时间趋势。
数据来源: OpenCelliD
本文声明: 科研用途； 如有问题，请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;所含字段&#34;&gt;所含字段&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Radio&lt;/strong&gt;：宽带蜂窝网络技术的代际（例如，LTE，GSM）。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCC&lt;/strong&gt;：移动国家代码，是每个国家在移动网络中的唯一标识符。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MNC&lt;/strong&gt;：移动网络代码，在一个国家内识别移动网络。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LAC&lt;/strong&gt;：位置区域代码、跟踪区域代码或网络标识符。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CID&lt;/strong&gt;：每个基站收发器（BTS）或扇区的唯一标识符。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Longitude&lt;/strong&gt;：指定东西方向位置的地理坐标。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latitude&lt;/strong&gt;：指定南北方向位置的地理坐标。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Range&lt;/strong&gt;：大约的覆盖范围，表示基站覆盖延伸的区域（以米为单位）。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Samples&lt;/strong&gt;：为了导出该数据点而处理的测量数量。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Changeable&lt;/strong&gt;：指示基站位置是否通过样本处理确定（1），或是直接从电信公司获取（0）。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Created&lt;/strong&gt;：表示基站首次添加到数据库的时间戳（UNIX格式）。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updated&lt;/strong&gt;：表示基站最后被观察或更新到数据库的时间戳（UNIX格式）。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AverageSignal&lt;/strong&gt;：表示基站位置的平均信号强度。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Country&lt;/strong&gt;：基站所在的国家(或地区)。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network&lt;/strong&gt;：拥有基站的公司。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continent&lt;/strong&gt;：基站所在的大陆。&lt;/li&gt;
&lt;/ul&gt;
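&lt;p&gt;其中 Created、Updated 为 UNIX 秒级时间戳。一个最小示意（1136073600 仅为示例取值），演示如何用 pandas 将其转换为可读时间：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import pandas as pd

# UNIX 秒级时间戳转为可读时间；1136073600 即 2006-01-01 00:00:00 UTC
created = pd.to_datetime(pd.Series([1136073600]), unit='s')
print(created[0])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;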
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;全球蜂窝基站(更新至2024-05).csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#或将gz文件解压得到csv文件，再读取&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;全球蜂窝基站(更新至2024-05).csv&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;updated&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;updated&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-字段缺失程度&#34;&gt;2.2 字段缺失程度&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;missingno&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ms&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ms&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-nan.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;从图中可以看出，本数据集每个字段对应的条形图都是很饱满的黑色， 基本上不存在明显的缺失情况。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-各大陆蜂窝基站数量&#34;&gt;2.3 各大陆蜂窝基站数量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Continent&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Continent
Europe           17119424
Asia             15870848
North America     9541565
South America     3150035
Africa            2346316
Oceania            747233
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-覆盖时间&#34;&gt;2.4 覆盖时间&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;基站设立时间: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;~&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;基站更新时间: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;updated&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;~&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;updated&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;基站设立时间:  1970-01-01 00:00:00 ~ 2024-05-09 23:57:18
基站更新时间:  1970-01-01 18:33:27 ~ 2024-05-09 23:59:02
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;注意:  时间范围虽然很广，但1970~2006之间是极其稀疏的， 绝大多数记录是落在2006-2024。&lt;/p&gt;
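&lt;p&gt;据此可在分析前剔除 2006 年以前的零星记录。一个最小示意（样例数据为虚构的两条记录）：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import pandas as pd

# 两条样例记录：1970-01-01 与 2006-01-01
df = pd.DataFrame({'created': pd.to_datetime([0, 1136073600], unit='s')})

# 只保留 2006 年(含)之后的记录
recent_df = df[df['created'].dt.year >= 2006]
print(len(recent_df))   # 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;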
&lt;br&gt;
&lt;h3 id=&#34;25-筛选指定国家地区&#34;&gt;2.5 筛选指定国家(地区)&lt;/h3&gt;
&lt;p&gt;将中国的基站数据筛选&amp;amp;保存到新文件中。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#筛选含中国的站点数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Country&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;China&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#保存&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中国手机蜂窝基站数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三可视化&#34;&gt;三、可视化&lt;/h2&gt;
&lt;h3 id=&#34;31-全球年度建站&#34;&gt;3.1 全球年度建站&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#文泉驿微米黑.ttf位于代码同文件夹&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;copy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;deep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;YE&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2006&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                     &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;


&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stat&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;identity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;全球蜂窝基站年度建站数(2006-2024)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;基站数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 添加数据标签&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# 垂直对齐方式为底部（即在柱子顶部）&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;                &lt;span class=&#34;c1&#34;&gt;# 设置字体大小&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;format_string&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;     &lt;span class=&#34;c1&#34;&gt;# 格式化字符串&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;18&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scale_x_continuous&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;breaks&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2006&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2025&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; 

&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-global.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-中国年度建站&#34;&gt;3.2 中国年度建站&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#文泉驿微米黑.ttf位于代码同文件夹&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#筛选中国&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Country&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;China&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;copy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 复制切片，避免后续 set_index(inplace=True) 触发 SettingWithCopyWarning&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;updated&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;china_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;YE&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2006&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                     &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;


&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stat&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;identity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中国蜂窝基站年度建站数（2006-2024）&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;基站数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 添加数据标签&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# 垂直对齐方式为底部（即在柱子顶部）&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;                &lt;span class=&#34;c1&#34;&gt;# 设置字体大小&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;format_string&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;     &lt;span class=&#34;c1&#34;&gt;# 格式化字符串&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;18&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scale_x_continuous&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;breaks&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2006&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2025&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; 

&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-china.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集介绍">一、数据集介绍</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据名称: 全球蜂窝基站数据集
数据格式: csv
记录数量: 48775421
覆盖年度: 2006 ~ 2024
使用方法: 可探索全球基站位置的分布与特征，分析网络覆盖模式，并调查基站更新的时间趋势。
数据来源: OpenCelliD
本文声明: 科研用途； 如有问题，请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><br>
<h3 id="所含字段">所含字段</h3>
<ul>
<li><strong>Radio</strong>：宽带蜂窝网络技术的代际（例如，LTE，GSM）。</li>
<li><strong>MCC</strong>：移动国家代码，是每个国家在移动网络中的唯一标识符。</li>
<li><strong>MNC</strong>：移动网络代码，在一个国家内识别移动网络。</li>
<li><strong>LAC</strong>：位置区域代码、跟踪区域代码或网络标识符。</li>
<li><strong>CID</strong>：每个基站收发器（BTS）或扇区的唯一标识符。</li>
<li><strong>Longitude</strong>：指定东西方向位置的地理坐标。</li>
<li><strong>Latitude</strong>：指定南北方向位置的地理坐标。</li>
<li><strong>Range</strong>：大约的覆盖范围，表示基站覆盖延伸的区域（以米为单位）。</li>
<li><strong>Samples</strong>：为了导出该数据点而处理的测量数量。</li>
<li><strong>Changeable</strong>：指示基站位置是否通过样本处理确定（1），或是直接从电信公司获取（0）。</li>
<li><strong>Created</strong>：表示基站首次添加到数据库的时间戳（UNIX格式）。</li>
<li><strong>Updated</strong>：表示基站最后被观察或更新到数据库的时间戳（UNIX格式）。</li>
<li><strong>AverageSignal</strong>：表示基站位置的平均信号强度。</li>
<li><strong>Country</strong>：基站所在的国家(或地区)。</li>
<li><strong>Network</strong>：拥有基站的公司。</li>
<li><strong>Continent</strong>：基站所在的大陆。</li>
</ul>
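<p>其中 Created/Updated 两个字段是以秒计的 UNIX 时间戳；若拿到的是原始秒数（而不是已格式化的时间字符串），可用 <code>pd.to_datetime(unit='s')</code> 换算。下面的两个秒数是演示用的假设值，仅用于说明换算方法。</p>

```python
import pandas as pd

# Created/Updated 为 UNIX 时间戳(秒)；unit='s' 可将其转换为 datetime
# 下面两个秒数为演示用的假设值
ts = pd.Series([1136073600, 1715299038])
print(pd.to_datetime(ts, unit='s'))
```

<p>换算结果为 UTC 时间（1136073600 对应 2006-01-01 00:00:00）。</p>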
<br>
<br>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;全球蜂窝基站(更新至2024-05).csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#或将gz文件解压得到csv文件，再读取</span>
<span class="c1">#df = pd.read_csv(&#39;全球蜂窝基站(更新至2024-05).csv&#39;)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">created</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;updated&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">updated</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="22-字段缺失程度">2.2 字段缺失程度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-nan.png" alt=""  />
</p>
<p>从图中可以看出，本数据集每个字段对应的条形图都是很饱满的黑色， 基本上不存在明显的缺失情况。</p>
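<p>除 missingno 的可视化外，也可以用 <code>isna().mean()</code> 直接算出各字段的精确缺失比例。下面用一个假设的小表演示，真实场景中把 toy 替换为上文的 df 即可。</p>

```python
import pandas as pd

# 各列缺失比例 = 缺失单元格数 / 总行数（toy 为演示用的假设数据）
toy = pd.DataFrame({'a': [1.0, 2.0, None], 'b': [1, 2, 3]})
print(toy.isna().mean())   # a 列缺失 1/3，b 列无缺失
```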
<br>
<h3 id="23-各大陆蜂窝基站数量">2.3 各大陆蜂窝基站数量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">Continent</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Continent
Europe           17119424
Asia             15870848
North America     9541565
South America     3150035
Africa            2346316
Oceania            747233
Name: count, dtype: int64
</code></pre></div><br>
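<p>在上面计数的基础上，可以进一步换算各大陆占比（counts 中的数字直接取自上文 value_counts 的输出；对真实数据，等价的做法是 <code>df.Continent.value_counts(normalize=True)</code>）。</p>

```python
# 各大陆基站数占比；数字取自上文 value_counts 的输出
counts = {'Europe': 17119424, 'Asia': 15870848, 'North America': 9541565,
          'South America': 3150035, 'Africa': 2346316, 'Oceania': 747233}
total = sum(counts.values())          # 48775421，与数据集介绍中的记录数一致
for name, v in counts.items():
    print(f'{name}: {v / total:.1%}')
```

<p>可以看到欧洲、亚洲两地合计约占全部记录的三分之二。</p>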
<h3 id="24-覆盖时间">2.4 覆盖时间</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;基站设立时间: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;基站更新时间: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;updated&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="s1">&#39;~&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;updated&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">基站设立时间范围:  1970-01-01 00:00:00 ~ 2024-05-09 23:57:18
基站更新时间范围:  1970-01-01 18:33:27 ~ 2024-05-09 23:59:02
</code></pre></div><p>注意: 时间范围虽然很广，但 1970-01-01 00:00:00 正是 UNIX 时间戳的零点，这类早期时间多为缺省占位值；1970~2006 之间的记录极其稀疏，绝大多数记录落在 2006~2024。</p>
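<p>"1970~2006 极其稀疏"可以直接统计验证：计算 created 早于 2006 年的记录占比。下面以假设的小样本演示思路，真实场景对上文 df['created'] 做同样计算即可。</p>

```python
import pandas as pd

# 统计某个时间点之前的记录占比（toy 为演示用的假设数据）
toy = pd.DataFrame({'created': pd.to_datetime(
    ['1970-01-01', '2010-06-01', '2020-03-15', '2023-12-31'])})
early_ratio = (toy['created'] < '2006-01-01').mean()
print(f'2006 年以前记录占比: {early_ratio:.2%}')
```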
<br>
<h3 id="25-筛选指定国家地区">2.5 筛选指定国家(地区)</h3>
<p>将中国的基站数据筛选&amp;保存到新文件中。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#筛选含中国的站点数据</span>
<span class="n">china_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">Country</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;China&#39;</span><span class="p">)]</span>

<span class="c1">#保存</span>
<span class="n">china_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;中国手机蜂窝基站数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">china_df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
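<p>保存为 .csv.gz 后可以随时再读回。下面用一个小表演示写入/读取的无损往返（文件名为临时路径，仅作演示，对应真实场景中的「中国手机蜂窝基站数据.csv.gz」）。</p>

```python
import os
import tempfile

import pandas as pd

# gzip 压缩的 csv 写入后再读回，内容完全一致（toy 为演示用的假设数据）
toy = pd.DataFrame({'Country': ['China', 'China'], 'CID': [1, 2]})
path = os.path.join(tempfile.mkdtemp(), 'demo.csv.gz')
toy.to_csv(path, index=False, compression='gzip')
back = pd.read_csv(path)   # 后缀为 .gz 时 pandas 会自动推断压缩方式
print(back.equals(toy))    # True
```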
<br>
<br>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-全球年度建站">3.1 全球年度建站</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">years</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df2</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;created&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="k">if</span> <span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="o">&gt;=</span><span class="mi">2006</span><span class="p">:</span>
        <span class="n">years</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
        <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;全球蜂窝基站年度建站数(2006-2024)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;基站数量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2006</span><span class="p">,</span> <span class="mi">2025</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-global.png" alt=""  />
</p>
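<p>上面逐年循环统计的写法，也可以用 DatetimeIndex 的 year 属性一行完成计数；对每年都有记录的数据，二者结果等价（pd.Grouper 会为空缺年份补 0，value_counts 则会跳过空缺年份）。下面用假设的小样本演示。</p>

```python
import pandas as pd

# 按索引年份直接计数，效果相当于 pd.Grouper(freq='YE') 的逐年循环
# （toy 为演示用的假设数据，真实场景用上文的 df2 即可）
toy = pd.DataFrame(
    {'v': [1, 2, 3]},
    index=pd.to_datetime(['2006-03-01', '2006-07-15', '2007-01-02']))
yearly = toy.index.year.value_counts().sort_index()
print(yearly)   # 2006 -> 2, 2007 -> 1
```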
<br>
<h3 id="32-中国年度建站">3.2 中国年度建站</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">years</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1">#筛选中国</span>
<span class="n">china_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">Country</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;China&#39;</span><span class="p">)]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span> <span class="c1"># 复制切片，避免后续 set_index(inplace=True) 触发 SettingWithCopyWarning</span>
<span class="n">china_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;updated&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">china_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="k">if</span> <span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="o">&gt;=</span><span class="mi">2006</span><span class="p">:</span>
        <span class="n">years</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
        <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> 
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;中国蜂窝基站年度建站数（2006-2024）&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;基站数量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2006</span><span class="p">,</span> <span class="mi">2025</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-china.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 536w条「上证e互动、深证互动易」问答记录(2011-2024.12.31)</title>
      <link>https://textdata.cn/blog/2025-03-03-china-share-market-interaction-platform-dataset/</link>
      <pubDate>Mon, 03 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-03-03-china-share-market-interaction-platform-dataset/</guid>
      <description>「上证e互动、深证互动易」问答记录数据集是研究中国资本市场信息披露、投资者关系管理及市场行为的重要非结构化数据源。</description>
      <content:encoded><![CDATA[<h2 id="一数据集介绍">一、数据集介绍</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">「上证e互动、深证互动易」问答记录

覆盖日期: 2009-04-09 ~ 2024-12-31

数据来源: 上证e互动、深证互动易

数据格式: csv

企业数: 5344

记录条数: 5364719

所含字段:
 -  symbol 股票代码
 -  shortName 公司简称
 -  indexId 网址ID(供爬虫使用)
 -  question 提问内容
 -  questionDate 提问时间
 -  authorName 提问者昵称
 -  authorCode 提问者ID
 -  answer 回答内容
 -  answerDate 回答时间
  
本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/01-irm.cninfo.com.cn.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-sns.sseinfo.com.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;互动平台问答文本.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/03-df1.png" alt=""  />
</p>
<br>
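<p>536 万条记录一次性读入内存压力较大。下面是一个分块读取的示意（用合成的小样本文件 demo.csv.gz 演示，文件名为假设，并非原数据集文件）：</p>

```python
import pandas as pd

# 大文件可用 chunksize 分块读取、逐块累计统计量。
# 此处先生成一个合成的小样本文件 demo.csv.gz 用于演示(文件名为假设)。
sample = pd.DataFrame({
    'symbol': ['000001', '000002', '000001'],
    'question': ['贵公司经营情况?', '分红计划?', '订单情况?'],
})
sample.to_csv('demo.csv.gz', index=False, compression='gzip')

total = 0
symbols = set()
# dtype=str 防止股票代码的前导 0 被按整数解析而丢失
for chunk in pd.read_csv('demo.csv.gz', chunksize=2, dtype=str):
    total += len(chunk)
    symbols.update(chunk['symbol'])

print(total, len(symbols))   # 3 2
```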
<h3 id="22-数据量">2.2 数据量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#数据量</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    5364719
</code></pre></div><br>
<h3 id="23-企业数">2.3 企业数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#企业数</span>
<span class="n">df</span><span class="o">.</span><span class="n">symbol</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5344
</code></pre></div><br>
<br>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-字段缺失情况">3.1 字段缺失情况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">missingno</span> <span class="k">as</span> <span class="nn">ms</span>

<span class="n">ms</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-nan.png" alt=""  />
</p>
<p>所有字段均显示为饱满的黑柱，看不到代表缺失值的白色条纹，因此该数据集各字段不存在明显的数据缺失情况。</p>
<br>
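<p>除 missingno 的可视化印象外，也可以用 isnull().sum() 得到各字段缺失的精确计数。下面用合成的小样本示意：</p>

```python
import pandas as pd

# 合成小样本：answer 字段含一个缺失值(数据为演示假设)
df = pd.DataFrame({
    'symbol': ['000001', '000002', '000003'],
    'answer': ['感谢关注', None, '已回复'],
})
missing = df.isnull().sum()   # 各字段缺失值计数
print(missing['answer'])   # 1
print(missing['symbol'])   # 0
```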
<h3 id="32-按年度显示问答记录量条数">3.2 按年度显示问答记录量(条数)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 


<span class="n">years</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;questionDate&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;questionDate&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;questionDate&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;YE&#39;</span><span class="p">)):</span>
    <span class="n">years</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_df</span><span class="p">))</span>

<span class="c1">#使用循环收集到的 years，避免与硬编码的年份范围不一致</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span>
                     <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;上证e互动、深证互动易年度问答记录量(2009-2024)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;记录量(条)&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">2009</span><span class="p">,</span> <span class="mi">2025</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-vis.png" alt=""  />
</p>
<p><br><br></p>
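<p>上面按年分组的 pd.Grouper 循环也可以不写循环：直接取 DatetimeIndex 的 year 属性再 value_counts。下面用合成的小样本示意这一等价写法：</p>

```python
import pandas as pd

# 无循环的按年计数写法(合成小样本，日期为演示假设)
dates = pd.to_datetime(['2009-05-01', '2009-06-02', '2010-01-03'])
df = pd.DataFrame({'question': ['q1', 'q2', 'q3']}, index=dates)

# 取索引的年份后计数，再按年份排序
counts = df.index.year.value_counts().sort_index()
print(counts.loc[2009], counts.loc[2010])   # 2 1
```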
<h2 id="相关文献">相关文献</h2>
<p>丁慧, 吕长江, 陈运佳. 投资者信息能力:意见分歧与股价崩盘风险——来自社交媒体“上证e互动”的证据[J]. 管理世界, 2018, 34 (09): 161-171.</p>
<p>丁慧, 吕长江, 黄海杰. 社交媒体、投资者信息获取和解读能力与盈余预期——来自“上证e互动”平台的证据[J]. 经济研究, 2018, 53 (01): 153-168.</p>
<p>高敬忠, 杨朝, 彭正银. 网络平台互动能够缓解企业融资约束吗——来自交易所互动平台问答的证据[J]. 会计研究, 2021, (06): 59-75.</p>
<p>卞世博, 陈曜, 汪训孝. 高质量的互动可以提高股票市场定价效率吗?——基于“上证e互动”的研究[J]. 经济学(季刊), 2022, 22 (03): 749-772.</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 1998-2023年中国基金年度报告</title>
      <link>https://textdata.cn/blog/2025-02-25-china-fund-annual-report-dataset/</link>
      <pubDate>Tue, 25 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-02-25-china-fund-annual-report-dataset/</guid>
      <description>&lt;h2 id=&#34;一数据集介绍&#34;&gt;一、数据集介绍&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名称: 中国基金年度报告数据集
基金数量: 12113
会计年度: 1998 ~ 2023
数据源: http://eid.csrc.gov.cn/fund/disclose/index.html
数据格式: pdf、csv(46196个pdf汇总到一个csv中)
获取: 1000元；如购买，请加微信 372335839，  备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-choose.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;字段&#34;&gt;字段&lt;/h3&gt;
&lt;p&gt;1998 ~ 2023年基金年报的字段有&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;会计年度year&lt;/li&gt;
&lt;li&gt;代码code&lt;/li&gt;
&lt;li&gt;基金简称name&lt;/li&gt;
&lt;li&gt;基金年度报告文本text&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二读取数据&#34;&gt;二、读取数据&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;基金年报.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三可视化&#34;&gt;三、可视化&lt;/h2&gt;
&lt;p&gt;基金数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;12113
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#文泉驿微米黑.ttf位于代码同文件夹&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1998&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;record_num&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;record_num&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;year&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1998&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;volume&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;volumes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stat&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;identity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中国基金年度报告数量(1998-2024.6)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;报告数&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 添加数据标签&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# 垂直对齐方式为底部（即在柱子顶部）&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;                &lt;span class=&#34;c1&#34;&gt;# 设置字体大小&lt;/span&gt;
               &lt;span class=&#34;n&#34;&gt;format_string&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;     &lt;span class=&#34;c1&#34;&gt;# 格式化字符串&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;18&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scale_x_continuous&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;breaks&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1998&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; 

&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四获取&#34;&gt;四、获取&lt;/h2&gt;
&lt;p&gt;1000元；如购买，请加微信 372335839，  备注「姓名-学校-专业」&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/&#34;&gt;&lt;strong&gt;数据集 | 港股年报文本数据集(2007 ~ 2023.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/&#34;&gt;&lt;strong&gt;数据集(付费) | 三板上市公司年报2002-2023.12&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/&#34;&gt;&lt;strong&gt;数据集 | 美股年报10-K、20-F数据(2000-2023.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;&lt;strong&gt;词向量(付费) | 使用MD&amp;amp;A2001-2022语料训练Word2Vec模型&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-01-06-mda_informative_content/&#34;&gt;中国工业经济 | MD&amp;amp;A信息含量指标构建代码实现&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/&#34;&gt;金融研究 | 使用Python构建「关键审计事项信息含量」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/&#34;&gt;中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/&#34;&gt;代码 | 使用 MD&amp;amp;A文本测量「企业不确定性感知FEPU」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;&lt;strong&gt;数据集 | A股上市公司基本信息&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集介绍">一、数据集介绍</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名称: 中国基金年度报告数据集
基金数量: 12113
会计年度: 1998 ~ 2023
数据源: http://eid.csrc.gov.cn/fund/disclose/index.html
数据格式: pdf、csv(46196个pdf汇总到一个csv中)
获取: 1000元；如购买，请加微信 372335839，  备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/01-choose.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="字段">字段</h3>
<p>1998 ~ 2023年基金年报的字段有</p>
<ul>
<li>会计年度year</li>
<li>代码code</li>
<li>基金简称name</li>
<li>基金年度报告文本text</li>
</ul>
<br>
<br>
<h2 id="二读取数据">二、读取数据</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;基金年报.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<br>
<h2 id="三可视化">三、可视化</h2>
<p>基金数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">code</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">12113
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1998</span><span class="p">,</span> <span class="mi">2024</span><span class="p">):</span>
    <span class="n">record_num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">year</span><span class="o">==</span><span class="n">year</span><span class="p">])</span>
    <span class="n">volumes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">record_num</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;year&#34;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1998</span><span class="p">,</span> <span class="mi">2024</span><span class="p">),</span>
        <span class="s2">&#34;volume&#34;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">}</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s1">&#39;identity&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;中国基金年度报告数量(1998-2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;报告数&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span>  <span class="c1"># 添加数据标签</span>
               <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span>           <span class="c1"># 垂直对齐方式为底部（即在柱子顶部）</span>
               <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>                <span class="c1"># 设置字体大小</span>
               <span class="n">format_string</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span>     <span class="c1"># 格式化字符串</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1998</span><span class="p">,</span> <span class="mi">2024</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> 

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-plot.png" alt=""  />
</p>
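<p>上方绘图代码假设汇总表中已有 year(年度)、volume(报告数)两列。绘图前可以先用 pandas 做一次分组计数，下面是一个最小示意（其中 reports 及示例数据均为演示假设）：</p>

```python
import pandas as pd

# 演示数据：每行代表一份年报， year 为演示假设的字段名
reports = pd.DataFrame({"year": [1998, 1998, 1999, 2000, 2000, 2000]})

# 按年度分组计数，得到绘图所需的 year、volume 两列
vol_df = reports.groupby("year").size().reset_index(name="volume")
print(vol_df)
```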
<p><br><br></p>
<h2 id="四获取">四、获取</h2>
<p>价格 1000 元；如需购买，请加微信 372335839， 备注「姓名-学校-专业」</p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/"><strong>数据集 | 港股年报文本数据集(2007 ~ 2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/"><strong>数据集(付费) | 三板上市公司年报2002-2023.12</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/"><strong>数据集 | 美股年报10-K、20-F数据(2000-2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/"><strong>词向量(付费) | 使用MD&amp;A2001-2022语料训练Word2Vec模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | MD&amp;A信息含量指标构建代码实现</a></li>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/">中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息</a></li>
<li><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">代码 | 使用 MD&amp;A文本测量「企业不确定性感知FEPU」</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/"><strong>数据集 | A股上市公司基本信息</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用大模型从图片中提取结构化数据</title>
      <link>https://textdata.cn/blog/2025-02-22-extracting-structured-data-from-images-with-ollama-and-large-language-models/</link>
      <pubDate>Sat, 22 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-02-22-extracting-structured-data-from-images-with-ollama-and-large-language-models/</guid>
      <description>在快速发展的人工智能领域，将视觉功能集成到大型语言模型中，可以用于解读图片语义， 从图片中提取出结构化数据。</description>
      <content:encoded><![CDATA[<p>在快速发展的人工智能领域，将视觉功能集成到大型语言模型中，<strong>可以用于解读图片语义， 从图片中提取出结构化数据</strong>。</p>
<p><br><br></p>
<h2 id="一环境配置">一、环境配置</h2>
<p>在Python中调用大模型，  先要配置好相应的环境。</p>
<h3 id="11-安装python包">1.1 安装python包</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
pip3 install pydantic
pip3 install instructor
</code></pre></div><h3 id="12-安装ollama">1.2 安装Ollama</h3>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其模型库中获取各类 LLM，一条命令即可完成下载，再执行一条命令即可开始使用，对习惯在终端窗口中工作的用户非常友好。Ollama的安装、配置、使用的详细教程可阅读  <a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></p>
<p><img loading="lazy" src="img/01-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="13-安装大模型">1.3 安装大模型</h3>
<p>截至2025.2.22， 在 <a href="https://ollama.ai/"><strong>Ollama</strong></a> 网站中公开的 <em><strong>视觉类大模型</strong></em> 有7个， 这里简单介绍其中的两个</p>
<ul>
<li><em><strong>llama3.2-vision</strong></em> 更擅长识别图片中的英文信息</li>
<li><em><strong>minicpm-v</strong></em>  模型基于qwen， 更擅长识别图片中的中文信息</li>
</ul>
<p><img loading="lazy" src="img/02-vision-llm.png" alt=""  />
</p>
<p>打开命令行 <em><strong>cmd</strong></em> (在mac中对应terminal) ， 执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3.2-vision:11b
ollama pull minicpm-v:8b
</code></pre></div><br>
<h3 id="14-启动ollama服务">1.4 启动Ollama服务</h3>
<p>打开命令行 <em><strong>cmd</strong></em> (在mac中对应terminal) ， 执行启动服务命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><br>
<br>
<h2 id="二实验代码">二、实验代码</h2>
<h3 id="21-非结构化输出">2.1 非结构化输出</h3>
<p>截图的文件名 <em><strong>test_screen.png</strong></em></p>
<p><img loading="lazy" src="img/03-test.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">ollama</span>


<span class="c1">#论文的截图文件 test_screen.png</span>
<span class="c1">#注意，代码文件与截图文件同处于一个文件夹内</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;minicpm-v&#39;</span><span class="p">,</span>  
    <span class="n">messages</span><span class="o">=</span><span class="p">[{</span>
        <span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span>
        <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;这是一篇什么领域的论文？&#39;</span><span class="p">,</span>
        <span class="s1">&#39;images&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;test_screen.png&#39;</span><span class="p">]</span>
    <span class="p">}]</span>
<span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ChatResponse(model=&#39;minicpm-v&#39;, created_at=&#39;2025-02-22T13:11:25.766017Z&#39;, done=True, done_reason=&#39;stop&#39;, total_duration=12956488125, load_duration=819433041, prompt_eval_count=461, prompt_eval_duration=9630000000, eval_count=147, eval_duration=2499000000, message=Message(role=&#39;assistant&#39;, content=&#39;这张图片是关于一篇题为“开或关在轨：如何（破碎）的线索影响消费者决策”的文章标题页。该文章由杰基·西尔弗曼和亚历山德拉·巴拉斯奇撰写，探讨了消费者行为的新技术追踪的后果。研究发现，在七项研究中，持续的行为轨迹会引发高消费后的强化，并且如果打破了这些轨迹，则会产生相反的效果，从而影响消费者的决策。所用的研究方法包括跟踪、行为分析以及追踪和监测等工具和技术，以了解线索对不同领域（如体育、学习）的影响。关键词列出了文章的焦点领域：断路器、行为追踪和记录、消费者动机、参与度。&#39;, images=None, tool_calls=None))
</code></pre></div><br>
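<p>如果只需要助手回复的正文而非完整的 ChatResponse 对象，可以直接取 message.content 字段。下面是一个最小示意（get_content 为演示用的假设函数名，兼容对象属性与字典两种访问方式）：</p>

```python
def get_content(response):
    """从 ollama.chat 的返回值中取出助手回复的文本。"""
    msg = response.message if hasattr(response, "message") else response["message"]
    return msg.content if hasattr(msg, "content") else msg["content"]

# 用字典模拟一次返回结果， 演示取值过程
fake_response = {"message": {"role": "assistant", "content": "这是一篇消费者行为领域的论文。"}}
print(get_content(fake_response))
```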
<h3 id="22-结构化输出">2.2 结构化输出</h3>
<p>设计更详细的提示 prompt， 并通过 <em><strong>typing</strong></em> 和 <em><strong>pydantic</strong></em> 定义数据结构，使输出为字典类数据。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">instructor</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;请分析提供的图片，并从中提取以下信息：
</span><span class="s2">- 标题(title)
</span><span class="s2">- 学科(subject)
</span><span class="s2">- 领域(field)
</span><span class="s2">
</span><span class="s2">
</span><span class="s2">请以如下格式返回结果：
</span><span class="s2">{
</span><span class="s2">    &#34;title&#34;: &#34;论文的标题&#34;,
</span><span class="s2">    &#34;subject&#34;: &#34;论文所属学科&#34;,
</span><span class="s2">    &#34;field&#34;: &#34;论文的研究领域&#34;,
</span><span class="s2">}&#34;&#34;&#34;</span>


<span class="c1">#本地已安装大模型minicpm-v</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s1">&#39;minicpm-v&#39;</span>
<span class="n">base_url</span> <span class="o">=</span> <span class="s1">&#39;http://127.0.0.1:11434/v1&#39;</span>
<span class="n">api_key</span> <span class="o">=</span> <span class="s1">&#39;NA&#39;</span>


<span class="c1">#论文的截图文件test_screen.png</span>
<span class="c1">#注意，代码文件与截图文件同处于一个文件夹内</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">Image</span><span class="o">.</span><span class="n">from_path</span><span class="p">(</span><span class="s2">&#34;test_screen.png&#34;</span><span class="p">)</span>




<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span>
        <span class="n">OpenAI</span><span class="p">(</span>
            <span class="n">base_url</span><span class="o">=</span><span class="n">base_url</span><span class="p">,</span>
            <span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">,</span>  <span class="c1"># required, but unused</span>
        <span class="p">),</span>
        <span class="n">mode</span><span class="o">=</span><span class="n">instructor</span><span class="o">.</span><span class="n">Mode</span><span class="o">.</span><span class="n">MD_JSON</span><span class="p">,</span>
<span class="p">)</span>


<span class="k">class</span> <span class="nc">Paper</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">title</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">subject</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">field</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>




<span class="c1"># Create structured output</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT</span><span class="p">},</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">image</span><span class="p">},</span>
    <span class="p">],</span>
    <span class="n">response_model</span> <span class="o">=</span> <span class="n">Paper</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.0</span>
<span class="p">)</span>


<span class="n">result</span><span class="o">.</span><span class="n">model_dump</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;title&#39;: &#39;On or Off Track: How (Broken) Streaks Affect Consumer Decisions&#39;,
 &#39;subject&#39;: [&#39;streaks, behavioral tracking and logging, technology, goals and motivation&#39;],
 &#39;field&#39;: [&#39;consumer behavior&#39;, &#39;marketing research&#39;, &#39;engagement strategies&#39;]}
</code></pre></div><br>
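<p>结构化输出依赖模型严格按 JSON 格式回复， 模型偶尔会在 JSON 前后夹带说明文字。一个常见的兜底思路是先从回复中提取第一个 JSON 对象再解析， 下面是纯 Python 的最小示意（extract_json 为演示用的假设函数名）：</p>

```python
import json
import re

def extract_json(reply):
    """从模型回复中提取第一个 {...} 片段并解析；失败时返回 None。"""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group())
    except json.JSONDecodeError:
        return None

reply = '好的，结果如下：{"title": "On or Off Track", "subject": "marketing"}'
print(extract_json(reply))
```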
<p><br><br></p>
<h2 id="三讨论">三、讨论</h2>
<p>大邓测试发现 <em><strong>结构化输出</strong></em> 很容易出错， 相比之下 <em><strong>非结构化输出</strong></em> 更稳定一些。</p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>PNAS | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库cntext2.x使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>PNAS | GPT 是多语言心理文本分析的有效工具</title>
      <link>https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/</link>
      <pubDate>Mon, 17 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/</guid>
      <description>许多领域（包括心理学、社会学、通信、政治学和计算机科学）都使用计算方法来分析文本数据。但是，现有的文本分析方法存在许多缺点。字典方法虽然易于使用，但与最近的方法相比通常不是很准确。机器学习模型虽然更准确，但可能难以训练和使用。我们证明，大型语言模型 GPT 能够使用简单的提示准确检测 12 种语言文本中的各种心理结构（由手动注释者判断），无需额外的训练数据。因此，GPT 克服了现有方法中存在的局限性。GPT 在几种较少使用的语言中也很有效，这可以促进来自研究不足的环境中的文本分析研究。</description>
      <content:encoded><![CDATA[<p>许多领域（包括心理学、社会学、通信、政治学和计算机科学）都使用量化文本分析来构建研究中的概念指标。但现有的文本分析方法存在许多缺点。</p>
<ul>
<li>词典方法易于使用，但与较新的方法相比准确性通常不高。</li>
<li>机器学习模型虽然更准确，但可能难以训练和使用。</li>
</ul>
<p><strong>该研究证明，大型语言模型 GPT 能够使用简单的提示准确检测 12 种语言文本中的各种心理结构(情感、情绪、冒犯性和道德基础)，无需额外的训练数据</strong>。因此，GPT 克服了现有方法中存在的局限性。</p>
<br>
<h2 id="一资料">一、资料</h2>
<h3 id="11-文献">1.1 文献</h3>
<p>S. Rathje, D. Mirea, I. Sucholutsky, R. Marjieh, C.E. Robertson, &amp; J.J. Van Bavel, GPT is an effective tool for multilingual psychological text analysis, Proc. Natl. Acad. Sci. U.S.A. 121 (34) e2308950121, <a href="https://doi.org/10.1073/pnas.2308950121">https://doi.org/10.1073/pnas.2308950121</a> (2024).</p>
<h3 id="12-代码">1.2 代码</h3>
<p>该研究的作者使用 R 语言进行数据分析， 实验数据&amp;代码见 <a href="https://osf.io/6pnb2/">https://osf.io/6pnb2/</a></p>
<p>演示视频</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=Mm3uoK4Fogc&amp;t=344s">https://www.youtube.com/watch?v=Mm3uoK4Fogc&amp;t=344s</a></li>
<li><a href="https://www.bilibili.com/video/BV1KQwdeYE74/">https://www.bilibili.com/video/BV1KQwdeYE74/</a></li>
</ul>
<p><img loading="lazy" src="img/03.png" alt=""  />
</p>
<br>
<br>
<h2 id="二内容速览">二、内容速览</h2>
<h3 id="21-研究背景">2.1 研究背景</h3>
<ol>
<li><strong>研究问题</strong>：这篇文章探讨了大型语言模型（LLM）GPT是否可以作为自动化心理文本分析的工具，用于在多种语言中检测心理构念（如情感、离散情绪、冒犯性和道德基础）。</li>
<li><strong>研究难点</strong>：现有的文本分析方法存在准确性和适用性不足的问题。词典方法虽然易于使用，但在检测心理构念时准确性较低。机器学习方法虽然更准确，但需要大量的标注数据和高级编程技能。此外，现有方法在多语言数据分析方面也存在局限性。</li>
<li><strong>相关工作</strong>：计算社会科学领域已经使用自动化文本分析来研究社会趋势、社交媒体病毒式传播、心理健康状况与意识形态、个性等。然而，大多数现有方法依赖于西方人群和英语数据集，缺乏对少数语言和文化的研究。</li>
</ol>
<br>
<h3 id="22-研究方法">2.2 研究方法</h3>
<p>这篇论文提出了使用GPT进行自动化心理文本分析的方法。具体来说，</p>
<ol>
<li><strong>GPT模型</strong>：GPT是基于Transformer架构的大型语言模型，训练数据来自互联网文本（如Common Crawl或Wikipedia），能够在无需额外训练的情况下完成跨语言的文本分析任务（即“零样本学习”）。</li>
<li><strong>提示使用</strong>：GPT通过“提示”的方式工作，即根据用户提出的问题生成输出。例如，对于情感分析任务，提示可以是 <strong>”请根据以下文本的情感打分：1表示非常负面，7表示非常正面”</strong>。</li>
<li><strong>性能评估</strong>：使用准确率和平均F1值来衡量GPT的性能。准确率计算正确评分的数量占总评分数量的比例，而平均F1值则考虑了GPT在不同类型错误（如假阳性和假阴性）上的表现。</li>
</ol>
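<p>上面提到的准确率与 F1 值， 可以用几行 Python 演示其计算方式（以二分类为例，标注数据为演示假设）：</p>

```python
gold = [1, 0, 1, 1, 0, 1]   # 人工标注(1 为正类)
pred = [1, 0, 0, 1, 0, 1]   # 模型输出

# 准确率 = 评分正确的数量 / 总评分数量
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# F1 同时考虑假阳性(fp)与假阴性(fn)两类错误
tp = sum(g == p == 1 for g, p in zip(gold, pred))
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, round(f1, 3))
```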
<br>
<h3 id="23-实验设计">2.3 实验设计</h3>
<ol>
<li><strong>数据集</strong>：使用了15个数据集，共包含47,925条手动标注的推文和新闻标题，涵盖12种语言。数据集涵盖了四种心理构念：<strong>情感、离散情绪、冒犯性和道德基础</strong>。</li>
<li><strong>实验设置</strong>：使用GPT API进行多次提示，提示格式简洁明了。例如，情感分析的提示为：<strong>“请根据以下文本的情感打分：1表示非常负面，7表示非常正面。这里是我们的文本：[文本内容]”</strong>。</li>
<li><strong>对比方法</strong>：将GPT的性能与其他常见的文本分析方法（如词典方法）以及顶级调优的机器学习模型进行对比。</li>
</ol>
<br>
<h3 id="24-结果与分析">2.4 结果与分析</h3>
<p><img loading="lazy" src="img/01.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02.png" alt=""  />
</p>
<ol>
<li><strong>情感分析</strong>：在英语和阿拉伯语数据集上，GPT-3.5 Turbo的准确率为0.673和0.700，F1值分别为0.685和0.720。<strong>GPT-4和GPT-4 Turbo在情感分析任务上也表现出色，且随着模型版本的更新，性能有所提升</strong>。</li>
<li><strong>离散情绪检测</strong>：在英语和印度尼西亚语数据集上，GPT-3.5 Turbo的F1值分别为0.714和0.686，GPT-4 Turbo的F1值分别为0.782和0.785。<strong>GPT-4 Turbo在所有测试的语言中都表现出色，接近或超过了顶级调优的机器学习模型</strong>。</li>
<li><strong>冒犯性检测</strong>：在英语和土耳其语数据集上，GPT-3.5 Turbo的F1值分别为0.721和0.752，GPT-4 Turbo的F1值分别为0.762和0.762。<strong>GPT-4 Turbo在所有测试的语言中都表现出色，显著优于现有的词典方法</strong>。</li>
<li><strong>道德基础检测</strong>：在Reddit评论数据集上，GPT-4的F1值为0.653，GPT-4 Turbo的F1值为0.677。<strong>尽管在某些复杂心理构念（如比例性）上表现较差，但总体上仍接近顶级调优的BERT模型</strong>。</li>
</ol>
<br>
<h3 id="25-总体结论">2.5 总体结论</h3>
<p><strong>这篇论文展示了GPT作为自动化心理文本分析工具的潜力，具有高精度和广泛的应用范围</strong>。GPT在多种语言和不同类型的文本数据上表现出色，且无需额外的训练数据。<strong>尽管在某些复杂心理构念上表现不如最新的调优模型，但其灵活性和易用性使其成为现有自动化文本分析方法的有效替代方案</strong>。未来研究应继续探索GPT和其他LLM在不同语言和文化背景下的表现，以验证其普适性。</p>
<br>
<h2 id="三论文评价">三、论文评价</h2>
<h3 id="31-优点与创新">3.1 优点与创新</h3>
<ol>
<li><strong>多语言支持</strong>：GPT在多种语言（包括12种语言）中表现出色，特别是在较少使用的语言中，如斯瓦希里语、豪萨语、阿姆哈拉语等。</li>
<li><strong>无需训练数据</strong>：GPT能够在零样本学习的情况下进行文本分析，不需要额外的训练数据。</li>
<li><strong>简单易用</strong>：GPT使用简单的提示（如“这篇文章是消极的吗？”）即可进行分析，且不需要大量的编码经验。</li>
<li><strong>高准确性</strong>：<strong>GPT在检测情感、离散情绪、冒犯性和道德基础等心理构念方面，表现优于现有的英语词典分析方法，并且在某些情况下接近或超过了顶级调优的机器学习模型</strong>。</li>
<li><strong>跨语言一致性</strong>：不同版本的GPT在输出上具有高度一致性，表明其结果具有较高的可重复性。</li>
<li><strong>广泛的适用性</strong>：GPT适用于各种文本类型（如推文、新闻标题和Reddit评论），并且能够处理不同类型的评分（如Likert量表）。</li>
<li><strong>测试-重测可靠性</strong>：GPT在多次运行中具有极高的可靠性，Cohen&rsquo;s Kappa值在0.93到0.99之间。</li>
</ol>
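<p>文中提到的测试-重测可靠性用 Cohen&rsquo;s Kappa 度量， 其标准算法可以用纯 Python 演示（两次运行的标注序列为演示假设）：</p>

```python
def cohens_kappa(a, b):
    """两组类别标注的 Cohen's Kappa：(观测一致率 - 期望一致率) / (1 - 期望一致率)。"""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

run1 = [1, 1, 0, 1, 0, 0, 1, 1]   # 第一次运行的标注
run2 = [1, 1, 0, 1, 0, 1, 1, 1]   # 第二次运行的标注
print(round(cohens_kappa(run1, run2), 3))
```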
<br>
<h3 id="32-不足应对">3.2 不足&amp;应对</h3>
<table>
<thead>
<tr>
<th>问题</th>
<th>反思</th>
<th>应对</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>版本差异</strong></td>
<td>chatGPT模型每时每刻都在训练&amp;更新，不同时刻的GPT可以看做是不同版本；</td>
<td>使用开源LLM， 可明确指定使用的版本，方便其他学者复现。</td>
</tr>
<tr>
<td><strong>心理构念</strong></td>
<td>只能处理较为简单的心理构念， 太复杂的心理构念因为难以描述和界定，无法设计提示。</td>
<td>未来可微调LLM</td>
</tr>
<tr>
<td><strong>成本问题</strong></td>
<td>GPT API的使用成本较高，尤其是GPT-4</td>
<td>使用开源LLM，如Qwen、llama、deepseek；</td>
</tr>
<tr>
<td><strong>隐私问题</strong></td>
<td>使用GPT API可能会导致研究中的隐私数据泄露</td>
<td>本地(离线)部署LLM， 所有数据都不会泄露</td>
</tr>
<tr>
<td><strong>模型选择</strong></td>
<td>本研究仅使用了GPT，未测试其他模型</td>
<td>使用其他LLM，如Qwen、llama、deepseek等</td>
</tr>
<tr>
<td><strong>文化偏见</strong></td>
<td>GPT可能会反映人类偏见，如内群体偏好和对WEIRD人群任务的认知偏差，这可能会影响其结果的普遍性。</td>
<td>对大多数研究来说不是问题：每个研究者本身也有偏见，把GPT这类LLM当做一个有偏见的标注者看待即可。</td>
</tr>
</tbody>
</table>
<blockquote>
<p>&ldquo;WEIRD&rdquo; 是一个缩写词，代表 &ldquo;Western, Educated, Industrialized, Rich, and Democratic&rdquo;（西方的、受过教育的、工业化的、富裕的和民主的）</p>
</blockquote>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型将文本数据转化为结构化数据</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>教程 | 使用大模型将文本编码为结构化数据(本地Ollama篇)</title>
      <link>https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/</link>
      <pubDate>Fri, 14 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/</guid>
      <description>实验数据为外卖评论， 今天咱们做个有难度的文本分析任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)。文本分析（也称为文本挖掘或自然语言处理，NLP）是指使用计算机算法和技术从大量文本数据中提取有价值信息的过程。文本分析的目标是从非结构化的文本数据中识别模式、提取关键信息、理解语义，并将其转化为结构化数据以便进一步分析和应用。</description>
      <content:encoded><![CDATA[<p>实验数据为外卖评论， 今天咱们做个有难度的文本分析任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)。</p>
<p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="一文本分析">一、文本分析</h2>
<p><strong>文本分析</strong>（也称为<strong>文本挖掘</strong>或<strong>自然语言处理</strong>，NLP）是指使用计算机算法和技术从大量文本数据中提取有价值信息的过程。文本分析的目标是从非结构化的文本数据中识别模式、提取关键信息、理解语义，并将其转化为结构化数据以便进一步分析和应用。 常用的文本分析方法有:</p>
<ul>
<li>词频统计</li>
<li>情感分析</li>
<li>文本分类</li>
<li>话题分析</li>
<li>&hellip;</li>
</ul>
<p><img loading="lazy" src="img/text-2-structured-data.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二大模型云服务商">二、大模型云服务商</h2>
<p>随着 chatGPT、deepseek、通义千问这类大语言模型(<strong><em>LLM</em></strong>, large language model)的出现， 它们增强了文本理解能力，能够更精准的把握文本中的语义和情绪等信息，使得文本分析任务实现难度大大降低。</p>
<p>一般大模型服务提供商，有免费开源和封闭付费两种服务。</p>
<ul>
<li>免费模型， 可通过 <strong>Ollama</strong> 本地部署。部署教程可参考 <a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li>付费模型， 账户有钱的情况下， 通过联网调用大模型厂商的 API 接口。</li>
</ul>
<p>使用 Python 代码， 联网调用大模型的 API，我们首先需要确定三个参数：</p>
<ul>
<li><strong><em>BASE_URL</em></strong> 服务提供商运行大模型的网址。 如果是本地离线(Ollama)， BASE_URL 一般为 http://127.0.0.1:11434/v1</li>
<li><strong><em>API_KEY</em></strong> 调用服务所需密钥，类似于钥匙</li>
<li><strong><em>MODEL_NAME</em></strong> 调用哪种模型(名字)</li>
</ul>
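<p>这三个参数最终如何进入一次 API 请求， 可以用一个不实际联网的组装示意来说明（OpenAI 兼容接口约定 Bearer 鉴权与 /chat/completions 路径；API_KEY 为占位符，请替换为真实密钥）：</p>

```python
BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
API_KEY = "sk-xxxx"       # 占位符，切勿将真实密钥公开
MODEL_NAME = "qwen-plus"

# OpenAI 兼容接口的请求由这三项组装而成
endpoint = f"{BASE_URL}/chat/completions"
headers = {"Authorization": f"Bearer {API_KEY}"}
payload = {"model": MODEL_NAME,
           "messages": [{"role": "user", "content": "你好"}]}
print(endpoint)
```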
<p>阿里云不需要注册，支付宝扫码登录，即可调用市面上常见的大模型，如 <strong>通义千问qwen</strong>、<strong>Llama</strong>、<strong>deepseek</strong>、<strong>chatGLM</strong>等。现在我们以阿里云服务商为例， 依次获取<strong>BASE_URL</strong>、<strong>API_KEY</strong>、<strong>MODEL_NAME</strong>。</p>
<br>
<h3 id="21-充钱">2.1 充钱</h3>
<p><a href="https://billing-cost.console.aliyun.com/home">阿里云</a>替咱们在云服务商运行大模型，肯定不能是免费的。 所以先检查下账号里是否有钱，没钱了记得充值哦。 点击链接 <a href="https://billing-cost.console.aliyun.com/home">https://billing-cost.console.aliyun.com/home</a></p>
<p><img loading="lazy" src="img/02-charge.png" alt=""  />
</p>
<br>
<h3 id="22-base_url">2.2 BASE_URL</h3>
<p>阿里云运行大模型的网址 <strong><em>BASE_URL</em></strong> 为 <code>https://dashscope.aliyuncs.com/compatible-mode/v1</code></p>
<br>
<h3 id="23-api_key">2.3 API_KEY</h3>
<p>点击 <a href="https://bailian.console.aliyun.com/">阿里云百炼 https://bailian.console.aliyun.com/</a>，打开后点击右上角<img loading="lazy" src="https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/0278981271/p824758.png" alt="image"  />
图标，在下拉菜单中单击<strong>API-KEY</strong>。</p>
<p><img loading="lazy" src="img/01-bai-lian.png" alt=""  />
</p>
<br>
<p>在左侧导航栏，选择 <strong>全部 API-KEY</strong> 或 <strong>我的 API-KEY</strong> ，然后<strong>创建</strong>（图中位置 ①）或<strong>查看</strong>（图中位置 ②）<strong><em>API Key</em></strong>。</p>
<p><img loading="lazy" src="img/02-api-key.png" alt=""  />
</p>
<br>
<p><strong>注意:</strong> 请不要将 <strong><em>API Key</em></strong> 以任何方式公开，避免因未经授权的使用造成安全风险或资金损失。</p>
<br>
<h3 id="24-model_name">2.4 MODEL_NAME</h3>
<p><a href="https://help.aliyun.com/zh/model-studio/getting-started/models">通义千问的模型列表 https://help.aliyun.com/zh/model-studio/getting-started/models</a>， 根据任务需要，选择适合的模型。</p>
<p><img loading="lazy" src="img/03-model-name.png" alt=""  />
</p>
<p>上图仅展示了阿里云服务提供的部分大模型， 以通义千问旗舰模型为例， <strong>MODEL_NAME</strong>模型名分别为 <em><strong>qwen-max</strong></em>、<em><strong>qwen-plus</strong></em>、<em><strong>qwen-turbo</strong></em>、<em><strong>qwen-long</strong></em>。</p>
<br>
<h2 id="三环境配置">三、环境配置</h2>
<p>在 Python 中调用大模型， 不论是本地离线 API 还是云服务 API， 先要配置好相应的环境。 cntext2.x 支持 Ollama 和 LM Studio 结构化输出， 本文使用<strong>Ollama+cntext2.x</strong> 组合。</p>
<h3 id="31-安装软件-ollama">3.1 安装软件 Ollama</h3>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其模型库中获取各类 LLM，一条命令即可完成下载，再执行一条命令即可开始使用，对习惯在终端窗口中工作的用户非常友好。Ollama 的安装、配置、使用的详细教程可阅读 <a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></p>
<p><img loading="lazy" src="img/04-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="32-安装-cntext2x">3.2 安装 cntext2.x</h3>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.x</a>是大邓开发的文本分析库， 内置了丰富的文本分析函数， 如词频统计、词典法情感分析、经济政策不确定性 epu 等， 大大降低了文本分析难度。 以本文大模型文本分析为例， Python 源代码需要 <strong>80+</strong> 行， 经过大邓封装， 使用 cntext2.x 内置函数 <strong><em>text_analysis_by_llm</em></strong> 仅需要不到 <strong>5</strong> 行代码。</p>
<p><strong><em>安装包 cntext-2.1.7-py3-none-any.whl</em></strong> 是付费内容(100 元)， 如需使用<strong>加微信: 372335839</strong>，备注「<strong>姓名-学校-专业-cntext</strong>」</p>
<p>所有 <strong><em>cntext2.x</em></strong> 安装方法类似， 以目前 <strong><em>cntext2.1.7</em></strong> 为例，将 <strong><em>cntext-2.1.7-py3-none-any.whl</em></strong> 放置于桌面，打开 <strong><em>cmd</em></strong> (苹果电脑打开 terminal)， 输入 <strong><em>cd desktop</em></strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cd desktop
</code></pre></div><p>之后在 <strong><em>cmd</em></strong> (苹果电脑打开 terminal) 中使用 <strong><em>pip3</em></strong> 安装</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext-2.1.7-py3-none-any.whl
</code></pre></div><p>需要注意， <strong>cntext2.x 使用环境为 Python3.8 及以上版本</strong>； 文章开头和文章末都有 <strong><em>cntext-2.1.7-py3-none-any.whl</em></strong> 获取方式说明。</p>
<p><br><br></p>
<h2 id="四实验代码">四、实验代码</h2>
<h3 id="41-启动本地服务ollama">4.1 启动本地服务(Ollama)</h3>
<p>使用 <strong>cntext2.x</strong> 调用本地电脑安装的大模型进行文本分析，不需要设置<strong>BASE_URL</strong>、<strong>API_KEY</strong> 这两个参数。</p>
<p>本节使用本地安装的模型， 先在命令行<strong>cmd</strong> (mac 对应 terminal) 中检查本地已安装的模型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama list
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">NAME                       ID              SIZE      MODIFIED
qwen2.5:7b                 845dbda0ea48    4.7 GB    7 days ago
qwen2.5:3b                 357c53fb659c    1.9 GB    7 days ago
qwen2.5:0.5b               a8b0c5157701    397 MB    7 days ago
qwen2.5:1.5b               65ec06548149    986 MB    7 days ago
deepseek-r1:1.5b           a42b25d8c10a    1.1 GB    7 days ago
deepseek-r1:7b             0a8c26691023    4.7 GB    7 days ago
nomic-embed-text:latest    0a109f422b47    274 MB    9 months ago
</code></pre></div><br>
<p>在 <strong><em>cmd</em></strong> 中使用命令 <strong><em>ollama serve</em></strong> 启动本地服务。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2025/02/14 16:00:18 routes.go:1259: INFO server config env=&#34;map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]&#34;
time=2025-02-07T16:00:18.551+08:00 level=INFO source=images.go:757 msg=&#34;total blobs: 11&#34;
time=2025-02-07T16:00:18.551+08:00 level=INFO source=images.go:764 msg=&#34;total unused blobs removed: 0&#34;
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in &#34;debug&#34; mode. Switch to &#34;release&#34; mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)
er.(*Server).GenerateRoutes.func1 (5 handlers)
......
time=2025-02-14T16:00:18.553+08:00 level=INFO source=routes.go:1339 msg=&#34;Dynamic LLM libraries&#34; runners=[metal]
time=2025-02-14T16:00:18.577+08:00 level=INFO source=types.go:131 msg=&#34;inference compute&#34; id=0 library=metal variant=&#34;&#34; compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;

</code></pre></div><p><strong><em>cmd</em></strong> 之中出现上方信息，证明服务已经启动。 如果之前已经启动服务， 会看到信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Error: listen tcp 127.0.0.1:11434: bind: address already in use
</code></pre></div><p>接下来，我们在 Python 中调用模型 <strong><em>qwen2.5:7b</em></strong></p>
<br>
<h3 id="42-读取数据">4.2 读取数据</h3>
<p><strong>实验数据为外卖评论， 今天咱们做个有难度的任务，从不同维度(味道、速度、服务)对外卖评论进行打分(-1.0~1.0)</strong>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#构造实验数据</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;速度非常快，口味非常好， 服务非常棒！&#39;</span><span class="p">,</span>
        <span class="s1">&#39;送餐时间还是比较久&#39;</span><span class="p">,</span>
        <span class="s1">&#39;送单很快，菜也不错赞&#39;</span><span class="p">,</span>
        <span class="s1">&#39;太难吃了&#39;</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;comment&#39;</span><span class="p">])</span>

<span class="c1">#假设有外卖评论数据集data.csv， 文件内有字段comment， 直接读取数据。</span>
<span class="c1">#df = pd.read_csv(&#39;data.csv&#39;)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="43-小实验ctllm">4.3 小实验ct.llm</h3>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span>
<span class="n">MODEL_NAME</span> <span class="o">=</span> <span class="s1">&#39;qwen2.5:7b&#39;</span>

<span class="c1">#味道、速度、服务</span>
<span class="n">OUTPUT_FORMAT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>

<span class="n">COMMENT_CONTENT</span> <span class="o">=</span> <span class="s1">&#39;太难吃了&#39;</span>

<span class="c1">#调用 ct.llm 对单条评论进行结构化分析</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">COMMENT_CONTENT</span><span class="p">,</span>
                                 <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span><span class="p">,</span>
                                 <span class="n">model_name</span><span class="o">=</span><span class="n">MODEL_NAME</span><span class="p">,</span>
                                 <span class="n">output_format</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
                                                <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
                                                <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">},</span>
                                 <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
                                 <span class="n">return_df</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;taste&#39;: -1.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 0.0}
</code></pre></div><br>
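拿到大模型返回的结构化结果后，建议先做一步简单校验：键是否齐全、分值是否落在 -1.0 ~ 1.0 区间，通过后再进入批量分析。下面是一个示意函数（`validate_scores` 是为演示自拟的辅助函数，并非 cntext 提供的接口）。

```python
# 校验大模型返回的维度分值字典是否符合预期结构
# 假设输出格式为 {'taste': float, 'speed': float, 'service': float}

def validate_scores(result, keys=('taste', 'speed', 'service'),
                    low=-1.0, high=1.0):
    """键齐全且每个分值都在 [low, high] 区间内时返回 True。"""
    if not isinstance(result, dict):
        return False
    for k in keys:
        v = result.get(k)
        if not isinstance(v, (int, float)) or not (low <= v <= high):
            return False
    return True

print(validate_scores({'taste': -1.0, 'speed': 0.0, 'service': 0.0}))  # True
print(validate_scores({'taste': -2.0, 'speed': 0.0, 'service': 0.0}))  # False，分值越界
```

校验失败的记录可以触发重试（对应 ct.llm 的 max_retries 参数）或单独记录下来人工检查。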
<h3 id="44-内置prompt">4.4 内置Prompt</h3>
<p>cntext2.x内置了常用的10种中文文本分析任务， 每个任务都有一个默认的 prompt 模板，用户可以直接使用默认模板或者参考模板进行自定义。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#查看有哪些任务</span>
<span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;sentiment&#39;,
 &#39;emotion&#39;,
 &#39;classify&#39;,
 &#39;intent&#39;,
 &#39;keywords&#39;,
 &#39;entities&#39;,
 &#39;summarize&#39;,
 &#39;rewrite&#39;,
 &#39;quality&#39;,
 &#39;similarity&#39;]
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#获取sentiment模板</span>
<span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">tasks_get</span><span class="p">(</span><span class="s1">&#39;sentiment&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;prompt&#39;: &#39;分析评论的情感倾向：返回情感类别 label（pos 表示正面，neg 表示负面，neutral 表示中性）和情感分值 score（取值范围 -1~1，负数为负面）&#39;,
 &#39;output_format&#39;: {&#39;label&#39;: &#39;str&#39;, &#39;score&#39;: &#39;float&#39;}}
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">#使用sentiment提示词模板。
#启用Ollama服务，调用qwen2.5:7b模型
ct.llm(&#34;服务很棒！&#34;, task=&#34;sentiment&#34;, backend=&#34;ollama&#34;,  model_name=&#34;qwen2.5:7b&#34;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[cntext2x] ✅ 连接模型服务: http://127.0.0.1:11434/v1
{&#39;label&#39;: &#39;pos&#39;, &#39;score&#39;: 0.8}
</code></pre></div><br>
<h3 id="45-云服务商-api">4.5 云服务商 API</h3>
<p>使用 <strong>cntext2.x</strong> 调用云服务商大模型进行文本分析，需要设置<strong>BASE_URL</strong>、<strong>API_KEY</strong>等参数。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s1">&#39;https://dashscope.aliyuncs.com/compatible-mode/v1&#39;</span>
<span class="n">API_KEY</span> <span class="o">=</span> <span class="s1">&#39;你的API-KEY&#39;</span>
<span class="n">MODEL_NAME</span> <span class="o">=</span> <span class="s1">&#39;qwen-max&#39;</span>

<span class="c1">#味道、速度、服务</span>
<span class="n">OUTPUT_FORMAT</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>

<span class="n">COMMENT_CONTENT</span> <span class="o">=</span> <span class="s1">&#39;太难吃了&#39;</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">COMMENT_CONTENT</span><span class="p">,</span>
                <span class="n">prompt</span><span class="o">=</span><span class="n">PROMPT</span><span class="p">,</span>
                <span class="n">base_url</span><span class="o">=</span><span class="n">BASE_URL</span><span class="p">,</span>
                <span class="n">api_key</span><span class="o">=</span><span class="n">API_KEY</span><span class="p">,</span>
                <span class="n">model_name</span><span class="o">=</span><span class="n">MODEL_NAME</span><span class="p">,</span>
                <span class="n">output_format</span><span class="o">=</span><span class="n">OUTPUT_FORMAT</span><span class="p">,</span>
                <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
                <span class="n">return_df</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;taste&#39;: -1.0, &#39;speed&#39;: 0.0, &#39;service&#39;: 0.0}
</code></pre></div><p>小实验成功，现在设计分析函数， 对所有的评论进行分析，输出 dataframe 格式，保存到 csv 中。</p>
<br>
<br>
<h3 id="46-设计分析函数">4.6 设计分析函数</h3>
<p>使用 <strong>cntext2.x</strong> 中的大模型文本分析函数 <strong>ct.llm(text, task, backend, base_url, api_key, model_name, temperature, max_retries, return_df, verbose, prompt, output_format)</strong>，参数说明如下。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- text (str): 待分析的文本内容
- task (str): 预设任务名称，默认为 &#39;sentiment&#39;。可用任务见 TASKS.keys()
- backend (str, optional): 快捷后端别名：
     - &#39;ollama&#39; → http://127.0.0.1:11434/v1
     - &#39;lmstudio&#39; 或 &#39;lms&#39; → http://localhost:1234/v1
     - None → 需配合 base_url 使用
- base_url (str, optional): 自定义模型服务地址，优先级高于 backend。 示例：
     - 远程：https://dashscope.aliyuncs.com/compatible-mode/v1
     - 内网：http://192.168.1.10:11434/v1
     - 本地：http://localhost:1234/v1
- api_key (str): API 密钥，远程服务必填，本地通常为 &#34;EMPTY&#34;
- model_name (str): 模型名称（需服务端已加载）
- temperature (float): 生成温度，0 表示确定性输出
- max_retries (int): 失败重试次数
- return_df (bool): 是否返回 DataFrame
- verbose (bool): 是否输出连接信息
- prompt (str, optional): 自定义系统提示语
- output_format (dict, optional): 自定义输出结构，如 {&#39;label&#39;: str, &#39;score&#39;: float}
</code></pre></div><p>以调用云服务商大模型为例， 基于 <strong>ct.llm</strong> 设计分析函数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#分析函数</span>
<span class="k">def</span> <span class="nf">llm_analysis</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">llm</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
                                <span class="n">prompt</span><span class="o">=</span> <span class="s1">&#39;从口味taste、速度speed、服务service三个维度， 对外卖评论内容进行文本分析， 分别返回不同维度的分值(分值范围-1.0 ~ 1.0)&#39;</span><span class="p">,</span>
                                <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;https://dashscope.aliyuncs.com/compatible-mode/v1&#39;</span><span class="p">,</span>
                                <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;你的API-KEY&#39;</span><span class="p">,</span>
                                <span class="n">model_name</span><span class="o">=</span><span class="s1">&#39;qwen-max&#39;</span><span class="p">,</span>
                                <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
                                <span class="n">output_format</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;taste&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;speed&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="s1">&#39;service&#39;</span><span class="p">:</span> <span class="nb">float</span><span class="p">}</span>
                               <span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>


<span class="c1">#批量运算</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;comment&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">llm_analysis</span><span class="p">)</span>
<span class="n">res_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">,</span> <span class="n">df2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1">#保存分析结果</span>
<span class="n">res_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">res_df</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
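批量 apply 时，单条请求偶尔会因网络或解析失败而抛异常；为避免整个批次中断，可以给分析函数再套一层兜底，失败的行填 NaN。下面用一个替身函数演示这种写法（`fake_llm` 仅作演示，实际使用时应换成上文的 ct.llm 调用）。

```python
import pandas as pd
import numpy as np

KEYS = ['taste', 'speed', 'service']

def fake_llm(text):
    # 替身函数：空文本视为失败，模拟偶发的调用异常
    if not text:
        raise ValueError('empty comment')
    return {k: 0.0 for k in KEYS}

def safe_analysis(text):
    # 兜底：失败时返回 NaN，保证 apply 不会中途停止
    try:
        return pd.Series(fake_llm(text))
    except Exception:
        return pd.Series({k: np.nan for k in KEYS})

df = pd.DataFrame({'comment': ['太难吃了', '', '送单很快']})
res_df = pd.concat([df, df['comment'].apply(safe_analysis)], axis=1)
print(res_df)
```

事后用 `res_df[KEYS].isna().any(axis=1)` 就能筛出失败的行，单独补跑。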
<br>
<br>
<h2 id="五获取-cntext2x">五、获取 cntext2.x</h2>
<p>安装包<strong>cntext-2.1.7-py3-none-any.whl</strong> 是付费内容(<strong><em>100 元</em></strong>)， 如需使用<strong>加微信: 372335839</strong>，备注「<strong>姓名-学校-专业-cntext</strong>」</p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/"><strong>教程 | 使用大模型将文本数据转化为结构化数据(LMstudio篇)</strong></a></li>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>PNAS | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext2.x 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用 Ollama 本地大模型 DIY 制作单词书教案 PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用本地大模型预测在线评论情感类别和分值</title>
      <link>https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/</link>
      <pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/</guid>
      <description>情感分析是分析文本以确定消息的情绪基调是积极、消极还是中性的过程。通过情感分析，我们可以了解文本是否表现出快乐、悲伤、愤怒等情绪。主要的计算方法有语义词典法、机器学习法、混合方法、其他方法。 随着chatGPT这类大语言模型的出现， 它们增强了文本理解能力，使我们能够更精准的把握文本中的语义和情绪，也因此大型语言模型 (LLM) 一出场就有实现情感分析功能。Sentiment analysis is the process of analyzing text to determine whether the emotional tone of a message is positive, negative, or neutral. Through sentiment analysis, we can understand whether the text expresses emotions such as happiness, sadness, anger, etc. The main computational methods are semantic dictionary method, machine learning method, hybrid method, and other methods. With the emergence of large language models such as chatGPT, they enhance text understanding capabilities, allowing us to more accurately grasp the semantics and emotions in the text. Therefore, large language models (LLMs) have implemented sentiment analysis functions as soon as they appeared.</description>
      <content:encoded><![CDATA[<p>情感分析是分析文本以确定消息的情绪基调是积极、消极还是中性的过程。通过情感分析，我们可以了解文本是否表现出快乐、悲伤、愤怒等情绪。主要的计算方法有语义词典法、机器学习法、混合方法、其他方法。 随着chatGPT这类大语言模型的出现， 它们增强了文本理解能力，使我们能够更精准的把握文本中的语义和情绪，也因此大型语言模型 (LLM) 一出场就有实现情感分析功能。</p>
<p><img loading="lazy" src="img/Sentiment-Analysis-methods.png" alt=""  />
</p>
<h2 id="一任务描述">一、任务描述</h2>
<p>大邓准备了200条外卖评论数据(下图蓝色框)， 已进行标注, 其中负面110条，正面90条。</p>
<p>现在想设计一个Prompt， 使用中文大模型对 <em><strong>review</strong></em> 文本进行情感类别(pos/neg)的预测(红色框)， 最终会计算大模型预测的准确率。</p>
<p><img loading="lazy" src="img/00-purpose.png" alt=""  />
</p>
<p>先提前剧透一下， 模型预测的准确率87.5%。这种准确率，用到经管社科研究中， 应该没啥问题。</p>
<p><br><br></p>
<h2 id="二传统模式-vs-大语言模型">二、传统模式 VS 大语言模型</h2>
<p>大语言模型 (LLM) 因其在理解和生成人类语言方面的熟练程度而在情绪分析方面表现出色。通过对各种数据和算法进行训练，LLM 可以检测文本中的细微差别，从而增强其在社交媒体、新闻文章和客户评论等平台上掌握人们情绪和观点的能力。它们捕捉上下文和情感线索的能力提高了情绪分析的准确性和深度。</p>
<p>情感分析领域，传统模式与大语言模型 (LLM) 的比较</p>
<ul>
<li>传统的内容分析方法可能难以准确捕捉细微的情绪。</li>
<li>LLM 使用深度学习和迁移学习等先进技术，擅长理解不同的语言表达。</li>
<li>LLM 在跨文本源（包括社交媒体帖子和新闻文章）的情感分析方面具有卓越的准确性和效率。</li>
</ul>
<p><br><br></p>
<h2 id="三ollama">三、Ollama</h2>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其库中访问各种 LLM，只需一个命令即可下载。下载后，只需执行一个命令即可开始使用。这对于工作量围绕终端窗口的用户非常有帮助。Ollama的安装、配置、使用的详细教程可阅读  <a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></p>
<br>
<h3 id="31-安装模型">3.1 安装模型</h3>
<p>假设电脑中已安装了Ollama软件，</p>
<ul>
<li><em><strong>qwen</strong></em>： 阿里的通义千问大模型， 主要适用于中文场景， 英文也可。</li>
<li><em><strong>llama</strong></em>：Meta发布的LLama大模型，主要适用于英文场景， 中文也可。</li>
<li><em><strong>deepseek</strong></em>： 幻方量化的DeepSeek模型，适用于中英文场景。</li>
</ul>
<p><img loading="lazy" src="img/00-qwen.png" alt=""  />
</p>
<p>本文实验对象为中文内容(中文外卖在线评论)， 之前我尝试过deepseek， 感觉运行速度较慢， 本文选择 <em><strong>qwen</strong></em> (最新的模型是qwen2.5)， <em><strong>我们尝试一次性安装多个模型， 测试运行速度和任务完成的准确率</strong></em>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama run qwen2.5:0.5b
ollama run qwen2.5:1.5b
ollama run qwen2.5:3b
ollama run qwen2.5:7b
</code></pre></div><br>
<h3 id="32-安装python包">3.2 安装python包</h3>
<p>打开电脑命令行cmd(mac是terminal),  网络是连网状态，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
pip3 install instructor
</code></pre></div><p>查看当前版本ollama(0.2.1)和instructor(1.11.2)</p>
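查看版本可以在 Python 中用标准库 importlib.metadata 完成，下面是一个通用写法（包未安装时返回提示字符串而不是直接报错）。

```python
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name):
    # 返回已安装包的版本号；未安装时返回提示字符串
    try:
        return version(name)
    except PackageNotFoundError:
        return f'{name} 未安装'

for name in ['ollama', 'instructor']:
    print(name, '->', pkg_version(name))
```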
<br>
<h3 id="33-启动ollama服务">3.3 启动ollama服务</h3>
<p>在电脑中找到软件 Ollama， 双击打开，即可开启Ollama服务。</p>
<p><br><br></p>
<h2 id="四实验">四、实验</h2>
<h3 id="41-代码结构">4.1 代码结构</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">project
  - code.ipynb                   #代码
  - data.csv                     #在线评论数据
  - qwen2.5-0.5b-result.csv      #qwen2.5:0.5b预测结果
  - qwen2.5-1.5b-result.csv      #qwen2.5:1.5预测结果
  - qwen2.5-3b-result.csv        #qwen2.5:3b预测结果
  - qwen2.5-7b-result.csv        #qwen2.5:7b预测结果
  - async-qwen2.5-7b-result.csv  #qwen2.5:7b异步代码预测结果
</code></pre></div><br>
<h3 id="42-读取数据">4.2 读取数据</h3>
<p><em><strong>data.csv</strong></em> 内存储着200条外卖评论，均已标注(label字段，其中1为正面， 0为负面)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<br>
<p>字段的数据类型</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">label         int64
review       object
dtype: object
</code></pre></div><br>
<p>label数值的分布</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">label</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">label
0    110
1     90
Name: count, dtype: int64
</code></pre></div><br>
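这里可以顺手算一个多数类基线：如果把全部 200 条都预测为负面，准确率是 110/200 = 55%，后文 87.5% 的结果应与这个基线对比来看（示意计算）。

```python
# 多数类基线：把所有样本都预测为样本量最多的类别
counts = {'neg': 110, 'pos': 90}
baseline = max(counts.values()) / sum(counts.values())
print(f'多数类基线准确率: {baseline:.1%}')  # 55.0%
```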
<h3 id="43-设计提示promp">4.3 设计提示Prompt</h3>
<p>需要根据评论内容，返回文本的情感类别和对应的情感得分， 提示如下，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">PROMPT_TEXT = &#34;根据评论内容，返回文本的情感类别(pos、neg、neo)和对应的情感得分(取值范围 -1~1)&#34;
</code></pre></div><p><strong>注意: PROMPT_TEXT会影响模型表现， 大邓设计得非常粗糙， 建议大家DIY设计自己的PROMPT_TEXT</strong>。</p>
<br>
<h3 id="44-小实验">4.4 小实验</h3>
<p>使用参考推文 <a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/">实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</a> ，可确保情感分析的结果为指定格式。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">instructor</span>

<span class="c1">#结构化输出</span>
<span class="k">class</span> <span class="nc">Sentiment</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">senti_label</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">senti_score</span><span class="p">:</span> <span class="nb">float</span>


<span class="c1">#Prompt提示</span>
<span class="n">PROMPT_TEXT</span> <span class="o">=</span> <span class="s2">&#34;根据评论内容，返回文本的情感类别(pos、neg、neo)和对应的情感得分(取值范围 -1~1)&#34;</span>

<span class="c1">#实验数据</span>
<span class="n">COMMENT_CONTENT</span> <span class="o">=</span> <span class="s1">&#39;11点14订餐，13点20饭才到，2个小时才把我的午饭送到，而且还是打了2次客服电话，1次投诉电话才给送来，要是不打电话都不知道几点能吃上午饭？&#39;</span>


<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span>
    <span class="n">OpenAI</span><span class="p">(</span>
        <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://localhost:11434/v1&#34;</span><span class="p">,</span>
        <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;NA&#34;</span><span class="p">,</span>  <span class="c1"># required, but unused</span>
    <span class="p">),</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">Mode</span><span class="o">.</span><span class="n">JSON</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">&#34;qwen2.5:7b&#34;</span><span class="p">,</span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT_TEXT</span><span class="p">},</span> <span class="c1">#提示PROMP</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">COMMENT_CONTENT</span><span class="p">}</span> <span class="c1">#评论文本</span>
    <span class="p">],</span>
    <span class="n">response_model</span> <span class="o">=</span> <span class="n">Sentiment</span><span class="p">,</span>
    <span class="n">max_retries</span> <span class="o">=</span> <span class="mi">3</span>
<span class="p">)</span>


<span class="nb">print</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">model_dump</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#34;senti_label&#34;: &#34;neg&#34;, &#34;senti_score&#34;: -0.65}

CPU times: user 44.4 ms, sys: 6.46 ms, total: 50.8 ms
Wall time: 1 s
</code></pre></div><p>运行一条评论耗时 1 s， 该评论为 <em><strong>负面neg</strong></em>,  情感分 <em><strong>-0.65</strong></em>。</p>
<br>
<br>
<h2 id="五-完整代码">五、 完整代码</h2>
<p>由于大模型推理速度较慢，一次提问耗时几秒， 如果大规模使用大模型进行数据标注， 速度慢得令人抓狂。 这时候写代码就有同步代码和异步代码之分。</p>
<ul>
<li><strong>同步代码</strong> 按照顺序执行，每个任务必须等待前一个任务完成后才能开始。适用于处理少量数据或不需要高并发性能的情况。</li>
<li><strong>异步代码</strong> 允许并发执行多个任务，适合处理大量数据时提高效率。使用<code>asyncio</code>库来实现异步操作。</li>
</ul>
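两者的差别可以用一个与大模型无关的最小 asyncio 示意来感受：10 个各耗时 0.2 秒的任务，串行约需 2 秒，并发后接近 0.2 秒（此处用 asyncio.sleep 模拟一次模型请求，用 Semaphore 控制并发上限，数值均为演示假设）。

```python
import asyncio
import time

async def fake_request(i, sem):
    # 用 asyncio.sleep 模拟一次耗时 0.2 秒的大模型请求
    async with sem:
        await asyncio.sleep(0.2)
        return i

async def main():
    sem = asyncio.Semaphore(10)  # 并发上限，实际可按服务承载能力调小
    tasks = [fake_request(i, sem) for i in range(10)]
    return await asyncio.gather(*tasks)  # 按任务顺序返回结果

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f'10 个任务并发完成，耗时约 {elapsed:.2f} 秒')  # 远小于串行的 2 秒
```

注意：在 Jupyter notebook 中已有事件循环，`asyncio.run` 需换成 `await main()`。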
<p>本章节是情感分析实验代码的收官章节， 设计了 <strong>同步代码</strong> 和 <strong>异步代码</strong> 两个版本， 并在本章末进行了任务耗时(速度)对比。</p>
<h3 id="51-同步代码">5.1 同步代码</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">instructor</span>

 

<span class="c1">#结构化输出</span>
<span class="k">class</span> <span class="nc">Sentiment</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">senti_label</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">senti_score</span><span class="p">:</span> <span class="nb">float</span>
    

<span class="c1">#Prompt提示</span>
<span class="n">PROMPT_TEXT</span> <span class="o">=</span> <span class="s2">&#34;根据评论内容，返回文本的情感类别(pos、neg)和情感得分(取值范围 -1~1)&#34;</span> 

<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span>
    <span class="n">OpenAI</span><span class="p">(</span>
        <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://localhost:11434/v1&#34;</span><span class="p">,</span>
        <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;NA&#34;</span><span class="p">,</span>  <span class="c1"># required, but unused</span>
    <span class="p">),</span>
    <span class="c1">#mode = instructor.Mode.JSON,</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">Mode</span><span class="o">.</span><span class="n">MD_JSON</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">scores</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c1">#读取数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data.csv&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">review</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;review&#39;</span><span class="p">]):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span> <span class="o">=</span> <span class="s1">&#39;qwen2.5:7b&#39;</span><span class="p">,</span>  <span class="c1">#选择模型。 0.5b、1.5b、3b、7b等</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT_TEXT</span><span class="p">},</span>  <span class="c1">#提示</span>
            <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">review</span><span class="p">}</span>   <span class="c1">#评论文本</span>
        <span class="p">],</span>
        <span class="n">response_model</span> <span class="o">=</span> <span class="n">Sentiment</span><span class="p">,</span>
        <span class="n">max_retries</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="p">)</span>
        
        <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">senti_label</span><span class="p">)</span>
        <span class="n">scores</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">senti_score</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
        <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;NA&#39;</span><span class="p">)</span>
        <span class="n">scores</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;NA&#39;</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;SentiLabel&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">labels</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;SentiScore&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span>
<span class="c1">#保存结果</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;qwen2.5-7b-result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/07-df.png" alt=""  />
</p>
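得到预测结果后，可以把人工标注 label（1 正/0 负）映射为 pos/neg，与 SentiLabel 逐行对比计算准确率。下面用一小段合成数据演示这个计算思路（实际使用时把构造数据换成 pd.read_csv('qwen2.5-7b-result.csv')）。

```python
import pandas as pd

# 合成几条带标注与预测的示例数据；实际应读取上文保存的预测结果 csv
res_df = pd.DataFrame({
    'label':      [1, 0, 0, 1],
    'SentiLabel': ['pos', 'neg', 'pos', 'pos'],
})

# 人工标注 1/0 映射为 pos/neg，与模型预测对齐后求准确率
res_df['true_label'] = res_df['label'].map({1: 'pos', 0: 'neg'})
accuracy = (res_df['true_label'] == res_df['SentiLabel']).mean()
print(f'准确率: {accuracy:.2%}')  # 4 条中对 3 条，75.00%
```

对不同规模的 qwen2.5 模型分别计算，就能比较 0.5b ~ 7b 各版本的预测准确率。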
<br>
<h3 id="52-异步代码">5.2 异步代码</h3>
<p>相比 <em><strong>5.1 同步代码</strong></em> ， 异步代码运行速度更快。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tqdm.asyncio</span> <span class="kn">import</span> <span class="n">tqdm_asyncio</span>
<span class="c1"># 使用AsyncOpenAI代替OpenAI以支持异步操作</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">AsyncOpenAI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">instructor</span>
<span class="kn">import</span> <span class="nn">asyncio</span>

<span class="c1"># 结构化输出</span>
<span class="k">class</span> <span class="nc">Sentiment</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">senti_label</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">senti_score</span><span class="p">:</span> <span class="nb">float</span>

<span class="c1"># Prompt提示</span>
<span class="n">PROMPT_TEXT</span> <span class="o">=</span> <span class="s2">&#34;根据评论内容，返回文本的情感类别(pos、neg)和情感得分(取值范围 -1~1)&#34;</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span>
    <span class="n">AsyncOpenAI</span><span class="p">(</span>
        <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://localhost:11434/v1&#34;</span><span class="p">,</span>
        <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;NA&#34;</span><span class="p">,</span>  <span class="c1"># required, but unused</span>
    <span class="p">),</span>
    <span class="c1">#mode=instructor.Mode.JSON,</span>
    <span class="n">mode</span><span class="o">=</span><span class="n">instructor</span><span class="o">.</span><span class="n">Mode</span><span class="o">.</span><span class="n">MD_JSON</span><span class="p">,</span>
<span class="p">)</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">analyze_review</span><span class="p">(</span><span class="n">review</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="s1">&#39;qwen2.5:7b&#39;</span><span class="p">,</span>  <span class="c1"># 选择模型。 3b、7b等</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
                <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT_TEXT</span><span class="p">},</span>  <span class="c1"># 提示</span>
                <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">review</span><span class="p">}</span>  <span class="c1"># 评论文本</span>
            <span class="p">],</span>
            <span class="n">response_model</span><span class="o">=</span><span class="n">Sentiment</span><span class="p">,</span>
            <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span>  <span class="c1"># 最大(失败)的重试次数。</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">resp</span><span class="o">.</span><span class="n">senti_label</span><span class="p">,</span> <span class="n">resp</span><span class="o">.</span><span class="n">senti_score</span>
    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Error processing review: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
        <span class="k">return</span> <span class="s1">&#39;NA&#39;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="c1"># 读取数据</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data.csv&#39;</span><span class="p">)</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">analyze_review</span><span class="p">(</span><span class="n">review</span><span class="p">)</span> <span class="k">for</span> <span class="n">review</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;review&#39;</span><span class="p">]]</span>
    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">tqdm_asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
    
    <span class="n">labels</span><span class="p">,</span> <span class="n">scores</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">results</span><span class="p">)</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;SentiLabel&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">labels</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;SentiScore&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span>
    
    <span class="c1"># 保存结果</span>
    <span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;async-qwen2.5-7b-result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="c1"># 检查是否已经在运行的事件循环中</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">asyncio</span><span class="o">.</span><span class="n">get_running_loop</span><span class="p">()</span>
    <span class="c1"># 如果在交互模式下运行，直接调度main()而不使用asyncio.run</span>
    <span class="n">asyncio</span><span class="o">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
<span class="k">except</span> <span class="ne">RuntimeError</span><span class="p">:</span>
    <span class="c1"># 如果没有正在运行的事件循环，使用asyncio.run(main())</span>
    <span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div><br>
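<p>上面的异步代码会把全部请求一次性提交给本地 Ollama 服务。如果评论数量很大，可以用 asyncio.Semaphore 限制并发量。下面是一个仅用标准库的示意，其中 analyze_review 用 asyncio.sleep 模拟一次大模型请求，并发上限 5 也只是示例取值：</p>

```python
import asyncio

CONCURRENCY = 5  # 同时在途的请求上限(示例值)

async def analyze_review(review, sem):
    # 进入信号量后才真正发起请求；真实代码中把 sleep 替换为
    # client.chat.completions.create(...) 调用即可
    async with sem:
        await asyncio.sleep(0.01)  # 模拟一次大模型请求的耗时
        return "pos", 0.9

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)  # 在事件循环内创建，兼容旧版 Python
    reviews = [f"评论{i}" for i in range(20)]
    tasks = [analyze_review(r, sem) for r in reviews]
    return await asyncio.gather(*tasks)  # 结果顺序与 reviews 一致

results = asyncio.run(main())
print(len(results))
```

<p>这样即使有上千条评论，同一时刻也最多只有 5 个请求在途，能避免把本地服务压垮。</p>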
<h3 id="53-速度对比">5.3 速度对比</h3>
<p>以<em><strong>qwen2.5:7b</strong></em>为例， 对本文 <em><strong>data.csv</strong></em> 在线评论数据进行情感分析，</p>
<ul>
<li><strong>普通代码</strong> 运行耗时 <strong>160</strong> 秒</li>
<li><strong>异步代码</strong> 运行耗时 <strong>90</strong> 秒</li>
</ul>
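<p>两种写法的耗时差异也可以用 time.perf_counter 自行测量。下面用 time.sleep / asyncio.sleep 模拟单次请求耗时(条数 20、单次 0.01 秒均为示意取值)，串行耗时约为并发耗时的 N 倍：</p>

```python
import asyncio
import time

N = 20       # 模拟 20 条评论
COST = 0.01  # 假设单次请求耗时 0.01 秒

def sync_run():
    # 普通代码：逐条串行发起请求
    for _ in range(N):
        time.sleep(COST)

async def async_run():
    # 异步代码：所有请求并发进行
    await asyncio.gather(*[asyncio.sleep(COST) for _ in range(N)])

t0 = time.perf_counter()
sync_run()
sync_cost = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(async_run())
async_cost = time.perf_counter() - t0

print(f"同步: {sync_cost:.2f}s, 异步: {async_cost:.2f}s")
```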
<br>
<p><br><br></p>
<h2 id="六评价模型">六、评价模型</h2>
<p>本文分别对0.5b、1.5b、3b、7b进行实验， 记录了200条外卖评论的任务耗时(以同步代码为例)和准确率， 结果如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|  模型  | 模型参数 | 任务耗时(秒) | 准确率 |
| ----- | ------  |  --------  | ----- |
|qwen2.5|   0.5b  |     260s   | 1.5%  |
|qwen2.5|   1.5b  |    48.5s   | 58.5% |
|qwen2.5|    3b   |     140s   |  86%  |
|qwen2.5|    7b   |     160s   |  87.5% |
</code></pre></div><p>综合任务耗时和准确率， 建议使用 <em><strong>qwen2.5:3b</strong></em> 和 <em><strong>qwen2.5:7b</strong></em> 。如果电脑性能很好，直接上  <em><strong>qwen2.5:7b</strong></em>  甚至更大参数的模型。</p>
<h3 id="tips准确率计算方法">Tips:准确率计算方法</h3>
<p>当 label 为1且 <em><strong>SentiLabel</strong></em> 为 pos(或 label 为0且 SentiLabel 为 neg)时，视为大模型判断正确；反之，视为判断错误。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">expression</span> <span class="o">=</span> <span class="s2">&#34;(label == 1) &amp; (SentiLabel == &#39;pos&#39;) | (label == 0) &amp; (SentiLabel == &#39;neg&#39;)&#34;</span>
<span class="n">correct_ratio</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="n">expression</span><span class="p">))</span><span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;准确率: </span><span class="si">{</span><span class="n">correct_ratio</span><span class="o">*</span><span class="mi">100</span><span class="si">}</span><span class="s1">%&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">准确率: 86%
</code></pre></div><p><br><br></p>
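<p>上面的计算方法可以先在一个小的合成 DataFrame 上验证(列名 label、SentiLabel 与正文一致，4 条样本为虚构数据，其中最后 1 条故意判错)：</p>

```python
import pandas as pd

# 4 条虚构样本：前 3 条大模型判断正确，第 4 条判断错误
df = pd.DataFrame({
    "label":      [1, 1, 0, 0],
    "SentiLabel": ["pos", "pos", "neg", "pos"],
})

expression = "(label == 1) & (SentiLabel == 'pos') | (label == 0) & (SentiLabel == 'neg')"
correct_ratio = len(df.query(expression)) / len(df)
print(f"准确率: {correct_ratio*100}%")  # 准确率: 75.0%
```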
<h2 id="七获取代码">七、获取代码</h2>
<p><a href="project.zip"><strong>点击下载本文代码</strong></a></p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-17-gpt-is-an-effective-tool-for-multilingual-psychological-text-analysis/"><strong>文献 | GPT 是多语言心理文本分析的有效工具</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库cntext2.x使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</title>
      <link>https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/</link>
      <pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-07-structured-outputs-with-ollama/</guid>
      <description>开源 LLM 越来越受欢迎，而 Ollama 随后发布的 OpenAI 兼容接口，使得以 JSON 模式获取结构化输出成为可能。读完本篇博文，您将了解如何有效地把 Instructor 与 Ollama 结合使用。但在此之前，让我们先探讨一下 patch(修补)的概念。Open-source LLMS are gaining popularity, and the release of Ollama&amp;#39;s OpenAI compatibility later it has made it possible to obtain structured outputs using JSON schema.By the end of this blog post, you will learn how to effectively utilize instructor with ollama. But before we proceed, let&amp;#39;s first explore the concept of patching.</description>
      <content:encoded><![CDATA[<h2 id="一问题">一、问题</h2>
<p>我们希望LLM的回答的结果具有格式，最好是JSON格式(Python字典)， 这样有利于后续的调用。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#普通格式</span>
<span class="n">姓名</span> <span class="n">张三</span>
<span class="n">年龄</span> <span class="mi">34</span>
<span class="n">兴趣</span> <span class="n">打篮球</span><span class="err">、</span><span class="n">踢足球</span><span class="err">、</span><span class="n">游泳</span><span class="err">、</span><span class="n">打游戏</span>


<span class="c1">#JSON格式</span>
<span class="p">{</span>
  <span class="s2">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;张三&#34;</span><span class="p">,</span>
  <span class="s2">&#34;age&#34;</span><span class="p">:</span> <span class="mi">34</span><span class="p">,</span>
  <span class="s2">&#34;hobby&#34;</span><span class="p">:</span> <span class="p">[</span>
    <span class="s2">&#34;打篮球&#34;</span><span class="p">,</span>
    <span class="s2">&#34;踢足球&#34;</span><span class="p">,</span>
    <span class="s2">&#34;游泳&#34;</span><span class="p">,</span>
    <span class="s2">&#34;打游戏&#34;</span>
  <span class="p">]</span>
<span class="p">}</span>
</code></pre></div><p>如何从 「普通格式」转为 结构化的「JSON格式」？这里就用到 <em><strong>Instructor库</strong></em> 。</p>
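<p>如果不借助大模型，把「普通格式」转成 JSON 也可以手工解析，但需要针对每种文本布局单独写规则，既繁琐又脆弱。下面是一个仅用标准库的手工解析示意(按行拆键值、按顿号拆列表，均为针对本例的假设规则)，可以对比体会后文 Instructor 方案的省力之处：</p>

```python
import json

# 普通格式的自我介绍文本(与上文示例一致)
plain_text = """姓名 张三
年龄 34
兴趣 打篮球、踢足球、游泳、打游戏"""

# 按行切分，再按第一个空格拆成键值对
record = {}
for line in plain_text.splitlines():
    key, value = line.split(maxsplit=1)
    record[key] = value

# 年龄转为整数，兴趣按顿号切分为列表
data = {
    "name": record["姓名"],
    "age": int(record["年龄"]),
    "hobby": record["兴趣"].split("、"),
}
print(json.dumps(data, ensure_ascii=False, indent=2))
```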
<p><br><br></p>
<h2 id="二instructor介绍">二、Instructor介绍</h2>
<p><em><strong>Instructor</strong></em> 是一个 Python 库，它使处理大型语言模型 (LLM) 的结构化输出变得轻而易举。它建立在 Pydantic 之上，提供了一个简单、透明且用户友好的 API 来管理验证、重试和流式响应。</p>
<br>
<h3 id="21-instructor的主要特征">2.1 Instructor的主要特征</h3>
<ul>
<li><strong>定义输出样式</strong>：指定 Pydantic 模型来定义 LLM 输出的结构</li>
<li><strong>失败重试管理</strong>：轻松配置请求失败的重试次数</li>
<li><strong>样式验证</strong>：使用 Pydantic 验证确保 LLM 响应符合您的期望</li>
<li><strong>灵活的后端</strong>：与 OpenAI 之外的各种 LLM 提供商无缝集成</li>
</ul>
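<p>上面提到的「样式验证」建立在 Pydantic 校验之上。下面用一个不依赖大模型的最小示意演示这种校验(假设本机安装的是 Pydantic v2，model_validate 为其接口)：合法数据会被自动做类型转换，非法数据则抛出 ValidationError。</p>

```python
from pydantic import BaseModel, ValidationError

# 定义期望的输出结构(与 2.3 节样例中的 UserInfo 相同)
class UserInfo(BaseModel):
    name: str
    age: int

# 合法数据：字符串 "30" 会被自动转换为 int
ok = UserInfo.model_validate({"name": "John Doe", "age": "30"})
print(ok.age)  # 30

# 非法数据：age 无法转成整数，抛出 ValidationError
try:
    UserInfo.model_validate({"name": "John Doe", "age": "thirty"})
except ValidationError:
    print("validation failed")
```

<p>Instructor 的失败重试大体就是基于此：当 LLM 返回的结果没通过这种校验时，携带报错信息再次请求。</p>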
<br>
<h3 id="22-安装">2.2 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install instructor
</code></pre></div><br>
<h3 id="23-样例">2.3 样例</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">instructor</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>


<span class="c1"># Define your desired output structure</span>
<span class="k">class</span> <span class="nc">UserInfo</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">age</span><span class="p">:</span> <span class="nb">int</span>


<span class="c1"># Patch the OpenAI client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span><span class="n">OpenAI</span><span class="p">())</span>

<span class="c1"># Extract structured data from natural language</span>
<span class="n">user_info</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span>
    <span class="n">response_model</span><span class="o">=</span><span class="n">UserInfo</span><span class="p">,</span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;John Doe is 30 years old.&#34;</span><span class="p">}],</span>
<span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="n">user_info</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c1">#&gt; John Doe</span>
<span class="nb">print</span><span class="p">(</span><span class="n">user_info</span><span class="o">.</span><span class="n">age</span><span class="p">)</span>
<span class="c1">#&gt; 30</span>
</code></pre></div><p>注意，本部分的样例仅供参考：因为 ChatGPT 限制中国大陆用户使用，所以不论是你还是大邓，运行此代码都会失败。但文章末尾会提供本地电脑可运行的实验代码。</p>
<p><br><br></p>
<h2 id="三结构化输出实验">三、结构化输出实验</h2>
<h3 id="31-环境配置">3.1 环境配置</h3>
<p>假设已在本地安装Ollama软件， 也使用ollama安装了相应的大语言模型(如 <em><strong>qwen2.5:0.5b</strong></em>、<em><strong>deepseek-r1:1.5b</strong></em> 等)。 如果之前没有进行这些操作， 请阅读 <a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></p>
<br>
<p>在命令行<em><strong>cmd</strong></em> (mac对应terminal) 中启动本地服务。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><br>
<h3 id="32-代码">3.2 代码</h3>
<p>只要完成 2.2 安装与 3.1 环境配置，本章节的代码就可以运行出结果。 下面不做过多解释，直接上代码，大家看运行结果。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">instructor</span>



<span class="c1">#结构化输出</span>
<span class="k">class</span> <span class="nc">UserDetail</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">age</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">hobby</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    

<span class="c1">#Prompt提示</span>
<span class="n">PROMPT_TEXT</span> <span class="o">=</span> <span class="s2">&#34;根据自我介绍文本内容，从中提取出姓名、年龄、兴趣&#34;</span>

<span class="c1">#实验数据</span>
<span class="n">introduction_text</span> <span class="o">=</span> <span class="s1">&#39;我是张三，今年34岁， 来自黑龙江省， 我的兴趣爱好有打篮球、踢足球、游泳、打游戏。&#39;</span>


<span class="n">client</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">from_openai</span><span class="p">(</span>
    <span class="n">OpenAI</span><span class="p">(</span>
        <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://localhost:11434/v1&#34;</span><span class="p">,</span>
        <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;NA&#34;</span><span class="p">,</span>  <span class="c1"># required, but unused</span>
    <span class="p">),</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="n">instructor</span><span class="o">.</span><span class="n">Mode</span><span class="o">.</span><span class="n">JSON</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">&#34;qwen2.5:0.5b&#34;</span><span class="p">,</span> <span class="c1">#本次任务简单，可以使用最轻量的0.5b模型。 </span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT_TEXT</span><span class="p">},</span>
        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">introduction_text</span><span class="p">}</span>
    <span class="p">],</span>
    <span class="n">response_model</span> <span class="o">=</span> <span class="n">UserDetail</span><span class="p">,</span>
    <span class="n">max_retries</span> <span class="o">=</span> <span class="mi">3</span>
<span class="p">)</span>


<span class="nb">print</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">model_dump_json</span><span class="p">(</span><span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{
  &#34;name&#34;: &#34;张三&#34;,
  &#34;age&#34;: 34,
  &#34;hobby&#34;: [
    &#34;打篮球&#34;,
    &#34;踢足球&#34;,
    &#34;游泳&#34;,
    &#34;打游戏&#34;
  ]
}
CPU times: user 47.2 ms, sys: 5.71 ms, total: 52.9 ms
Wall time: 412 ms
</code></pre></div><p>resp 的数据类型为 UserDetail， 即代码中我们定义的 <em><strong>UserDetail</strong></em> 类。该类具有一些方法，也可直接 <em><strong>resp.dict()</strong></em> 转化为 dict。</p>
<br>
<p>查看 resp 的数据类型</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">dict</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">dict</span><span class="p">()))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;name&#39;: &#39;张三&#39;, &#39;age&#39;: 34, &#39;hobby&#39;: [&#39;打篮球&#39;, &#39;踢足球&#39;, &#39;游泳&#39;, &#39;打游戏&#39;]}
&lt;class &#39;dict&#39;&gt;
</code></pre></div><br>
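<p>补充一点：resp.dict() 是 Pydantic v1 的写法，在 Pydantic v2 中已标记为弃用，等价的推荐写法是 model_dump()。下面是一个脱离大模型、可直接运行的最小示意(假设本机安装的是 Pydantic v2)：</p>

```python
from typing import List
from pydantic import BaseModel

# 与正文 3.2 节定义的 UserDetail 相同
class UserDetail(BaseModel):
    name: str
    age: int
    hobby: List[str]

resp = UserDetail(name="张三", age=34, hobby=["打篮球", "踢足球", "游泳", "打游戏"])

print(resp.model_dump())        # 返回 dict，等价于 v1 的 resp.dict()
print(resp.model_dump_json())   # 返回 JSON 字符串
```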
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型API将文本数据转化为结构化数据</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</title>
      <link>https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/</link>
      <pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-14-how-to-download-large-language-model-with-ollama/</guid>
      <description>Ollama是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。Ollama 可以直接从其库中访问各种 LLM，只需一个命令即可下载。下载后，只需执行一个命令即可开始使用。这对于工作量围绕终端窗口的用户非常有帮助。如果他们被困在某个地方，他们可以在不切换到另一个浏览器窗口的情况下获得答案。Ollama is an open source application that allows you to run, create, and share large language models locally using a command line interface on MacOS, Linux, and Windows. Ollama can access a variety of LLMs directly from its library and can be downloaded with just one command. Once downloaded, it only takes one command to get started. This is very helpful for users whose workload revolves around a terminal window. If they are stuck somewhere, they can get the answer without switching to another browser window.</description>
      <content:encoded><![CDATA[<br>
<h2 id="一ollama">一、Ollama</h2>
<h3 id="11-ollama介绍">1.1 Ollama介绍</h3>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其库中访问各种 LLM，只需一个命令即可下载。下载后，只需执行一个命令即可开始使用。这对于工作量围绕终端窗口的用户非常有帮助。如果他们被困在某个地方，他们可以在不切换到另一个浏览器窗口的情况下获得答案。</p>
<br>
<p>这就是为什么 OLLAMA 是您的工具包中必备的工具：</p>
<ul>
<li><strong>简单</strong> ：OLLAMA 提供简单的设置过程。您无需拥有机器学习博士学位即可启动和运行它。</li>
<li><strong>成本效益</strong> ：在本地运行模型意味着您无需支付云成本。您的钱包会感谢您。</li>
<li><strong>隐私</strong> ：使用 OLLAMA，所有数据处理都在您的本地机器上进行。这对于用户隐私来说是一个巨大的胜利。</li>
<li><strong>多功能性</strong> ：OLLAMA 不只是为 Python 爱好者准备的。它的灵活性使其可以用于各种应用程序，包括 Web 开发。</li>
</ul>
<br>
<h3 id="12-安装ollama">1.2 安装ollama</h3>
<p>点击前往网站 <a href="https://ollama.com/">https://ollama.com/</a> ，下载ollama软件，支持win、Mac、linux</p>
<p><img loading="lazy" src="img/03-ollama-gui.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二ollama操作">二、Ollama操作</h2>
<h3 id="21-选择模型">2.1 选择模型</h3>
<p>ollama软件目前支持多种大模型， 如阿里的(<em><strong>qwen2.5</strong></em>)、meta的(<em><strong>llama3.3</strong></em>) 等。目前ollama最流行的模型，是国产开源大模型 <em><strong>deepseek-r1</strong></em>。本文将安装<em><strong>qwen2.5:0.5b</strong></em>、 <em><strong>qwen2.5:1.5b</strong></em>、 <em><strong>qwen2.5:3b</strong></em>、 <em><strong>qwen2.5:7b</strong></em>、 <em><strong>deepseek-r1:1.5b</strong></em>、<em><strong>deepseek-r1:7b</strong></em>， 并对模型的速度、内容质量进行对比。</p>
<p><img loading="lazy" src="img/04-ollama-model.png" alt=""  />
</p>
<br>
<p><em><strong>DeepSeek-R1</strong></em> 在后训练阶段大规模使用了强化学习技术，在仅有极少标注数据的情况下，极大提升了模型推理能力。<strong>在数学、代码、自然语言推理等任务上，性能比肩 OpenAI o1 正式版</strong>。</p>
<p><img loading="lazy" src="img/01-deepseekr1-performance.png" alt=""  />
</p>
<br>
<h3 id="22-安装模型">2.2 安装模型</h3>
<p><strong>一般b前面的数字越小， 运行模型对电脑性能的要求越低</strong>。</p>
<p><img loading="lazy" src="img/05-deepseek-r1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/05-qwen2.5.png" alt=""  />
</p>
<br>
<p>打开电脑命令行cmd(mac是terminal)， 确保网络连接正常，执行模型下载(安装)命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama run deepseek-r1:1.5b
ollama run deepseek-r1:7b
ollama run qwen2.5:0.5b
ollama run qwen2.5:1.5b
ollama run qwen2.5:3b
ollama run qwen2.5:7b
</code></pre></div><br>
<h3 id="23-查看已安装模型">2.3 查看已安装模型</h3>
<p>在电脑命令行cmd(mac是terminal),  执行命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama list
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Last login: Tue Sep 24 19:26:46 on ttys000
da@deng ~ % ollama list
NAME                       ID              SIZE      MODIFIED        
qwen2.5:0.5b               a8b0c5157701    397 MB    15 minutes ago 
qwen2.5:1.5b               65ec06548149    986 MB    18 minutes ago  
qwen2.5:3b                 357c53fb659c    1.9 GB    12 minutes ago 
qwen2.5:7b                 845dbda0ea48    4.7 GB    21 seconds ago       
deepseek-r1:1.5b           a42b25d8c10a    1.1 GB    48 minutes ago    
deepseek-r1:7b             0a8c26691023    4.7 GB    50 minutes ago    
nomic-embed-text:latest    0a109f422b47    274 MB    9 months ago 
da@deng ~ % 
</code></pre></div><p>可以看到，列表中有 <em><strong>deepseek-r1:1.5b</strong></em> ， 说明在大邓的电脑中， 已经成功安装了 <em><strong>deepseek-r1:1.5b</strong></em> 。</p>
<br>
<h3 id="24-移除模型">2.4 移除模型</h3>
<p>使用 <code>ollama rm 模型名称</code> 移除已安装的某模型。 假设要移除 <em><strong>deepseek-r1:8b</strong></em>， 在电脑命令行cmd(mac是terminal),  执行移除命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama rm deepseek-r1:8b
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">deleted &#39;deepseek-r1:8b&#39;
</code></pre></div><br>
<h3 id="25-启动ollama服务">2.5 启动ollama服务</h3>
<p>在电脑中找到 ollama软件的图标， 双击打开即可开启 Ollama 服务。</p>
<p>如果觉得点击启动太麻烦，也可使用命令行操作， 打开电脑命令行cmd(mac是terminal), 执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2025/02/07 16:00:18 routes.go:1259: INFO server config env=&#34;map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]&#34;
time=2025-02-07T16:00:18.551+08:00 level=INFO source=images.go:757 msg=&#34;total blobs: 11&#34;
time=2025-02-07T16:00:18.551+08:00 level=INFO source=images.go:764 msg=&#34;total unused blobs removed: 0&#34;
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in &#34;debug&#34; mode. Switch to &#34;release&#34; mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)
er.(*Server).GenerateRoutes.func1 (5 handlers)
......
time=2025-02-07T16:00:18.553+08:00 level=INFO source=routes.go:1339 msg=&#34;Dynamic LLM libraries&#34; runners=[metal]
time=2025-02-07T16:00:18.577+08:00 level=INFO source=types.go:131 msg=&#34;inference compute&#34; id=0 library=metal variant=&#34;&#34; compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)看到如上的信息，说明命令行本地ollama服务已开启。</p>
<br>
<h2 id="三在python中调用ollama中大模型">三、在Python中调用Ollama中大模型</h2>
<p>在Python中， 有很多第三方库，如langchain、langgraph、ollama， 都能调用Ollama内的模型。 这里以ollama库为例。</p>
<h3 id="31-启动ollama服务">3.1 启动Ollama服务</h3>
<p>在命令行<em><strong>cmd</strong></em> (mac对应terminal) 中启动本地服务。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><br>
<h3 id="32-安装">3.2 安装</h3>
<p>打开电脑命令行 <em><strong>cmd</strong></em> (mac是terminal)，确保网络已连接，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
#pip3 install ollama==0.2.1
</code></pre></div><br>
<h3 id="33-实验">3.3 实验</h3>
<p><em><strong>假设你是X先生的私人助理，负责X先生的行程安排。X先生一周后将去哈尔滨旅游，帮X先生设计一个哈尔滨一日游行程安排</strong></em>。</p>
<h4 id="331-qwen2515b">3.3.1 qwen2.5:1.5b</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#%%time #单次运行时间</span>
<span class="c1">#%%timeit #多次运行，求得平均运行时间</span>

<span class="kn">import</span> <span class="nn">ollama</span>
<span class="c1">#大邓的ollama版本为0.2.1</span>


<span class="n">content</span> <span class="o">=</span> <span class="s2">&#34;你是X先生的私人助理，负责X先生的行程安排。X先生一周后将去哈尔滨旅游，帮X先生设计一个哈尔滨一日游行程安排。&#34;</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="s1">&#39;qwen2.5:1.5b&#39;</span><span class="p">,</span>   <span class="c1">#选择模型</span>
                       <span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">content</span><span class="p">}])</span>


<span class="c1">#content2 = &#34;X先生一周后将去哈尔滨旅游，帮X先生设计一个哈尔滨一日游行程安排。&#34;</span>
<span class="c1">#response = ollama.chat(model = &#39;qwen2.5:1.5b&#39;,  #选择模型</span>
<span class="c1">#                       messages = [</span>
<span class="c1">#                         {&#39;role&#39;: &#39;system&#39;, &#39;content&#39;: &#34;你是X先生的私人助理，负责X先生的行程安排。&#34;},</span>
<span class="c1">#                         {&#39;role&#39;: &#39;user&#39;, &#39;content&#39;: content2}</span>
<span class="c1">#                       ]</span>
<span class="c1">#                      )</span>


<span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">我理解您是Qwen，但我作为AI模型，并没有实际经历、记忆或能力来为任何人策划旅行安排。不过，我可以提供一些建议性的建议帮助您设计这个行程。

1. 交通：根据您的具体位置和出发时间，您可以考虑哈尔滨的机场（哈尔滨太平国际机场）或者火车站的便利性。
2. 确定活动：哈尔滨是冰城，所以可以尝试观看滑冰、滑雪或雪地摩托等冰上活动。另外，您还可以参加冰雪节，体验东北特色的冰灯展览。当然，如果您喜欢风景的话，还可以在松花江畔散步，欣赏沿岸的风景。
3. 餐饮：哈尔滨的特色美食包括狗不理包子和东北锅盔。此外，您可以品尝到鲜美的冻梨和各种风味小吃。 
4. 购物：逛一逛哈啤街可以买到冰城特产和纪念品。

以上只是建议性的行程安排，您的实际旅行需要根据您的兴趣爱好、身体状况以及时间来确定。希望这些建议对您有所帮助！
好的，我将根据您的需求为您制定一份哈尔滨一日游的行程方案：

### 第1天：抵达哈尔滨

- **上午8:00**：从上海或您所在的城市出发前往哈尔滨国际机场。
- **下午2:30**：抵达哈尔滨，下飞机后换乘高速动车（约4小时）到达哈尔滨南站。
- **下午3:30**：到机场附近的酒店办理入住手续，稍作休息准备。

### 第2天：哈尔滨之旅

#### 上午：城市观光与探索

- **10:00**：前往中央大街。这里以其独特的建筑风格和文化氛围吸引着众多游客。
- **上午11:30**：参观东北虎林园，了解中国最北部的野生动物保护情况。
- **下午2:00**：漫步于圣索菲亚教堂附近的小巷，体验哈尔滨的老城区生活。
- **下午4:00**：在哈啤博物馆内探索哈尔滨啤酒的历史和制作工艺。

#### 下午：文化与美食

- **15:30**：参观哈尔滨冰雕艺术展览，欣赏世界级的冰雪艺术品。
- **下午6:00**：返回市区，享用正宗的东北大餐，比如狗不理包子、哈尔滨锅包肉等特色小吃。

#### 晚上：夜游与体验

- **18:00**：乘坐雪乡索道上山，探索世界最大的冰雪雕塑群。
- **晚上20:30**：回到市区，品尝地道的哈尔滨美食和市井风味小摊。
- **21:30**：结束今天的行程。

### 第3天：返程

#### 晚间：准备离店

- **15:00**：在酒店享用晚餐，并安排打包食物或行李。
- **16:30**：开始收拾行囊，准备出发离开哈尔滨。可能需要提前半小时抵达机场。

请注意，这个方案是基于一般情况下的旅游规划，实际行程可能会根据您的偏好和具体交通时间有所调整。希望这份行程能为您提供一个美好的哈尔滨旅行体验！
</code></pre></div><br>
<h4 id="332-deepseek-r115b">3.3.2 deepseek r1:1.5b</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#%%time #单次运行时间</span>
<span class="c1">#%%timeit #多次运行，求得平均运行时间</span>

<span class="kn">import</span> <span class="nn">ollama</span>
<span class="c1">#大邓的ollama版本为0.2.1</span>

<span class="n">content</span> <span class="o">=</span> <span class="s2">&#34;你是X先生的私人助理，负责X先生的行程安排。X先生一周后将去哈尔滨旅游，帮X先生设计一个哈尔滨一日游行程安排。&#34;</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="s1">&#39;deepseek-r1:1.5b&#39;</span><span class="p">,</span>   <span class="c1">#选择模型</span>
                       <span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">content</span><span class="p">}])</span>


<span class="c1">#content2 = &#34;X先生一周后将去哈尔滨旅游，帮X先生设计一个哈尔滨一日游行程安排。&#34;</span>
<span class="c1">#response = ollama.chat(model = &#39;deepseek-r1:1.5b&#39;,  #选择模型</span>
<span class="c1">#                       messages = [</span>
<span class="c1">#                         {&#39;role&#39;: &#39;system&#39;, &#39;content&#39;: &#34;你是X先生的私人助理，负责X先生的行程安排。&#34;},</span>
<span class="c1">#                         {&#39;role&#39;: &#39;user&#39;, &#39;content&#39;: content2}</span>
<span class="c1">#                       ]</span>
<span class="c1">#                      )</span>


<span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;think&gt;
好的，用户需要我设计哈尔滨一日游的安排。首先，我得考虑目标客户的需求是什么。是单日游还是多日？用户可能是一个游客，想要既能体验哈尔滨的魅力，又不想太累，所以时间控制在两三个小时比较合适。

接下来，我要确定 Polar Express作为活动的主要交通工具，因为它不仅风景优美，还能带来刺激感，比如乘坐 escalator，这样可以让游客感觉有 motion。然后，其他活动要安排得轻松愉快，比如游船、冰灯 etc.，这些都能让整个行程看起来充实而有变化。

时间安排方面，从12点到下午三点左右比较合适，因为中午的户外活动和下午的购物区可以好好玩一玩。每个时间段都要留足够的时间进行活动，确保行程紧凑且不累赘。

另外，还要注意注意事项，比如天气、门票等，特别是如果用户是儿童的话，得确保安全和适合的主题。最后，提醒用户根据自己的需求调整时间和内容，让行程更加个性化和有创意。
&lt;/think&gt;

嗯，你已经让我设计了一个详细的哈尔滨一日游安排了！让我们一步步来思考一下：

### 1. 时间框架
假设你计划在上午去游览景点，下午去夜游，晚上进行购物和品尝美食。

### 2. 活动安排建议：
   - **早上： Polar Express 呼吸机 ride（快速穿梭公园内）**
   - 每人乘坐 Polar Express前往公园，体验快速移动的刺激。
   
- **中午： 游船游湖**
   - 晚上11点前坐 boat 温泉，感受 Polar Bear 和 fish 的美丽。

- **下午： 傍晚： ice cream 环岛游（Polar Express 区）**
   - 跑在环城的路上购买冰棒和冰淇淋，享受夜景。

- **晚上： 餐饮**
   - 中午去餐馆点餐，晚餐去夜市品尝特色小吃。

### 3. 注意事项：
- 如果是儿童游玩，记得注意安全，选择容易摔倒的景点。
- 建议提前预订 Polar Express 的票，避免排队。

希望这个安排能满足你的需求！如果你有其他具体需求或偏好，请告诉我，我可以再调整哦！
</code></pre></div><br>
<br>
<h2 id="四性能评价">四、性能评价</h2>
<p><em><strong>qwen2.5</strong></em> 和 <em><strong>deepseek r1</strong></em> 都能很好地完成旅游规划任务。运行速度方面，<em><strong>qwen2.5</strong></em> 远快于 <em><strong>deepseek r1</strong></em>。本次实验中每段代码均运行7次，最终求得平均耗时：</p>
<ul>
<li><em><strong>qwen2.5:0.5b</strong></em> 平均耗时 <em><strong>1.43 s ± 746 ms</strong></em></li>
<li><em><strong>qwen2.5:1.5b</strong></em> 平均耗时 <em><strong>2.5 s ± 1.18 s</strong></em></li>
<li><em><strong>qwen2.5:3b</strong></em> 平均耗时 <em><strong>4.76 s ± 1.77 s</strong></em></li>
<li><em><strong>qwen2.5:7b</strong></em> 平均耗时 <em><strong>8.58 s ± 534 ms</strong></em></li>
<li><em><strong>deepseek r1:1.5b</strong></em> 平均耗时 <em><strong>8.71 s ± 1.66 s</strong></em></li>
<li><em><strong>deepseek r1:7b</strong></em> 平均耗时 <em><strong>21 s ± 4.39 s</strong></em></li>
</ul>
<p>如果追求速度，同样体量的模型(以1.5b为例)，目前首选 <em><strong>qwen2.5</strong></em> （qwen2.5:1.5b）。</p>
<p>各位可以结合自己任务， 电脑性能， 速度等不同需求， 选择对自己最合适的模型。</p>
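<p>如果想自己复现上面的耗时对比，可以用 <code>time.perf_counter</code> 写一个简单的计时函数(思路等同于 <code>%%timeit</code>)。以下是一个最小示意，其中 <code>dummy_task</code> 是为演示虚构的函数；实际使用时可换成调用 <code>ollama.chat</code> 的函数（需本地 Ollama 服务已启动、模型已拉取）。</p>

```python
import statistics
import time

def benchmark(func, n=7):
    """将 func 运行 n 次，返回平均耗时与标准差(单位: 秒)"""
    costs = []
    for _ in range(n):
        start = time.perf_counter()
        func()
        costs.append(time.perf_counter() - start)
    return statistics.mean(costs), statistics.stdev(costs)

# 演示用的虚构任务；实际可替换为 lambda: ollama.chat(model=..., messages=...)
def dummy_task():
    sum(range(100000))

mean_cost, std_cost = benchmark(dummy_task, n=7)
print(f'平均耗时 {mean_cost:.6f} s ± {std_cost:.6f} s')
```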
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型将文本数据转化为结构化数据</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/"><strong>实验 | 使用本地大模型从论文PDF中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2025-02-07-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>使用 Word2Vec 和 TF-IDF 计算五类企业文化</title>
      <link>https://textdata.cn/blog/2024-12-31-measure-corporate-culture-using-word2vec/</link>
      <pubDate>Tue, 31 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-12-31-measure-corporate-culture-using-word2vec/</guid>
      <description>我们使用最新的机器学习技术——**词嵌入模型**——和209,480份盈利电话会议记录创建了一本文化词典。我们对2001年至2018年期间的62,664个公司年度观察数据的**五个公司文化价值——创新、诚信、质量、尊重和团队合作**进行评分。结果表明，创新文化比公司创新的通常衡量标准——研发支出和专利数量——更广泛。此外，我们还表明，企业文化与业务结果相关，包括运营效率、风险承担、盈利管理、高管薪酬设计、企业价值和交易等，并且文化-绩效联系在困难时期更加显著。最后，我们提供了初步证据，表明企业文化受到重大公司事件（如合并和收购）的影响。</description>
      <content:encoded><![CDATA[<p><img loading="lazy" src="img/cover.png" alt=""  />
</p>
<p>Kai Li, Feng Mai, Rui Shen, Xinyan Yan, <strong>Measuring Corporate Culture Using Machine Learning</strong>, The Review of Financial Studies, 2020</p>
<p>摘要: 我们使用最新的机器学习技术——<strong>词嵌入模型</strong>——和209,480份盈利电话会议记录创建了一本文化词典。我们对2001年至2018年期间的62,664个公司年度观察数据的<strong>五个公司文化价值——创新、诚信、质量、尊重和团队合作</strong>进行评分。结果表明，创新文化比公司创新的通常衡量标准——研发支出和专利数量——更广泛。此外，我们还表明，企业文化与业务结果相关，包括运营效率、风险承担、盈利管理、高管薪酬设计、企业价值和交易等，并且文化-绩效联系在困难时期更加显著。最后，我们提供了初步证据，表明企业文化受到重大公司事件（如合并和收购）的影响。</p>
<h2 id="内容概况">内容概况</h2>
<p>今天分享的内容主要包括两部分， 即</p>
<ul>
<li>使用word2vec扩展得到五大类企业文化词典</li>
<li>使用TFIDF算法，结合五类文化词典对公司进行评分</li>
</ul>
<br>
<h2 id="算法步骤">算法步骤</h2>
<ol>
<li>构建种子词； 人工构建五类企业文化的种子词典， 每类词典人工准备5-10个词</li>
<li>word2vec扩充； 使用word2vec扩充五类企业文化词典的词汇量</li>
<li>tf-idf；将文本数据转为tf-idf格式</li>
<li>计算五类企业文化得分； 筛选含有文化词的列，按不同企业文化类别，分别求和得到得分。</li>
</ol>
<p><br><br></p>
<h2 id="一人工构建种子词典">一、人工构建种子词典</h2>
<p>词向量法程序会挖掘出原始数据中的所有词的词向量，这时候如果给词向量模型传入种子词，会根据语义空间的距离远近识别出多个近义词。</p>
<p>手工构建了五大类企业文化词典，存放在txt中，即</p>
<ul>
<li>data/w2v_seeds/innovation.txt</li>
<li>data/w2v_seeds/integrity.txt</li>
<li>data/w2v_seeds/quality.txt</li>
<li>data/w2v_seeds/respect.txt</li>
<li>data/w2v_seeds/teamwork.txt</li>
</ul>
<p>注意，在txt中，每行一个词语。原始语料</p>
<ul>
<li>data/w2v_corpus.txt</li>
</ul>
<p><br><br></p>
<h2 id="二-词向量法扩展词典">二、 词向量法扩展词典</h2>
<p>论文使用 gensim 的 word2vec 算法扩充企业文化词典。但原代码较复杂，对初学Python的小白而言，调试难度较大。大邓在这里将代码进行压缩和封装，只需要几行代码就能实现原作者几十行才能实现的词向量扩充功能。需要安装 gensim 库和 cntext 库</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#安装需要的包</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">gensim</span><span class="o">==</span><span class="mf">4.2.0</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">cntext</span><span class="o">==</span><span class="mf">1.9.2</span>
</code></pre></div><p><img loading="lazy" src="img/nltk.png" alt=""  />
</p>
<p>注意，下方代码运行可能会出现nltk_data问题，解决办法参考视频  <a href="https://www.bilibili.com/video/BV14A411i7DB/">https://www.bilibili.com/video/BV14A411i7DB/</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1">#初始化模型,需要设置lang参数。</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">W2VModels</span><span class="p">(</span><span class="n">cwd</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">(),</span> 
                     <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;english&#39;</span><span class="p">)</span>  <span class="c1">#语料数据 w2v_corpus.txt</span>

<span class="c1">#训练词向量模型</span>
<span class="n">model</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">input_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_corpus.txt&#39;</span><span class="p">)</span>


<span class="c1">#根据种子词和训练好的词向量模型，筛选出每类词最相近的前100个词</span>
<span class="n">model</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">seedword_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_seeds/integrity.txt&#39;</span><span class="p">,</span> 
           <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">seedword_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_seeds/innovation.txt&#39;</span><span class="p">,</span> 
           <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">seedword_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_seeds/quality.txt&#39;</span><span class="p">,</span> 
           <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">seedword_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_seeds/respect.txt&#39;</span><span class="p">,</span> 
           <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">seedword_txt_file</span><span class="o">=</span><span class="s1">&#39;data/w2v_seeds/teamwork.txt&#39;</span><span class="p">,</span> 
           <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div><pre><code>Step 1/4:...Preprocess   corpus ...
Step 2/4:...Train  word2vec model
            used   42 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 46 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 46 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 46 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 46 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 46 s
</code></pre>
<p><br><br></p>
<h2 id="三-使用tf-idf有权重计算情感词典">三、 使用TF-IDF有权重计算情感词典</h2>
<p>一般的情感分析是无权重算法，即每个词语的权重都是1，只需要统计文本中<strong>某类概念词</strong>出现的多寡，就能确定<strong>该概念的得分</strong></p>
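<p>无权重算法用几行 Python 即可表达。下面是一个示意性小例子，词典与文本均为虚构：</p>

```python
# 无权重词典法：每个词权重均为1，概念得分 = 文本中概念词出现的次数
innovation_dict = ['innovation', 'innovative', 'technology']  # 虚构的概念词典

text = 'our innovation team builds innovative technology products'
tokens = text.split()

score = sum(1 for w in tokens if w in innovation_dict)
print(score)  # 3
```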
<p><img loading="lazy" src="img/tf.png" alt=""  />

<img loading="lazy" src="img/idf.png" alt=""  />

<img loading="lazy" src="img/tfidf.png" alt=""  />
</p>
<p>这篇论文使用的是tf-idf。我们可以将tf简单地理解为某词在文本中出现的次数，idf是该词的稀缺程度(少见多怪程度)。一般我们认为一个词语出现的次数越多，信息量越大。但有时候，稀缺性也是一种很重要的信息量。例如以下两类词</p>
<ul>
<li>A 的、它、呢、了&hellip;&hellip;</li>
<li>B 核能、战争、死&hellip;&hellip;</li>
</ul>
<p>A类词，几乎出现在中文所有的句子中，我们可以忽略掉这类词，不影响对句子语义的理解。而B类词很少出现在我们日常文本句子中，但一旦出现，直接影响句子的语义。所以只考虑TF词频的大小还不全面，我们还需要纳入稀缺性信息IDF。</p>
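<p>结合上面的公式，可以手算一个小例子体会 IDF 刻画的稀缺性(纯 Python 示意；这里采用朴素的 log(N/df) 形式，与 sklearn 默认的平滑公式略有差异)：</p>

```python
import math

# 三篇虚构的小文档(已分词)
docs = [['的', '核能', '发展'],
        ['的', '经济', '发展'],
        ['的', '文化']]

N = len(docs)  # 文档总数

def idf(word):
    """朴素 IDF = log(N / df)，df 为包含该词的文档数"""
    df = sum(1 for d in docs if word in d)
    return math.log(N / df)

print(idf('的'))    # 0.0，出现在所有文档中，几乎不携带信息
print(idf('核能'))  # 只出现在1篇文档中，稀缺、信息量大
```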
<h3 id="31-读入数据">3.1 读入数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 读企业文本数据</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">corporate_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/corporate_culture.xlsx&#39;</span><span class="p">)</span>
<span class="n">corporate_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>idx</th>
      <th>text</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>Thank you sir. Ladies and gentlemen, at this t...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>[OPERATOR INSTRUCTIONS]. Our first question is...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2</td>
      <td>Thank you, Mr. Gallagher. [Caller instructions...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3</td>
      <td>(OPERATOR INSTRUCTIONS.)  Ann Gillen, Lehman B...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>4</td>
      <td>(OPERATOR INSTRUCTIONS) John Harmon.    I have...</td>
    </tr>
  </tbody>
</table>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#查看记录数(企业文本数）</span>
<span class="nb">len</span><span class="p">(</span><span class="n">corporate_df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<pre><code>105
</code></pre>
<br>
<h3 id="32-读词典">3.2 读词典</h3>
<p>word2vec扩展后的企业文化五大类，需要人工检查，剔除不符合词典含义的词，留下可用的词语。<strong>这里我们假装已经人工检查过了</strong>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">read_dict</span><span class="p">(</span><span class="n">file</span><span class="p">):</span>
    <span class="n">words</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">words</span>

<span class="n">innovation_words</span> <span class="o">=</span> <span class="n">read_dict</span><span class="p">(</span><span class="s1">&#39;output/w2v_candi_words/innovation.txt&#39;</span><span class="p">)</span>
<span class="n">integrity_words</span> <span class="o">=</span> <span class="n">read_dict</span><span class="p">(</span><span class="s1">&#39;output/w2v_candi_words/integrity.txt&#39;</span><span class="p">)</span>
<span class="n">quality_words</span> <span class="o">=</span> <span class="n">read_dict</span><span class="p">(</span><span class="s1">&#39;output/w2v_candi_words/quality.txt&#39;</span><span class="p">)</span>
<span class="n">respect_words</span> <span class="o">=</span> <span class="n">read_dict</span><span class="p">(</span><span class="s1">&#39;output/w2v_candi_words/respect.txt&#39;</span><span class="p">)</span>
<span class="n">teamwork_words</span> <span class="o">=</span> <span class="n">read_dict</span><span class="p">(</span><span class="s1">&#39;output/w2v_candi_words/teamwork.txt&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="n">respect_words</span><span class="p">)</span>
</code></pre></div><pre><code>['respectful', 'talent', 'talented', 'employee', 'dignity', 'empowerment', 'empower', 'skills', 'ibos', 'hr', 'salespeople', 'designers', 'creative', 'organizations', 'dedicated', 'backbone', 'abilities', 'missions', 'engine', 'tools', 'training', 'tackling', 'resource', 'adapting', 'interface', 'selecting', 'functions', 'expertise', 'cryocooler', 'sdk', 'affiliated', 'computers', 'departments', 'awareness', 'logistical', 'in-house', 'associate', 'optimization', 'functioning', 'outsource', 'organized', 'dedicate', 'outbound', 'pride', 'organization', 'referral', 'contacts', 'culture', 'motor', 'coordination', 'financially', 'onsite', 'web-based', 'functionality', 'wholesalers', 'provider', 'telesales', 'professionally', 'dealers', 'managers', 'involves', 'backhaul', 'crm', 'beefing', 'rf', 'computer', 'outreach', 'branding', 'appealing', 'networks', 'knowledge', 'electrical', 'industry-leading', 'providers', 'desires', 'guests', 'managerial', 'enhanced', 'assigned', 'railroad', 'durability', 'individuals', 'co2', 'believes', 'long-standing', 'high-quality', 'third-party', 'systems', 'groups', 'party', 'connecting', 'community', 'complementary', 'practices', 'reputation', 'processes', 'merchandising', 'next-generation', 'bundles', 'refocus', 'infrastructure', 'physician', 'transportation', 'aircraft', 'responsiveness', 'trained', 'full-time']
</code></pre>
<br>
<h3 id="33-生成每条记录的tfidf值">3.3 生成每条记录的tfidf值</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>


<span class="k">def</span> <span class="nf">createDTM</span><span class="p">(</span><span class="n">corpus</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;构建文档词语矩阵&#34;&#34;&#34;</span>
    <span class="n">vectorize</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">()</span>
    <span class="c1">#注意fit_transform相当于fit之后又transform。</span>
    <span class="n">dtm</span> <span class="o">=</span> <span class="n">vectorize</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">corpus</span><span class="p">)</span>
    <span class="c1">#vectorize.fit(corpus)</span>
    <span class="c1">#dtm  = vectorize.transform(corpus) </span>
    <span class="c1">#打印dtm</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> 
                        <span class="n">columns</span><span class="o">=</span><span class="n">vectorize</span><span class="o">.</span><span class="n">get_feature_names_out</span><span class="p">())</span> 

<span class="n">corporate_tfidf_df</span> <span class="o">=</span> <span class="n">createDTM</span><span class="p">(</span><span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">])</span>
<span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/tfidf-df5.png" alt=""  />
</p>
<br>
<h3 id="34-更新五大类词典">3.4 更新五大类词典</h3>
<p>企业文化词典中的词，并不是都出现在 corporate_tfidf_df 的列中。为避免列操作出错，需要重新筛选更新文化词典。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">Innovation_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">innovation_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">Integrity_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">integrity_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">Quality_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">quality_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">Respect_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">respect_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">Teamwork_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">teamwork_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">corporate_tfidf_df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
</code></pre></div><br>
<h3 id="35-计算不同类别得分">3.5 计算不同类别得分</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">],</span>
        <span class="s1">&#39;Innovation&#39;</span><span class="p">:</span> <span class="n">corporate_tfidf_df</span><span class="p">[</span><span class="n">Innovation_words</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="s1">&#39;Integrity&#39;</span><span class="p">:</span><span class="n">corporate_tfidf_df</span><span class="p">[</span><span class="n">Integrity_words</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="s1">&#39;Quality&#39;</span><span class="p">:</span><span class="n">corporate_tfidf_df</span><span class="p">[</span><span class="n">Quality_words</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="s1">&#39;Respect&#39;</span><span class="p">:</span><span class="n">corporate_tfidf_df</span><span class="p">[</span><span class="n">Respect_words</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="s1">&#39;Teamwork&#39;</span><span class="p">:</span><span class="n">corporate_tfidf_df</span><span class="p">[</span><span class="n">Teamwork_words</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="p">}</span>

<span class="n">CultureResultDf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">CultureResultDf</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output.png" alt=""  />
</p>
<br>
<h3 id="36-保存结果">3.6 保存结果</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">CultureResultDf</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;output/企业五大类文化tfidf有权重计算.csv&#39;</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>如何用图嵌入(网络思维和嵌入思维)表征企业，表征高管的职业经历</title>
      <link>https://textdata.cn/blog/2024-12-31-the-experience-of-ceo-to-vector-with-graphe-embeddings/</link>
      <pubDate>Tue, 31 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-12-31-the-experience-of-ceo-to-vector-with-graphe-embeddings/</guid>
      <description>管理的本质是一种实践，在某些情形下，阅历比简历更重要，丰富的职业经历有助于企业高管形成多元化的思维结构、广阔的管理视野、丰富的社会资源和过人的胆识。因此，对于企业而言，了解高管的职业经历非常重要，这可以帮助企业更好地了解高管的背景和潜力，从而更好地为企业的发展提供支持。而研究高管的个人特质，已有的研究，主要从年龄、性别、学历等类别型变量开展研究，即使从从职业经历研究，也是作为离散变量，没有充分挖掘职业经历的信息。</description>
      <content:encoded><![CDATA[<p>今天分享的内容主要是 <strong>如何用图嵌入(网络思维和嵌入思维)表征企业，表征高管的职业经历</strong>。</p>
<p><br><br></p>
<h2 id="一高管职业经历">一、高管职业经历</h2>
<p>管理的本质是一种实践，在某些情形下，阅历比简历更重要，丰富的职业经历有助于企业高管形成多元化的思维结构、广阔的管理视野、丰富的社会资源和过人的胆识。因此，对于企业而言，了解高管的职业经历非常重要，这可以帮助企业更好地了解高管的背景和潜力，从而更好地为企业的发展提供支持。</p>
<p>而研究高管的个人特质，已有的研究主要从年龄、性别、学历等类别型变量开展，即使研究职业经历，也是将其作为离散变量，没有充分挖掘职业经历中的信息。</p>
<p><img loading="lazy" src="img/graph-embeddings-numerical.jpg" alt=""  />
</p>
<p>本技术文的创新价值</p>
<ol>
<li><strong>网络思维</strong>；<br>
节点、边。企业是节点，高管一般有多个企业的就职经历，职业经历中的任意两家企业之间可以构建一条边。</li>
<li><strong>嵌入思维</strong>；<br>
把企业网络中的节点转化为向量表示，企业是 n 维向量，高管也是同样的 n 维向量，类似于用阴阳五行思维表征世间万物。</li>
<li><strong>向量计算</strong>；<br>
事物都用 n 维向量表征后，就可以在高管和企业之间进行向量计算。
<ul>
<li><strong>Kmeans聚类</strong>：将企业向量或高管职业经历向量进行聚类，可以得到理想的&quot;分类&quot;标签(人工解读后的 cluster 编号就是很好的分类标签)。</li>
<li><strong>相似度计算</strong>：可以度量招聘来的高管给该企业带来的是相似性还是异质性；某企业高管团队的异质性也可以被度量为一个准确的数字。</li>
</ul>
</li>
</ol>
<p><img loading="lazy" src="img/vector_similarity.png" alt=""  />
</p>
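<p>上文提到的「相似度计算」通常用余弦相似度实现。下面给出一个最小示意（这里用手造的示例向量演示，实际中应替换为后文训练得到的企业向量或高管职业经历向量）：</p>

```python
import numpy as np

# 余弦相似度: 度量两个 n 维向量的方向相似程度, 取值范围 [-1, 1]
def cosine_sim(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 手造的示例向量(假设数据)
print(round(cosine_sim([1, 0, 1], [1, 1, 0]), 4))  # 0.5
print(round(cosine_sim([1, 2, 3], [2, 4, 6]), 4))  # 1.0, 同方向向量相似度为 1
```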
<p><br><br></p>
<h2 id="二职业经历聚类算法步骤">二、职业经历聚类算法步骤</h2>
<p>首先，我们需要构建一个企业的网络图，其中每个节点代表一个企业，每条边代表两个企业之间的关系，比如同城市、同行业， 而<strong>高管职业经历中的任意两家企业天然也是一个边</strong>。然后，我们使用node2vec算法对这个网络图进行嵌入，得到每个企业的向量表示。</p>
<p>接着我们需要对每个高管的职业经历进行处理。一般来说，一个高管的职业经历可能涉及多个企业，因此我们需要将这些企业的向量求平均，得到一个高管的职业经历向量。这个向量可以代表高管的职业经历背景，可以用于后续的各类分析，比如聚类。</p>
<p>最后我们可以使用聚类算法（比如k-means算法）对高管职业经历向量进行聚类分析， 具有相似职业经历的高管分为一组。</p>
<p><br><br></p>
<h2 id="三数据导入">三、数据导入</h2>
<p>高管的职业经历信息一般存在于个人简历中，而简历中的职业经历往往没有规律，多个企业名之间也没有明显的分隔符，提取起来比较复杂，因此需要用正则表达式、命名实体识别等技术手段将企业名识别出来，以便后续的处理和分析。</p>
<p><strong>首先，我们可以使用正则表达式来匹配企业名</strong>。正则表达式是一种用于匹配文本的工具，可以根据一定的规则来匹配出所需要的信息。在这里，我们可以使用正则表达式来匹配一些常见的企业名词，比如“有限公司”、“股份有限公司”等。这样可以一定程度上识别出企业名，但对于一些特殊的企业名还需要进行进一步处理。</p>
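<p>一个基于企业后缀的粗提取示意如下（切分词表和后缀表均为假设示例，实际需按语料扩充）：</p>

```python
import re

# 示意: 先按标点和常见动词短语切分, 再保留以企业后缀结尾的片段
resume = '曾任职于华为技术有限公司,后加入腾讯科技股份有限公司'
tokens = re.split(r'[,，。;；]|曾任职于|后加入|担任', resume)
companies = [t for t in tokens if re.search(r'(?:股份有限公司|有限公司|集团)$', t)]
print(companies)  # ['华为技术有限公司', '腾讯科技股份有限公司']
```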
<p><strong>其次，我们可以使用命名实体识别（NER）技术来识别企业名</strong>。命名实体识别是自然语言处理的一项任务，旨在识别文本中的命名实体，比如人名、地名、组织名等。在这里，我们可以使用NER技术来识别文本中的企业名，从而更准确地提取出高管的职业经历信息。</p>
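<p>真实场景可使用 HanLP、LTP 等成熟的中文 NER 工具；下面用一个基于词表（gazetteer）的最小替代来示意识别流程，词表与文本均为假设数据：</p>

```python
# 词表式实体匹配: 仅作 NER 的最小替代示意, 实际应换成成熟的中文 NER 工具
known_orgs = ['华为', '腾讯', '阿里巴巴', '百度', '谷歌']

def find_orgs(text, orgs):
    # 按在文本中首次出现的先后顺序返回命中的企业名
    hits = [org for org in orgs if org in text]
    hits.sort(key=text.index)
    return hits

print(find_orgs('他先在华为负责研发,后任阿里巴巴副总裁,现就职于腾讯', known_orgs))
# ['华为', '阿里巴巴', '腾讯']
```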
<p><strong>最后，我们需要对提取出的企业名进行去重和整理</strong>。在提取出的企业名中，有一些是重复的，有一些是不必要的，需要将它们去掉。同时，对于每个高管的职业经历信息，我们需要将提取出的企业名整理成一个有序的序列，方便后续的处理和分析。</p>
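<p>去重同时保持出现顺序，可以用 <code>dict.fromkeys</code>（直接用 <code>set</code> 会打乱顺序）。一个小示例：</p>

```python
# 对提取出的企业名去重, 并保持首次出现的顺序
names = ['华为', '腾讯', '华为', '百度', '腾讯']
ordered_unique = list(dict.fromkeys(names))
print(ordered_unique)  # ['华为', '腾讯', '百度']
```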
<p>总之，对于高管职业经历信息的提取，需要用到一些技术手段，包括正则表达式、命名实体识别等。这些技术手段可以帮助我们更准确地提取出高管的职业经历信息，为后续的分析和处理提供基础。<strong>但限于篇幅和主题，今天使用虚构的50条高管记录数据做实验。</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;high_executives.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>姓名</th>
      <th>性别</th>
      <th>出生年份</th>
      <th>学历</th>
      <th>职业经历</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>高管1</td>
      <td>女</td>
      <td>2000</td>
      <td>本科</td>
      <td>华为-CFO,Facebook-CFO,Facebook-CEO</td>
    </tr>
    <tr>
      <th>1</th>
      <td>高管2</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>百度-CEO,阿里巴巴-CFO,亚马逊-COO</td>
    </tr>
    <tr>
      <th>2</th>
      <td>高管3</td>
      <td>女</td>
      <td>1992</td>
      <td>博士</td>
      <td>谷歌-CTO,腾讯-COO,百度-COO</td>
    </tr>
    <tr>
      <th>3</th>
      <td>高管4</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>IBM-COO,苹果-CFO,微软-COO</td>
    </tr>
    <tr>
      <th>4</th>
      <td>高管5</td>
      <td>男</td>
      <td>1960</td>
      <td>本科</td>
      <td>谷歌-COO,苹果-CFO,百度-COO</td>
    </tr>
  </tbody>
</table>
</div>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#记录数</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><pre><code>50
</code></pre>
<br>
<br>
<h2 id="四训练node2vec模型">四、训练Node2Vec模型</h2>
<h3 id="41-node2vec算法">4.1 node2vec算法</h3>
<p>Node2Vec是一种基于深度学习的图嵌入算法，旨在将图中的节点映射到低维向量空间中，从而方便后续的分析和处理。具体来说，Node2Vec算法可以将节点的局部邻域结构转化为向量表示，同时保留节点之间的全局结构信息。熟悉Word2Vec的同学理解起来会比较容易：Node2Vec是基于Word2Vec算法开发出来的，将职业经历中的每个企业看做词语，训练得到企业的向量表示。</p>
<p><img loading="lazy" src="img/node2vec.png" alt=""  />
</p>
<p>在Node2Vec算法中，每个节点的向量表示由两个部分组成：一个是节点自身的特征向量，另一个是节点在不同邻域结构下的向量表示。算法的核心思想是通过两个步骤来生成节点的向量表示：</p>
<ul>
<li>随机游走：对于每个节点，从它的邻居节点中随机选择一个节点进行访问，然后在这个节点的邻居中进行同样的随机选择。这个过程可以生成一系列的节点序列，其中每个序列都代表了一个从起始节点出发的随机游走路径。</li>
<li>Skip-gram模型：基于这些随机游走路径，使用Skip-gram模型进行向量表示的学习。Skip-gram模型是一种常见的自然语言处理模型，用于学习词向量。在Node2Vec算法中，可以将节点序列看作“句子”，将每个节点看作“词”，然后使用Skip-gram模型来学习节点向量的表示。</li>
</ul>
<p>Node2Vec算法通过随机游走和Skip-gram模型的结合，可以生成具有丰富语义信息的节点向量，同时保留了节点之间的全局结构信息。这种算法可以应用于各种不同的图结构，包括社交网络、知识图谱、生物信息学等领域，具有广泛的应用前景。</p>
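<p>上面的随机游走步骤可以用一段极简代码来示意（相当于 node2vec 中 p=q=1 的均匀随机游走特例，图结构为假设数据）：</p>

```python
import random

# 极简均匀随机游走: 从每个节点出发, 反复随机走向邻居, 生成可喂给 Skip-gram 的"句子"
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'C'],
    'C': ['A', 'B', 'D'],
    'D': ['C'],
}

def random_walk(graph, start, walk_length, rng):
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)
# 每个节点出发走 3 条长度为 5 的路径
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(3)]
print(len(walks))  # 12
```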
<p><strong>在高管职业经历数据的应用中，我们可以将每个企业看作图中的一个节点，然后使用Node2Vec算法来训练企业向量模型</strong>。这样可以将企业的职业经历信息转化为向量表示，方便后续的分析和处理。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="nn">nx</span>
<span class="kn">from</span> <span class="nn">node2vec</span> <span class="kn">import</span> <span class="n">Node2Vec</span>


<span class="c1"># 读取CSV文件并提取职业经历中的公司名列表</span>
<span class="n">companies</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;IBM&#39;</span><span class="p">,</span> <span class="s1">&#39;谷歌&#39;</span><span class="p">,</span> <span class="s1">&#39;Facebook&#39;</span><span class="p">,</span> <span class="s1">&#39;苹果&#39;</span><span class="p">,</span> <span class="s1">&#39;微软&#39;</span><span class="p">,</span> <span class="s1">&#39;亚马逊&#39;</span><span class="p">,</span> <span class="s1">&#39;阿里巴巴&#39;</span><span class="p">,</span> <span class="s1">&#39;腾讯&#39;</span><span class="p">,</span> <span class="s1">&#39;百度&#39;</span><span class="p">,</span> <span class="s1">&#39;华为&#39;</span><span class="p">]</span>
<span class="n">companies_regex</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">&#39;|&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">companies</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;companies&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;职业经历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">companies_regex</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span>

<span class="c1"># 构建公司名之间的边</span>
<span class="n">G</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="k">for</span> <span class="n">companies</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;companies&#39;</span><span class="p">]:</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">company1</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">companies</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">company2</span> <span class="ow">in</span> <span class="n">companies</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">:]:</span>
            <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="n">company1</span><span class="p">,</span> <span class="n">company2</span><span class="p">)</span>

<span class="c1"># 使用node2vec库生成公司名向量</span>
<span class="c1"># 企业向量维度 16</span>
<span class="n">node2vec</span> <span class="o">=</span> <span class="n">Node2Vec</span><span class="p">(</span><span class="n">G</span><span class="p">,</span> <span class="n">dimensions</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">walk_length</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">num_walks</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">node2vec</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">window</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">min_count</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># model.wv[company] 查询某个企业的向量</span>
<span class="n">vectors</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="o">.</span><span class="n">wv</span><span class="p">[</span><span class="n">company</span><span class="p">]</span> <span class="k">for</span> <span class="n">company</span> <span class="ow">in</span> <span class="n">G</span><span class="o">.</span><span class="n">nodes</span><span class="p">()]</span>
</code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    Computing transition probabilities:   0%|          | 0/10 [00:00&lt;?, ?it/s]
    Generating walks (CPU: 1): 100%|████████████| 100/100 [00:00&lt;00:00, 3920.94it/s]
</code></pre></div><br>
<h3 id="42-理解嵌入">4.2 理解嵌入</h3>
<p>本文嵌入的维度是16维，其实用任意 n 维都可以嵌入表示一个事物。这里理解起来比较抽象，咱们借助熟悉的事情来理解嵌入维度的设定。在中国传统文化中，经常使用 n 个维度来刻画、描述、表征事物，例如</p>
<ul>
<li>2维， 阴阳思维去描述事物的阴阳</li>
<li>5维， 五行，金木水火土描述事物</li>
</ul>
<p>而在本技术中， <code>Node2Vec(G, dimensions=16, walk_length=10, num_walks=100)</code> 中 dimensions=16 即用16维表征每个企业，得到16维的企业向量。需要注意的是，我们可能无法理解这16维中任意一个维度的含义；即便设置成2维、5维或其他维数，也同样无法解释。原因在于：中国传统文化是先定义好每个维度的含义（如阴阳、五行），再去表征事物；而这里我们只是先设定了维度数，让算法自动学习表示，所以各维度的含义是未知的。</p>
<p><br><br></p>
<h2 id="五计算高管职业经历向量">五、计算高管职业经历向量</h2>
<p>定义一个 <strong>companys2vec(companys)</strong> 函数，该函数可以把多家企业的就职经历转为「职业经历向量」。</p>
<p>将高管的职业经历转化为向量：每家企业是一个 16 维向量，最简单直接的办法是对多个就职企业的向量求平均，得到的均值向量也是 16 维。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">companys2vec</span><span class="p">(</span><span class="n">companys</span><span class="p">):</span>
    <span class="n">cvs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">company</span> <span class="ow">in</span> <span class="n">companys</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">cvs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">wv</span><span class="p">[</span><span class="n">company</span><span class="p">])</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">pass</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">cvs</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>


<span class="n">companys2vec</span><span class="p">(</span><span class="n">companys</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;华为&#39;</span><span class="p">,</span> <span class="s1">&#39;Facebook&#39;</span><span class="p">,</span> <span class="s1">&#39;Facebook&#39;</span><span class="p">])</span>
</code></pre></div><pre><code>array([-0.03054439,  0.28417936,  0.23475488,  0.36075735, -0.16633254,
       -0.06266979,  0.74403137,  0.16226356,  0.01991086, -0.15565623,
        0.34757233,  0.3079434 ,  0.19876878,  0.01175458, -0.55069256,
        0.01623819], dtype=float32)
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#批量计算，得到高管的职业经历向量 career_vec</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;career_vec&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;companies&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">companys2vec</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>姓名</th>
      <th>性别</th>
      <th>出生年份</th>
      <th>学历</th>
      <th>职业经历</th>
      <th>companies</th>
      <th>career_vec</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>高管1</td>
      <td>女</td>
      <td>2000</td>
      <td>本科</td>
      <td>华为-CFO,Facebook-CFO,Facebook-CEO</td>
      <td>[华为, Facebook, Facebook]</td>
      <td>[-0.03054439, 0.28417936, 0.23475488, 0.360757...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>高管2</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>百度-CEO,阿里巴巴-CFO,亚马逊-COO</td>
      <td>[百度, 阿里巴巴, 亚马逊]</td>
      <td>[-0.007476224, 0.2823672, 0.20479268, 0.281800...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>高管3</td>
      <td>女</td>
      <td>1992</td>
      <td>博士</td>
      <td>谷歌-CTO,腾讯-COO,百度-COO</td>
      <td>[谷歌, 腾讯, 百度]</td>
      <td>[-0.040801946, 0.27067217, 0.20773596, 0.34468...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>高管4</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>IBM-COO,苹果-CFO,微软-COO</td>
      <td>[IBM, 苹果, 微软]</td>
      <td>[-0.022640707, 0.31426176, 0.1841877, 0.323161...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>高管5</td>
      <td>男</td>
      <td>1960</td>
      <td>本科</td>
      <td>谷歌-COO,苹果-CFO,百度-COO</td>
      <td>[谷歌, 苹果, 百度]</td>
      <td>[-0.030689942, 0.25639233, 0.19499178, 0.33937...</td>
    </tr>
  </tbody>
</table>
</div>
<p><br><br></p>
<h2 id="六高管职业经历聚类kmeans">六、高管职业经历聚类Kmeans</h2>
<p>Kmeans是一种常见的聚类算法，旨在将相似的数据点分组为同一类别，从而发现数据的内在结构。Kmeans算法的优点是简单易实现，对于大型数据集也具有较高的效率。它可以适用于各种类型的数据，包括数值型、文本型和图像型等数据。同时，Kmeans算法可以通过调整簇的个数来控制聚类结果的细粒度程度，比如选择较大的簇个数可以得到更细致的聚类结果。</p>
<p><img loading="lazy" src="img/kmeans.png" alt=""  />
</p>
<p><strong>Kmeans算法的缺点是需要预先指定簇的个数k，这个参数选得过大或过小都可能导致聚类结果不理想</strong>。同时，Kmeans算法对离群点比较敏感，可能会导致簇中心偏离聚类的本质结构。如果对数据不了解，需要使用手肘法则等方式确定K。 <strong>这里假设我们对数据很了解，那么可以指定K=5</strong>。</p>
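<p>若不确定 K，手肘法则的做法是观察 inertia（簇内平方和）随 K 的变化，在下降明显变缓的&quot;拐点&quot;处取 K。下面是一个示意（此处用随机生成的 16 维模拟向量代替前文的职业经历向量）：</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# 手肘法则: inertia 随 K 增大而下降, 拐点附近即较优的 K
rng = np.random.default_rng(0)
# 构造 3 个中心的 16 维模拟数据(假设数据, 实际应使用 career_vec)
vectors = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 16)) for c in (0, 3, 6)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 2))
```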
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>

<span class="c1"># 对高管的职业经历向量进行聚类分析</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">kmeans</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;career_vec&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">to_list</span><span class="p">()))</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;cluster&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">kmeans</span><span class="o">.</span><span class="n">labels_</span>

<span class="c1"># 保存结果为CSV文件</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;high_executives_with_clusters.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8-sig&#39;</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="最后-结果呈现">最后， 结果呈现</h2>
<p><strong>注意 cluster 本身只是没有含义的数字标签，但不同数字背后代表着丰富的信息，例如职能、行业、地域、晋升路径等，需要我们「人工解读」理解每个数字对应的含义</strong>。 这里摘抄一下 <strong>何瑛,于文蕾,戴逸驰,王砚羽.高管职业经历与企业创新[J].管理世界,2019,35(11):174-192.</strong> 内的内容，该文章没有使用向量表示，但是可以提供理解cluster的角度。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">在 CEO 职业经历的分类上，现有文献的研究视角各有不同。 

- 在「职能方面」，比较受认可的是 Hambrick 和 Mason（1984）所提出的 3 部门分类法，即产出型职能（营销与研发）、生产型职能（过程管理、设备管理与会计） 和外围型职能（法律与融资），之后 Abebe 等（2010）较多学者沿用这种分类；

- 在「行业方面」，Crossland 等（2014）将 其区分为能源、材料、工业、非必需消费品、日用消费品、保健、金融、信息技术、电信服务和公用设施等 10 种类 型；在组织机构方面，Hu 和 Liu（2015）分为生产性组织（如企业）、非生产性组织（如大学）以及行政或政府组织 3 类；

- 在「地域类型」方面，Schmid 和 Wurster（2017）根据是否具有国际工作经历分为两类；

- 在「工作背景」方面，Fan 等 （2007）和 Benmelech 和 Frydman（2015）分别根据是否具有从政经历和从军经历进行区分；

- 另外在「晋升路径」上， Brockman 等（2019）则区分了内部提拔与外部聘请两种类型。 

总的来说，上述有关管理者职业经历的文献大多 集中于研究管理者职业经历的某一方面，对复合型职业经历进行整合研究的文献十分罕见。 事实上，不同方面的职业经历之间往往存在某种联系，相互作用最终塑造了独特的管理风格（Kaplan et al.，2008）。
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;high_executives_with_clusters.csv&#39;</span><span class="p">)</span>
<span class="n">df2</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
<pre><code>.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</code></pre>
<p></style></p>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>姓名</th>
      <th>性别</th>
      <th>出生年份</th>
      <th>学历</th>
      <th>职业经历</th>
      <th>companies</th>
      <th>career_vec</th>
      <th>cluster</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>高管1</td>
      <td>女</td>
      <td>2000</td>
      <td>本科</td>
      <td>华为-CFO,Facebook-CFO,Facebook-CEO</td>
      <td>['华为', 'Facebook', 'Facebook']</td>
      <td>[-0.03054439  0.28417936  0.23475488  0.360757...</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>高管2</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>百度-CEO,阿里巴巴-CFO,亚马逊-COO</td>
      <td>['百度', '阿里巴巴', '亚马逊']</td>
      <td>[-0.00747622  0.2823672   0.20479268  0.281800...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>2</th>
      <td>高管3</td>
      <td>女</td>
      <td>1992</td>
      <td>博士</td>
      <td>谷歌-CTO,腾讯-COO,百度-COO</td>
      <td>['谷歌', '腾讯', '百度']</td>
      <td>[-0.04080195  0.27067217  0.20773596  0.344685...</td>
      <td>3</td>
    </tr>
    <tr>
      <th>3</th>
      <td>高管4</td>
      <td>女</td>
      <td>1989</td>
      <td>本科</td>
      <td>IBM-COO,苹果-CFO,微软-COO</td>
      <td>['IBM', '苹果', '微软']</td>
      <td>[-2.2640707e-02  3.1426176e-01  1.8418770e-01 ...</td>
      <td>2</td>
    </tr>
    <tr>
      <th>4</th>
      <td>高管5</td>
      <td>男</td>
      <td>1960</td>
      <td>本科</td>
      <td>谷歌-COO,苹果-CFO,百度-COO</td>
      <td>['谷歌', '苹果', '百度']</td>
      <td>[-0.03068994  0.25639233  0.19499178  0.339379...</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
</div>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="o">.</span><span class="n">cluster</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><pre><code>2    14
3    13
4     9
0     7
1     7
Name: cluster, dtype: int64
</code></pre>
<p><br><br></p>
<h2 id="代码获取">代码获取</h2>
<p>链接: <a href="https://pan.baidu.com/s/1pZQj5_s2sv5LYZ-EerJD1Q">https://pan.baidu.com/s/1pZQj5_s2sv5LYZ-EerJD1Q</a> 提取码: 6m7v</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</title>
      <link>https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/</link>
      <pubDate>Mon, 05 Aug 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/</guid>
      <description>大邓是一个技术博主，运营着公众号，每天要消耗大量的时间进行选题、创作、编辑。随着LLM的流行， 能否让LLM替我进行选题、创作、编辑，从此进入躺平式人生新阶段。  这不是做梦， 使用软件Ollama、Python的CrewAI库，设计好智能体(AI Agent)，就能实现大邓的白日梦。In technical terms an AI Agent is a software entity designed to perform tasks autonomously or semi-autonomously on behalf of a user or another program. These agents leverage artificial intelligence to make decisions, take actions, and interact with their environment or other systems.</description>
      <content:encoded><![CDATA[<p>大邓是一个技术博主，运营着公众号，每天要消耗大量的时间进行选题、创作、编辑。随着LLM的流行， 能否让LLM替我进行选题、创作、编辑，从此进入躺平式人生新阶段。  这不是做梦， 使用软件Ollama、Python的CrewAI库，设计好智能体(AI Agent)，就能实现大邓的白日梦。</p>
<br>
<p><img loading="lazy" src="img/01-multiagent-worflow.png" alt="Multiagent Workflow using CrewAI and Ollama"  />
</p>
<h2 id="一什么是智能体ai-agent">一、什么是智能体(AI Agent)?</h2>
<p>从技术角度来说，<strong>智能体(AI Agent)</strong>是一种软件实体，旨在代表用户或其他程序自主或半自主地执行任务。这些代理利用人工智能做出决策、采取行动并与环境或其他系统进行交互。智能体的主要特征有：</p>
<ol>
<li><strong>自治</strong>：智能体无需人工干预即可运行。一旦被赋予目标，它们就可以独立执行任务。</li>
<li><strong>决策</strong>：智能体使用算法、规则和人工智能模型， 根据自己的感知和目标做出决策。这包括评估不同的选择并选择最佳行动方案。</li>
<li><strong>学习</strong>：许多智能体采用机器学习技术来提高其性能。它们可以从过去的经验中学习并适应新情况。</li>
<li><strong>交互</strong>：智能体可以与用户、其他智能体或系统进行通信和协作。这种交互可能涉及自然语言处理、发送和接收数据或执行协调任务。</li>
<li><strong>专业化</strong>：智能体可以专门用于特定任务或领域。例如，某些智能体可能专为网页浏览而设计，而其他智能体则可能处理数据库交互、执行复杂计算或生成图像。</li>
<li><strong>目标导向</strong>：智能体通常被设定有特定的目标或目的。它们通过一系列动作和决策来实现这些目标。</li>
</ol>
<p><img loading="lazy" src="img/landscape-latest.png" alt=""  />
</p>
<p>总之，智能体是强大的工具，可以自动化和增强广泛的活动，从简单的重复任务到复杂的问题解决场景，这使得它们在各种应用和行业中具有无价的价值。</p>
<p>想象一下，将上述所有概念整合在一起，共同朝着预先确定的目标努力，实现预期结果。这些任务可以按顺序或分层流程执行，所有智能体都像一个协调的团队一样工作。这种强大的协作可以彻底改变我们处理复杂问题的方式，使流程更高效，结果更有效。这就是 <em><strong>CrewAI框架</strong></em>发挥作用的地方。</p>
<p><br><br></p>
<h2 id="二ollama介绍配置">二、Ollama介绍&amp;配置</h2>
<p><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/">教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</a></p>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其模型库中获取各种 LLM，只需一个命令即可下载，下载后再执行一个命令即可开始使用。这对于主要在终端窗口中工作的用户非常有帮助：遇到问题时，无需切换到浏览器窗口即可获得答案。</p>
<br>
<h3 id="21-特点和优点">2.1 特点和优点</h3>
<p>以下是 Ollama 成为工具包中必备工具的原因：</p>
<ul>
<li><strong>简单</strong> ：Ollama 提供简单的设置过程，无需机器学习博士学位即可启动和运行。</li>
<li><strong>成本效益</strong> ：在本地运行模型意味着无需支付云服务成本，您的钱包会感谢您。</li>
<li><strong>隐私</strong> ：使用 Ollama，所有数据处理都在您的本地机器上进行，这对于用户隐私来说是一个巨大的胜利。</li>
<li><strong>多功能性</strong> ：Ollama 不只是为 Python 爱好者准备的，它的灵活性使其可以用于各种应用程序，包括 Web 开发。</li>
</ul>
<br>
<h3 id="22-安装ollama">2.2 安装ollama</h3>
<p>点击前往网站 <a href="https://ollama.com/">https://ollama.com/</a> ，下载ollama软件，支持win、Mac、linux</p>
<p><img loading="lazy" src="img/03-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="23-下载llm模型">2.3 下载LLM模型</h3>
<p>默认情况下，CrewAI 使用 OpenAI 的模型作为 LLM。在有经费、网络可用、且不担心数据泄露的条件下，若追求最佳性能，可考虑使用 GPT-4 或价格稍低的 GPT-3.5。</p>
<p>但本文是要 <strong>本地部署</strong>， 因此我们将使用 Meta Llama 3，这是迄今为止功能最强大的公开 LLM。Meta Llama 3 是 Meta Inc. 开发的模型系列，是最新推出的模型，具有 8B 和 70B 两种参数大小（预训练或指令调整）。Llama 3 指令调整模型针对对话/聊天用例进行了微调和优化，并且在常见基准测试中胜过许多可用的开源聊天模型。</p>
<p><img loading="lazy" src="img/04-llama3-performance.png" alt=""  />
</p>
<p><img loading="lazy" src="img/05-llama3-performance.png" alt=""  />
</p>
<br>
<p>打开Ollama模型页面 <em><strong><a href="https://ollama.com/library">https://ollama.com/library</a></strong></em>， 第一个就是 Meta 近期发布的 Llama3.1 模型。</p>
<p><img loading="lazy" src="img/06-ollama-model.png" alt=""  />
</p>
<br>
<p>以 llama3.1 为例，根据自己电脑的显存大小，选择适宜的版本。如果不知道选哪个，那就先试着安装，不合适再删除即可。</p>
<p><img loading="lazy" src="img/06-ollama-llama3.png" alt=""  />
</p>
<br>
<p>打开电脑命令行cmd(mac是terminal),  网络是连网状态，执行模型下载(安装)命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3.1:8b
</code></pre></div><p>等待 <strong>llama3.1:8b</strong> 下载完成。</p>
<br>
<h3 id="23-启动ollama服务">2.4 启动ollama服务</h3>
<p>ollama服务有两种启动方式，即鼠标启动ollama服务 和 命令行启动ollama服务 。<br></p>
<h4 id="231-鼠标启动ollama服务">2.4.1 鼠标启动ollama服务</h4>
<p>在电脑中找到ollama软件，双击打开，就开启了ollama本地服务。</p>
<br>
<h4 id="232-命令行启动ollama服务">2.4.2 命令行启动ollama服务</h4>
<p>在Python中调用本地ollama服务，需要先启动本地ollama服务， 打开电脑命令行cmd(mac是terminal), 执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>运行后可以看到类似如下的输出：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2024/06/14 14:52:24 routes.go:1011: INFO server config env=&#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&#34;total blobs: 18&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&#34;total unused blobs removed: 0&#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&#34;Listening on 127.0.0.1:11434 (version 0.1.44)&#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&#34;extracting embedded files&#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&#34;Dynamic LLM libraries [metal]&#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&#34;inference compute&#34; id=0 library=metal compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。</p>
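<p>也可以在 Python 中用标准库快速检查本地 ollama 服务是否就绪（默认监听 11434 端口；服务未启动时返回 False）：</p>

```python
from urllib import request, error

# 检查本地 ollama 服务是否已开启(服务正常时根端点返回 200)
def ollama_is_up(url='http://127.0.0.1:11434'):
    try:
        with request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        return False

print(ollama_is_up())
```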
<p><br><br></p>
<h2 id="三crewai框架介绍">三、CrewAI框架介绍</h2>
<p>CrewAI 是一个用于协调角色扮演、自主 AI 智能体的尖端框架。通过促进协作智能，CrewAI 使智能体能够无缝协作，解决复杂的任务。</p>
<br>
<h3 id="31-安装crew">3.1 安装CrewAI</h3>
<p>打开电脑命令行cmd(mac是terminal),  网络是连网状态，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install crewai
pip3 install langchain_openai
</code></pre></div><br>
<h3 id="32-crewai核心概念">3.2 CrewAI核心概念</h3>
<ol>
<li><em><strong>智能体(Agents)</strong></em>：这些是经过编程的独立单元，用于执行任务、做出决策和与其他代理进行通信。它们可以使用的 <em><strong>工具Tools</strong></em> 可以是简单的搜索功能，也可以是涉及其他链、API 等的复杂集成。</li>
<li><em><strong>任务(Tasks)</strong></em>：任务是智能体需要完成的具体工作。它们可以包含其他信息，例如应由哪个智能体执行该任务以及可能需要哪些工具。</li>
<li><em><strong>团队(Crew)</strong></em>  一个团队是由一群智能体组成的，每个 <em><strong>智能体(Agent)</strong></em> 都有特定的角色，他们齐心协力实现共同目标。组建团队的过程包括召集代理、定义他们的任务以及建立任务执行顺序。</li>
</ol>
<p><img loading="lazy" src="img/02-crewai-system.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四实验代码">四、实验代码</h2>
<p>在实验章节， 代码内容将分为</p>
<ul>
<li>启动ollama服务</li>
<li>调用llm</li>
<li>设置agent</li>
<li>设置task</li>
<li>组装成crew</li>
<li>最终运行</li>
</ul>
<h3 id="41-启动服务">4.1 启动服务</h3>
<p>在 <em><strong>cmd</strong></em> 中使用命令 <em><strong>ollama serve</strong></em> 启动本地服务。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><br>
<h3 id="42-调用llm">4.2 调用LLM</h3>
<p>在Python中调用开启的ollama服务， 为crewai调用llm做准备。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1">#将ollama的api转化为OPENAI式的api，方便crewai调用</span>
<span class="c1">#设置系统环境变量OPENAI_API_BASE和OPENAI_API_KEY</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&#34;OPENAI_API_BASE&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&#34;http://localhost:11434/v1&#34;</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&#34;OPENAI_API_KEY&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&#34;NA&#34;</span>

<span class="n">llama_model</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="s2">&#34;llama3.1:8b&#34;</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="43-设置agent">4.3 设置Agent</h3>
<p>大邓运营的公众号的日常，一个人身兼数个职位。 大致拆分成三个员工（智能体）</p>
<ul>
<li>内容策划专员</li>
<li>内容创作专员</li>
<li>内容编辑专员</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">crewai</span> <span class="kn">import</span> <span class="n">Agent</span>

<span class="n">planner</span> <span class="o">=</span> <span class="n">Agent</span><span class="p">(</span>
    <span class="n">role</span> <span class="o">=</span> <span class="s2">&#34;内容策划专员&#34;</span><span class="p">,</span>
    <span class="n">goal</span> <span class="o">=</span> <span class="s2">&#34;策划有关</span><span class="si">{topic}</span><span class="s2">的引人入胜且事实准确的内容&#34;</span><span class="p">,</span>
    <span class="n">backstory</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;您是一名内容策划专员，正在计划撰写一篇主题为“</span><span class="si">{topic}</span><span class="s2">”的博客文章， &#34;</span>
        <span class="s2">&#34;文章将发布在 &#39;https://medium.com/&#39;。&#34;</span>
        <span class="s2">&#34;您收集的信息可帮助受众了解某些内容,使受众能因此做出明智的决定。&#34;</span>
        <span class="s2">&#34;您必须准备一份详细的大纲，博客文章中应包含的相关主题和子主题。&#34;</span>
        <span class="s2">&#34;您的工作是内容创作专员撰写此主题文章的基础。&#34;</span>
        <span class="s2">&#34;工作语言是中文。&#34;</span>
    <span class="p">),</span>
    <span class="n">llm</span> <span class="o">=</span> <span class="n">llama_model</span><span class="p">,</span>
    <span class="n">allow_delegation</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>


<span class="n">writer</span> <span class="o">=</span> <span class="n">Agent</span><span class="p">(</span>
    <span class="n">role</span> <span class="o">=</span> <span class="s2">&#34;内容创作专员&#34;</span><span class="p">,</span>
    <span class="n">goal</span> <span class="o">=</span> <span class="s2">&#34;撰写主题</span><span class="si">{topic}</span><span class="s2">的评论文章，要深刻且事实准确&#34;</span><span class="p">,</span>
    <span class="n">backstory</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;您是一名内容编辑专员，正在撰写一篇主题 “</span><span class="si">{topic}</span><span class="s2">” 的新观点文章， &#34;</span>
        <span class="s2">&#34;文章将发表在 &#39;https://medium.com/&#39;。&#34;</span>
        <span class="s2">&#34;内容策划师提供了有关该主题的大纲和相关背景。&#34;</span>
        <span class="s2">&#34;您创作内容时，请遵循内容策划师提供的大纲为主要目标和方向。&#34;</span>
        <span class="s2">&#34;同时您将提供客观公正的见解，并使用内容策划师提供的信息支持您的见解。&#34;</span>
        <span class="s2">&#34;您在观点文章中承认您的陈述是意见，而不是客观陈述。&#34;</span>
        <span class="s2">&#34;工作语言是中文。&#34;</span>
    <span class="p">),</span>
    <span class="n">allow_delegation</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
    <span class="n">llm</span> <span class="o">=</span> <span class="n">llama_model</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>


<span class="n">editor</span> <span class="o">=</span> <span class="n">Agent</span><span class="p">(</span>
    <span class="n">role</span> <span class="o">=</span> <span class="s2">&#34;内容编辑专员&#34;</span><span class="p">,</span>
    <span class="n">goal</span> <span class="o">=</span> <span class="s2">&#34;编辑给定的博客文章，以符合网站 &#39;https://medium.com/&#39; 的写作风格&#34;</span><span class="p">,</span>
    <span class="n">backstory</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;您是一名内容编辑专员，收到内容创作专员发来的博客文章。&#34;</span>
        <span class="s2">&#34;您的目标是审核博客文章，确保其符合新闻业最佳实践，&#34;</span>
        <span class="s2">&#34;在发表意见或主张时提供平衡的观点，并尽可能避免重大争议话题或意见。&#34;</span>
        <span class="s2">&#34;工作语言是中文。&#34;</span>
    <span class="p">),</span>
    <span class="n">llm</span> <span class="o">=</span> <span class="n">llama_model</span><span class="p">,</span>
    <span class="n">allow_delegation</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>
</code></pre></div><p>Parameter reference:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">crewai</span><span class="o">.</span><span class="n">Agent</span><span class="p">(</span><span class="n">role</span><span class="p">,</span> <span class="n">goal</span><span class="p">,</span> <span class="n">backstory</span><span class="p">,</span> <span class="n">llm</span><span class="p">,</span> <span class="n">tools</span><span class="p">,</span> <span class="n">function_calling_llm</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">max_rpm</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">max_execution_time</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">allow_delegation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">step_callback</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">max_retry_limit</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div><ul>
<li><em><strong>role</strong></em>: Defines the agent's function on the crew. It determines the kinds of tasks the agent is best suited for.</li>
<li><em><strong>goal</strong></em>: The individual objective the agent aims to achieve. It guides the agent's decision-making.</li>
<li><em><strong>backstory</strong></em>: Provides context for the agent's role and goal, enriching interaction and collaboration dynamics.</li>
<li><em><strong>llm</strong></em>: (optional) The language model that runs the agent. The model name is picked up dynamically from the <code>OPENAI_MODEL_NAME</code> environment variable, defaulting to "gpt-4" if not specified.</li>
<li><em><strong>tools</strong></em>: (optional) The set of capabilities or functions the agent can use to perform tasks. These should be instances of custom classes compatible with the agent's execution environment. Defaults to an empty list.</li>
<li><em><strong>function_calling_llm</strong></em>: (optional) Specifies the language model that handles tool calls for this agent, overriding the crew-level function-calling LLM if set. Defaults to <code>None</code>.</li>
<li><em><strong>max_iter</strong></em>: (optional) The maximum number of iterations the agent can perform before being forced to give its best answer. Defaults to <code>25</code>.</li>
<li><em><strong>max_rpm</strong></em>: (optional) The maximum number of requests per minute the agent can make, to avoid rate limits. Optional; defaults to <code>None</code>.</li>
<li><em><strong>max_execution_time</strong></em>: (optional) The maximum time the agent may spend executing a task. Optional; defaults to <code>None</code>, meaning no time limit.</li>
<li><em><strong>verbose</strong></em>: (optional) Setting this to <code>True</code> configures the internal logger to emit detailed execution logs, which helps with debugging and monitoring. Defaults to <code>False</code>.</li>
<li><em><strong>allow_delegation</strong></em>: (optional) Whether agents may delegate tasks or questions to one another, so that each task is handled by the most suitable agent. Defaults to <code>True</code>.</li>
<li><em><strong>step_callback</strong></em>: (optional) A function called after each step the agent takes, useful for logging the agent's actions or performing other side effects. Overrides the crew-level <code>step_callback</code>. Defaults to <code>None</code>.</li>
<li><em><strong>cache</strong></em>: (optional) Whether the agent should cache tool usage. Defaults to <code>True</code>.</li>
</ul>
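<p>Note how the agents' <code>goal</code> and <code>backstory</code> strings contain a <code>{topic}</code> placeholder: CrewAI fills these slots from the <code>inputs</code> dict passed to <code>kickoff()</code> later in section 4.5. The interpolation behaves like Python's own <code>str.format</code>; the sketch below is a plain-Python approximation (no crewai required, and <code>interpolate</code> is a hypothetical helper, not a crewai API):</p>

```python
# Plain-Python sketch of how "{topic}" in a goal/backstory template
# is filled from kickoff(inputs=...). CrewAI does this internally;
# str.format approximates the behavior here.

def interpolate(template: str, inputs: dict) -> str:
    """Fill {placeholder} slots in an agent/task template string."""
    return template.format(**inputs)

goal_template = "策划有关{topic}的引人入胜且事实准确的内容"
inputs = {"topic": "Python文本分析"}

print(interpolate(goal_template, inputs))
# 策划有关Python文本分析的引人入胜且事实准确的内容
```

<p>This is also why every placeholder name used in an agent or task string must appear as a key in the <code>inputs</code> dict, or interpolation fails.</p>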
<br>
<h3 id="44-设置task">4.4 Setting Up Tasks</h3>
<p>Each of the three agent roles (content planner, content writer, content editor) has a corresponding <em><strong>task (plan, write, edit)</strong></em>. For each task we need to define the work description and the expected output.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">crewai</span> <span class="kn">import</span> <span class="n">Task</span>

<span class="n">plan</span> <span class="o">=</span> <span class="n">Task</span><span class="p">(</span>
    <span class="n">description</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;1. 优先考虑“</span><span class="si">{topic}</span><span class="s2">”的最新趋势、关键参与者和值得关注的新闻。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;2. 确定目标受众，考虑他们的兴趣和痛点。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;3. 制定详细的内容大纲，包括简介、要点和行动号召。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;4. 包括 SEO 关键字和相关数据或来源。&#34;</span>
    <span class="p">),</span>
    <span class="n">expected_output</span> <span class="o">=</span> <span class="s2">&#34;一份全面的内容计划文档，其中包含大纲、受众分析、SEO 关键字和参考资源。&#34;</span><span class="p">,</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="n">planner</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">write</span> <span class="o">=</span> <span class="n">Task</span><span class="p">(</span>
    <span class="n">description</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;1. 使用内容策划专员的内容策划，撰写一篇关于“</span><span class="si">{topic}</span><span class="s2">”的引人入胜的博客文章。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;2. 自然地融入 SEO 关键词。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;3. 章节/副标题以引人入胜的方式正确命名。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;4. 确保文章结构合理，有引人入胜的介绍、有见地的正文和总结性结论。</span><span class="se">\n</span><span class="s2">&#34;</span>
        <span class="s2">&#34;5. 校对语法错误并与品牌调性保持一致。</span><span class="se">\n</span><span class="s2">&#34;</span>
    <span class="p">),</span>
    <span class="n">expected_output</span> <span class="o">=</span> <span class="s2">&#34;一篇写得很好的、准备发布的 Markdown 格式的博客文章，每个部分应该有 2 或 3 个段落。&#34;</span><span class="p">,</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="n">writer</span><span class="p">,</span>
<span class="p">)</span>


<span class="n">edit</span> <span class="o">=</span> <span class="n">Task</span><span class="p">(</span>
    <span class="n">description</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s2">&#34;校对给定的博客文章&#34;</span>
        <span class="s2">&#34;检查其语法错误并与品牌调性保持一致。&#34;</span>
    <span class="p">),</span>
    <span class="n">expected_output</span> <span class="o">=</span> <span class="s2">&#34;一篇写得很好的、准备发布的 Markdown 格式的博客文章，每个部分应该有 2 或 3 个段落。&#34;</span><span class="p">,</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="n">editor</span>
<span class="p">)</span>

</code></pre></div><br>
<p>Parameter reference:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">crewai</span><span class="o">.</span><span class="n">Task</span><span class="p">(</span><span class="n">description</span><span class="p">,</span> <span class="n">agent</span><span class="p">,</span> <span class="n">expected_output</span><span class="p">,</span> <span class="n">tools</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">async_execution</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">context</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">output_json</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">output_pydantic</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">output_file</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">callback</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">human_input</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><ul>
<li><em><strong>description</strong></em>: A clear, concise statement of what the task entails.</li>
<li><em><strong>agent</strong></em>: The agent responsible for the task, assigned either directly or by the crew's process.</li>
<li><em><strong>expected_output</strong></em>: A detailed description of what task completion looks like.</li>
<li><em><strong>tools</strong></em>: (optional) The capabilities or functions the agent may use to perform the task. Defaults to <code>None</code>.</li>
<li><em><strong>async_execution</strong></em>: (optional) If set, the task executes asynchronously, allowing the crew to proceed without waiting for completion. Defaults to <code>False</code>.</li>
<li><em><strong>context</strong></em>: (optional) Other tasks whose outputs are used as context for this task. Defaults to <code>None</code>.</li>
<li><em><strong>config</strong></em>: (optional) Additional configuration details for the agent executing the task, allowing further customization. Defaults to <code>None</code>.</li>
<li><em><strong>output_json</strong></em>: (optional) Outputs a JSON object; requires an OpenAI client. Only one output format may be set. Defaults to <code>None</code>.</li>
<li><em><strong>output_pydantic</strong></em>: (optional) Outputs a Pydantic model object; requires an OpenAI client. Only one output format may be set. Defaults to <code>None</code>.</li>
<li><em><strong>output_file</strong></em>: (optional) Saves the task output to a file. If used together with <code>output_json</code> or <code>output_pydantic</code>, specifies how the output is saved. Defaults to <code>None</code>.</li>
<li><em><strong>callback</strong></em>: (optional) A Python callable executed with the task's output once the task completes. Defaults to <code>None</code>.</li>
<li><em><strong>human_input</strong></em>: (optional) Indicates whether the task requires human feedback at the end, useful for tasks that need human oversight. Defaults to <code>False</code>.</li>
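<p>By default a crew runs its task list sequentially: the plan task's output becomes context for the write task, whose output in turn feeds the edit task. The following plain-Python sketch mimics that hand-off without crewai (the callables and <code>run_sequential</code> helper are hypothetical stand-ins, not crewai APIs):</p>

```python
# Plain-Python sketch of CrewAI's default sequential process:
# each task's output becomes context for the next task in the list.

def run_sequential(tasks, inputs):
    context = inputs.get("topic", "")
    for task in tasks:           # tasks are plain callables here
        context = task(context)  # output feeds the next task
    return context

plan  = lambda topic: f"outline for '{topic}'"
write = lambda outline: f"draft based on {outline}"
edit  = lambda draft: f"edited {draft}"

result = run_sequential([plan, write, edit], {"topic": "Python文本分析"})
print(result)
# edited draft based on outline for 'Python文本分析'
```

<p>Keeping the task order aligned with this data flow is why the <code>tasks</code> list passed to <code>Crew</code> in the next section is <code>[plan, write, edit]</code> rather than any other order.</p>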
<br>
<h3 id="45-组装运行">4.5 Assemble &amp; Run</h3>
<p>Assemble the three roles (planner, writer, editor) and their corresponding tasks (plan, write, edit) into a single crew, then have the program write an article on the topic 「<strong>topic: Python文本分析</strong>」.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">crewai</span> <span class="kn">import</span> <span class="n">Crew</span>
<span class="c1"># assemble the agents and tasks into a crew</span>
<span class="n">crew</span> <span class="o">=</span> <span class="n">Crew</span><span class="p">(</span>
    <span class="n">agents</span> <span class="o">=</span> <span class="p">[</span><span class="n">planner</span><span class="p">,</span> <span class="n">writer</span><span class="p">,</span> <span class="n">editor</span><span class="p">],</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">plan</span><span class="p">,</span> <span class="n">write</span><span class="p">,</span> <span class="n">edit</span><span class="p">],</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="mi">2</span>
<span class="p">)</span>


<span class="c1"># write an article on the topic &#34;Python文本分析&#34;</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;topic&#34;</span><span class="p">:</span> <span class="s2">&#34;Python文本分析&#34;</span><span class="p">}</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">crew</span><span class="o">.</span><span class="n">kickoff</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inputs</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-mysql" data-lang="mysql"><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">01</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">Working</span><span class="w"> </span><span class="n">Agent</span><span class="p">:</span><span class="w"> </span><span class="err">内容策划专员</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">01</span><span class="p">][</span><span class="n">INFO</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="k">Starting</span><span class="w"> </span><span class="n">Task</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="w"> </span><span class="err">优先考虑“</span><span class="n">Python文本分析</span><span class="err">”的最新趋势、关键参与者和值得关注的新闻。</span><span class="w">
</span><span class="w"></span><span class="mi">2</span><span class="p">.</span><span class="w"> </span><span class="err">确定目标受众，考虑他们的兴趣和痛点。</span><span class="w">
</span><span class="w"></span><span class="mi">3</span><span class="p">.</span><span class="w"> </span><span class="err">制定详细的内容大纲，包括简介、要点和行动号召。</span><span class="w">
</span><span class="w"></span><span class="mi">4</span><span class="p">.</span><span class="w"> </span><span class="err">包括</span><span class="w"> </span><span class="n">SEO</span><span class="w"> </span><span class="err">关键字和相关数据或来源。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">&gt;</span><span class="w"> </span><span class="n">Entering</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="n">CrewAgentExecutor</span><span class="w"> </span><span class="n">chain</span><span class="p">...</span><span class="w">
</span><span class="w"></span><span class="err">我在撰写关于“</span><span class="n">Python文本分析</span><span class="err">”时已进行了详细的调研和准备。现在我可以制定出一份具有深度及准确性的计划文档，并针对各个要素提供详述答案：</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">### Final Answer: Python文本分析全面内容策划
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">#### 1. 引言——最新趋势、关键参与者与新闻
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">介绍</span><span class="n">python在自然语言处理领域的地位</span><span class="err">，包括</span><span class="n">BERT</span><span class="p">,</span><span class="w"> </span><span class="n">RoBERTa等前沿模型</span><span class="err">。引用当前的科技和学术报道作为案例，比如自然语言理解（</span><span class="n">NLU</span><span class="err">）技术如何用于构建更智能的语言助手、情绪分析（</span><span class="n">Sentiment</span><span class="w"> </span><span class="n">Analysis</span><span class="err">）、文本摘要、信息检索等领域的发展动态。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="o">**</span><span class="err">趋势</span><span class="o">**</span><span class="err">：突出像生成对抗网络（</span><span class="n">GANs</span><span class="err">）在文本合成中、解释性的预估模型或者深度语义理解和对话系统等方面的最新进展。</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="o">**</span><span class="err">关键参与者</span><span class="o">**</span><span class="err">：提及与</span><span class="n">Python生态紧密相关的开发者框架</span><span class="err">（如</span><span class="n">spaCy</span><span class="err">，</span><span class="n">NLTK</span><span class="err">），及顶级科技企业（例如</span><span class="n">IBM</span><span class="w"> </span><span class="n">Watson</span><span class="w"> </span><span class="n">AI</span><span class="p">,</span><span class="w"> </span><span class="n">Google</span><span class="err">）的领导角色。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 2. 目标受众
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">该篇文章旨在满足数据分析师、数据科学家、自然语言处理研究人员以及对机器学习兴趣浓厚的学习者。他们的兴趣可能偏向于如何提高开发效率、探索文本与情感分析的技术细节，或者是希望将文本分析技术应用到某个特定领域，如市场调研、舆情监控等。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 3. 内容构建大纲
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">##### 框架一：基础知识
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">“理解</span><span class="n">Python文本处理库</span><span class="err">”（例如：</span><span class="o">`</span><span class="n">nltk</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">spaCy</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">Gensim</span><span class="o">`</span><span class="err">）</span><span class="w">
</span><span class="w">  </span><span class="o">-</span><span class="w"> </span><span class="err">图文并茂教程展示简单文本预处理和分析的方法，如标记化、停用词移除、词干提取等。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">##### 框架二：实践案例
</span><span class="c1"></span><span class="w">  </span><span class="o">-</span><span class="w"> </span><span class="err">“从文字到洞察力”实例解析</span><span class="w">
</span><span class="w">  </span><span class="o">-</span><span class="w"> </span><span class="err">介绍不同领域利用文本分析的实用场景及应用策略（比如产品评论分析、股票预测中的文本情感指标使用）</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">##### 详细步骤：
</span><span class="c1">### 应用实践篇：
</span><span class="c1"></span><span class="err">《</span><span class="mi">1</span><span class="err">周完成</span><span class="n">NLP基础</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="err">初恋你的</span><span class="nf">Python助手</span><span class="w"> </span><span class="p">(</span><span class="err">自然语言处理入门实践</span><span class="p">)</span><span class="err">》，内容包括从</span><span class="n">Python环境配置到常用库实战讲解</span><span class="err">，以及常见的问题解决和技巧分享。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### SEO关键词
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="kt">text</span><span class="w"> </span><span class="n">mining</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">情感分析</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">饭碗推荐文本挖掘</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="err">聚类代码</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="n">nlp项目</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="n">sentiment</span><span class="w"> </span><span class="n">analysis</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">使用</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="err">进行文档情感分类</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="err">培训模型</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="err">教程</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">深度学习用于</span><span class="n">python文本理解的实现</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">#### 参考资源与资料
</span><span class="c1"></span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="w"> </span><span class="o">`</span><span class="n">Pudim</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="n">F</span><span class="p">.,</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">Rezende</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="p">.</span><span class="w"> </span><span class="p">(</span><span class="mi">2019</span><span class="p">).</span><span class="w"> </span><span class="n">Practical</span><span class="w"> </span><span class="n">Named</span><span class="w"> </span><span class="n">Entity</span><span class="w"> </span><span class="n">Recognition</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">PyTorch</span><span class="err">’</span><span class="n">s</span><span class="w"> </span><span class="n">WordPiece</span><span class="w"> </span><span class="n">Tokenizer</span><span class="p">.</span><span class="w"> </span><span class="n">GitHub</span><span class="w"> </span><span class="n">Pages</span><span class="p">.</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="w"> </span><span class="n">Bergelson</span><span class="err">，</span><span class="n">A</span><span class="err">。（</span><span class="n">n</span><span class="p">.</span><span class="n">d</span><span class="p">.</span><span class="err">）《</span><span class="n">NLP</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">Scratch</span><span class="err">》</span><span class="n">Google</span><span class="w"> </span><span class="n">Slides教程</span><span class="p">.</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">通过提供这样的策划结构，并确保与</span><span class="n">SEO相关的关键字</span><span class="err">，该文章会成为一个引人入胜的资源站，满足目标客户群的需要。最终输出内容需结合提供的格式、目标和要求来组织具体细节或实例，请务必严格遵循指定的结构方式完成此任务。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">&gt;</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">chain</span><span class="p">.</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">20</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="p">[</span><span class="err">内容策划专员</span><span class="p">]</span><span class="w"> </span><span class="n">Task</span><span class="w"> </span><span class="n">output</span><span class="p">:</span><span class="w"> </span><span class="n">Python文本分析全面内容策划</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 1. 引言——最新趋势、关键参与者与新闻
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">介绍</span><span class="n">python在自然语言处理领域的地位</span><span class="err">，包括</span><span class="n">BERT</span><span class="p">,</span><span class="w"> </span><span class="n">RoBERTa等前沿模型</span><span class="err">。引用当前的科技和学术报道作为案例，比如自然语言理解（</span><span class="n">NLU</span><span class="err">）技术如何用于构建更智能的语言助手、情绪分析（</span><span class="n">Sentiment</span><span class="w"> </span><span class="n">Analysis</span><span class="err">）、文本摘要、信息检索等领域的发展动态。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="o">**</span><span class="err">趋势</span><span class="o">**</span><span class="err">：突出像生成对抗网络（</span><span class="n">GANs</span><span class="err">）在文本合成中、解释性的预估模型或者深度语义理解和对话系统等方面的最新进展。</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="o">**</span><span class="err">关键参与者</span><span class="o">**</span><span class="err">：提及与</span><span class="n">Python生态紧密相关的开发者框架</span><span class="err">（如</span><span class="n">spaCy</span><span class="err">，</span><span class="n">NLTK</span><span class="err">），及顶级科技企业（例如</span><span class="n">IBM</span><span class="w"> </span><span class="n">Watson</span><span class="w"> </span><span class="n">AI</span><span class="p">,</span><span class="w"> </span><span class="n">Google</span><span class="err">）的领导角色。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 2. 目标受众
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">该篇文章旨在满足数据分析师、数据科学家、自然语言处理研究人员以及对机器学习兴趣浓厚的学习者。他们的兴趣可能偏向于如何提高开发效率、探索文本与情感分析的技术细节，或者是希望将文本分析技术应用到某个特定领域，如市场调研、舆情监控等。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 3. 内容构建大纲
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">##### 框架一：基础知识
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">“理解</span><span class="n">Python文本处理库</span><span class="err">”（例如：</span><span class="o">`</span><span class="n">nltk</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">spaCy</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">Gensim</span><span class="o">`</span><span class="err">）</span><span class="w">
</span><span class="w">  </span><span class="o">-</span><span class="w"> </span><span class="err">图文并茂教程展示简单文本预处理和分析的方法，如标记化、停用词移除、词干提取等。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">##### 框架二：实践案例
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">“从文字到洞察力”实例解析</span><span class="w">
</span><span class="w">  </span><span class="o">-</span><span class="w"> </span><span class="err">介绍不同领域利用文本分析的实用场景及应用策略（比如产品评论分析、股票预测中的文本情感指标使用）</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">##### 详细步骤：
</span><span class="c1">### 应用实践篇：
</span><span class="c1"></span><span class="err">《</span><span class="mi">1</span><span class="err">周完成</span><span class="n">NLP基础</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="err">初恋你的</span><span class="nf">Python助手</span><span class="w"> </span><span class="p">(</span><span class="err">自然语言处理入门实践</span><span class="p">)</span><span class="err">》，内容包括从</span><span class="n">Python环境配置到常用库实战讲解</span><span class="err">，以及常见的问题解决和技巧分享。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### SEO关键词
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="kt">text</span><span class="w"> </span><span class="n">mining</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">情感分析</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">饭碗推荐文本挖掘</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="err">聚类代码</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="n">nlp项目</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="n">sentiment</span><span class="w"> </span><span class="n">analysis</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">使用</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="err">进行文档情感分类</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="err">培训模型</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="err">教程</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">深度学习用于</span><span class="n">python文本理解的实现</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">#### 参考资源与资料
</span><span class="c1"></span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="w"> </span><span class="o">`</span><span class="n">Pudim</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="n">F</span><span class="p">.,</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">Rezende</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="p">.</span><span class="w"> </span><span class="p">(</span><span class="mi">2019</span><span class="p">).</span><span class="w"> </span><span class="n">Practical</span><span class="w"> </span><span class="n">Named</span><span class="w"> </span><span class="n">Entity</span><span class="w"> </span><span class="n">Recognition</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">PyTorch</span><span class="err">’</span><span class="n">s</span><span class="w"> </span><span class="n">WordPiece</span><span class="w"> </span><span class="n">Tokenizer</span><span class="p">.</span><span class="w"> </span><span class="n">GitHub</span><span class="w"> </span><span class="n">Pages</span><span class="p">.</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="w"> </span><span class="n">Bergelson</span><span class="err">，</span><span class="n">A</span><span class="err">。（</span><span class="n">n</span><span class="p">.</span><span class="n">d</span><span class="p">.</span><span class="err">）《</span><span class="n">NLP</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">Scratch</span><span class="err">》</span><span class="n">Google</span><span class="w"> </span><span class="n">Slides教程</span><span class="p">.</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">通过提供这样的策划结构，并确保与</span><span class="n">SEO相关的关键字</span><span class="err">，该文章会成为一个引人入胜的资源站，满足目标客户群的需要。最终输出内容需结合提供的格式、目标和要求来组织具体细节或实例，请务必严格遵循指定的结构方式完成此任务。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">20</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">Working</span><span class="w"> </span><span class="n">Agent</span><span class="p">:</span><span class="w"> </span><span class="err">内容创作专员</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">20</span><span class="p">][</span><span class="n">INFO</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="k">Starting</span><span class="w"> </span><span class="n">Task</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="w"> </span><span class="err">使用内容策划专员的内容策划，撰写一篇关于“</span><span class="n">Python文本分析</span><span class="err">”的引人入胜的博客文章。</span><span class="w">
</span><span class="w"></span><span class="mi">2</span><span class="p">.</span><span class="w"> </span><span class="err">自然地融入</span><span class="w"> </span><span class="n">SEO</span><span class="w"> </span><span class="err">关键词。</span><span class="w">
</span><span class="w"></span><span class="mi">3</span><span class="p">.</span><span class="w"> </span><span class="err">章节</span><span class="o">/</span><span class="err">副标题以引人入胜的方式正确命名。</span><span class="w">
</span><span class="w"></span><span class="mi">4</span><span class="p">.</span><span class="w"> </span><span class="err">确保文章结构合理，有引人入胜的介绍、有见地的正文和总结性结论。</span><span class="w">
</span><span class="w"></span><span class="mi">5</span><span class="p">.</span><span class="w"> </span><span class="err">校对语法错误并与品牌调性保持一致。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">&gt;</span><span class="w"> </span><span class="n">Entering</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="n">CrewAgentExecutor</span><span class="w"> </span><span class="n">chain</span><span class="p">...</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">Title: Python文本分析的未来前沿及实操指南 
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">### 引言 - 最新趋势、关键参与者与新闻
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">#### 1引路 - 在自然语言理解领域的新高度
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="nf">Python正引领着NLP</span><span class="p">(</span><span class="err">自然语言处理</span><span class="p">)</span><span class="err">潮流，尤其是基于</span><span class="n">BERT</span><span class="err">（</span><span class="n">Bidirectional</span><span class="w"> </span><span class="n">Encoder</span><span class="w"> </span><span class="n">Representations</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">Transformers</span><span class="err">）与</span><span class="n">RoBERTa的创新</span><span class="err">。这些模型在《自然》（</span><span class="n">Nature</span><span class="err">）等顶级学术期刊上被频繁讨论用于构建更人性化的人工智能助手，深度分析和解读情绪、实现文本摘要以及改善信息检索系统等方面有飞速进步。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">##### * **前沿进展** ：
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">创新的文本生成技术包括对文字合成</span><span class="n">Gan</span><span class="err">（</span><span class="n">Generative</span><span class="w"> </span><span class="n">Adversarial</span><span class="w"> </span><span class="n">Network</span><span class="err">）领域，使得生成自然的语言成为可能。</span><span class="w">
</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="err">同时利用深度学习技术为语义理解和对话系统带来突破，在《麻省理工科技评论》等平台中分享实例。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 见识顶级领导者及其所贡献
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">在这一领域</span><span class="n">Python的开发者框架如</span><span class="o">`</span><span class="n">spaCy</span><span class="o">`</span><span class="p">(</span><span class="err">一个专用于</span><span class="n">NLP编程接口的强大库</span><span class="p">)</span><span class="err">，和像</span><span class="n">IBM</span><span class="w"> </span><span class="n">Watson</span><span class="w"> </span><span class="n">AI这样的大企业</span><span class="err">，通过整合这些先进模型在多个层面上推动产业发展。他们不断地对用户需求做出响应，使得</span><span class="n">Python文本分析的未来前景无限</span><span class="err">。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">### **目标受众**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">本文瞄准几类核心读者：数据分析师、数据科学家、自然语言处理（</span><span class="n">NLP</span><span class="err">）领域学者或任何关注机器学习进展和寻找提升开发效率的开发者及研究人员个体或团队。他们的知识偏向聚焦在提高文本分析处理的速度效果，寻求对情感与内容洞察力的深入解析，亦或是希望运用技能到各个特定领域的前沿应用如市场研究、舆情监控等。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">### **内容构建大纲及结构框架概览**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">以下是通过具体指导和实用例程为初学者或</span><span class="n">NLP专攻研究人员打造Python文本分析之旅的整体流程蓝图</span><span class="err">：</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### 主框1：基础知识的全面解读
</span><span class="c1">##### &#39;理解Python文字处理库&#39;: 综合了nltk、spaCy等热门的NLTK库，并附上了图形化的使用步骤。
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">#### 全面实践概览：
</span><span class="c1"></span><span class="o">**</span><span class="err">《一周</span><span class="n">NLP基础</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="err">初习你的</span><span class="n">Python助手</span><span class="err">》项目</span><span class="o">**</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="o">**</span><span class="err">一、入门环境搭建</span><span class="o">**</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="w">  </span><span class="err">在一个可遵循的实际实例指南中，阐述如何配置</span><span class="n">Python开发环境并将基本概念带入实践</span><span class="err">。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### **从文字至洞见的实操探索：案例解析**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">##### 实例 **不同领域的NLP应用与策略**:
</span><span class="c1"></span><span class="err">展示产品评论分析、情感分类的文档分类以及在股市预测中的文本感受价值指标运用等实例，并提供具体的方法、技术和背后理论知识概述</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">```</span><span class="n">markdown</span><span class="w">
</span><span class="w"></span><span class="err">使用代码片段，可视化数据及其相关文本处理</span><span class="o">/</span><span class="err">分析结果展示（文本清理、特征工程、模型训练），并阐述结果解释。</span><span class="w">
</span><span class="w"></span><span class="o">```</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">#### **实践阶段**：
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">选择项目，进行文档情感分类</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="err">在实际场景应用</span><span class="n">NLP技术解决问题</span><span class="err">。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">### 基础与进阶工具学习：
</span><span class="c1"></span><span class="err">针对特定领域案例提供深入理解并指导如何在</span><span class="n">Python中实施文本处理</span><span class="err">（比如</span><span class="n">N</span><span class="o">-</span><span class="n">gram模型</span><span class="err">、</span><span class="n">TF</span><span class="o">-</span><span class="n">IDF矢量化</span><span class="err">、聚类分析等）</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">#### **可调用资源与参考资料**
</span><span class="c1">#### [&#39;Pudim&#39;, &#39;2019&#39;] - 具体验丰富的示例来实现NER(命名实体识别)及WordPiece分词。
</span><span class="c1">#### [Bergelson，A](https://www.tutorial.technology/courses-n/nlpprogrammer/presentation.html#-867528)- 提供的从零初学者进阶高级使用者的一流课程材料。
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">#### **优化、检查与代码审查准则**:
</span><span class="c1"></span><span class="err">在实施文本分析时遵循清晰规范和良好的代码审查习惯。确保语法结构无失且内容逻辑连贯顺畅，同时保持可读性和易懂度。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">本文遵循了</span><span class="n">SEO关键词列表</span><span class="err">（例如：</span><span class="n">python</span><span class="w"> </span><span class="kt">text</span><span class="w"> </span><span class="n">mining</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">数据清洗库使用</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="err">追踪情感指标与市场趋势相关）。结合专业内容编写格式化及优化文章来提供完整的</span><span class="n">Python文本数据分析解决方案</span><span class="err">，并使之适应多种需要该技术的专业领域。确保文章简洁、逻辑有序且实用可操作性强。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">
</span><span class="c1">### 微博
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="n">Thought</span><span class="p">:</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">give</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">comprehensive</span><span class="w"> </span><span class="n">answer</span><span class="w"> </span><span class="k">to</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">post</span><span class="w"> 
</span><span class="w">
</span><span class="w"></span><span class="n">Final</span><span class="w"> </span><span class="n">Answer</span><span class="p">:</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">Title: **未来前沿 Python文本分析：新潮和实操指南**
</span><span class="c1">#### **内容概览**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="o">**</span><span class="err">未来动态与趋势引领</span><span class="o">**</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="mi">1</span><span class="err">\</span><span class="p">.</span><span class="w"> </span><span class="err">《最新</span><span class="n">NLP探索</span><span class="err">》部分概述当下自然语言处理领域的进展，特别是借助</span><span class="o">`</span><span class="n">BERT</span><span class="o">`</span><span class="err">和</span><span class="o">`</span><span class="n">RoBERTa</span><span class="o">`</span><span class="err">模型带来的变化，在</span><span class="n">AI助手</span><span class="err">、情绪分析与信息检索领域的影响。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">### 核心读者定位：
</span><span class="c1"></span><span class="o">-</span><span class="w"> </span><span class="err">数据分析师</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="err">高级数据科学家</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="n">NLP学术学者</span><span class="w">
</span><span class="w"></span><span class="err">目标群体专注于提高文本数据的理解，并寻求更深层次的情报化提取技巧或专门领域的应用方案。</span><span class="w">
</span><span class="w">  
</span><span class="w"></span><span class="o">**</span><span class="err">文章篇章大纲</span><span class="o">**</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="mi">1</span><span class="p">.</span><span class="w"> </span><span class="o">**</span><span class="err">基本指南与</span><span class="n">NLD库简介</span><span class="o">**</span><span class="p">:</span><span class="w"> </span><span class="err">就多个热门</span><span class="n">NLP处理包如</span><span class="o">`</span><span class="n">spaCy</span><span class="o">`</span><span class="err">、</span><span class="o">`</span><span class="n">scikit</span><span class="o">-</span><span class="n">learn</span><span class="w"> </span><span class="n">NLTK</span><span class="o">`</span><span class="err">的详细用法进行演示，辅以图像驱动教育视频提升理解度。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">### 无缝上手**：一周计划**构建NLP项目
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="c1">#### 开启入门环境
</span><span class="c1"></span><span class="err">设置基础开发平台到可实现特定示例的小环境（搭建与优化工作流程）；涵盖步骤覆盖：</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="n">Python脚本语言准备</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="mi">2</span><span class="err">\</span><span class="p">.</span><span class="w"> </span><span class="err">从</span><span class="o">`</span><span class="err">文本分析项目构建：情绪感知，数据整理</span><span class="o">`</span><span class="err">到应用实际场景，包含文本处理、情感分类技术实操；</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="w"> </span><span class="err">结合案例讨论如社交媒体、股市等情境中的文本洞察能力。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="mi">3</span><span class="err">\</span><span class="p">.</span><span class="w"> </span><span class="o">**</span><span class="n">NLU工具与进阶技巧应用深度分析项目</span><span class="o">**</span><span class="p">:</span><span class="w"> </span><span class="err">分析</span><span class="n">N</span><span class="o">-</span><span class="n">Gram模型及TF</span><span class="o">-</span><span class="n">IDF向量化基础概念</span><span class="err">，并引入聚类算法理论讲解，提供案例代码实践（使用</span><span class="o">`</span><span class="n">scikit</span><span class="o">-</span><span class="n">learn</span><span class="o">`</span><span class="err">实现，解释实际场景中的潜在应用）。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="c1">### 实操资源：
</span><span class="c1"></span><span class="w">
</span><span class="w">   </span><span class="err">《可复用实例目录》</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Pudim&#39;</span><span class="p">,</span><span class="err">《从头至尾理解</span><span class="n">NLI及数据处理方法</span><span class="err">》，</span><span class="p">[</span><span class="err">更多来自</span><span class="n">Bergelson的教程</span><span class="p">](</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">www</span><span class="p">.</span><span class="n">tutorialplatform</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">learning</span><span class="o">-</span><span class="n">path</span><span class="o">-</span><span class="k">for</span><span class="o">-</span><span class="n">nlp</span><span class="p">)</span><span class="err">，进一步的</span><span class="n">Python文本资源链接</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">**</span><span class="err">写作与</span><span class="n">SEO策略结合</span><span class="o">**</span><span class="p">:</span><span class="w"> </span><span class="err">使用专业术语优化文章关键词布局（如</span><span class="w"> </span><span class="o">**`</span><span class="n">NLTK</span><span class="p">,</span><span class="w"> </span><span class="n">BERT</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">AI</span><span class="p">,</span><span class="err">情绪分析</span><span class="p">,</span><span class="w"> </span><span class="err">信息检索</span><span class="o">`</span><span class="p">.</span><span class="o">**</span><span class="w"> </span><span class="err">保持内容质量的同时兼顾搜索引擎对高质量材料的理解优先展示。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">Thought: I now can provide comprehensive answers for this post   
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="n">Final</span><span class="w"> </span><span class="n">Answer</span><span class="p">:</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">《全面掌握Python文本分析：未来展望及实务导览》，这篇文章将带领读者探索NLP领域中的新潮动态，并通过实战实操提升用户在特定业务场景下的应用能力，旨在增强对于文本数据的认识及利用价值。
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">从</span><span class="o">**</span><span class="err">初步概述</span><span class="o">**</span><span class="err">至深度解析</span><span class="n">NLP基础知识和</span><span class="o">**</span><span class="err">热门工具使用说明</span><span class="o">**</span><span class="err">，再到针对实际问题的深入探讨直至案例整合策略，内容涵盖了广泛的主题，结合实用代码实例和最新研究资源，以供</span><span class="n">NLP使用者深入了解技术并创新解决方案的实际应用</span><span class="err">。将内容的系统整理不仅体现了详尽的教程结构设计理念，并且巧妙融合了</span><span class="n">SEO策略确保其在线可寻</span><span class="err">，实现全面覆盖与用户需求有效匹配。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">通过《面向数据分析师至</span><span class="n">NLP学术领军人物的专业导向文章</span><span class="err">》，为行业从业者引入</span><span class="n">Python在文字解析</span><span class="err">、理解以及处理过程中提供的多样化视角和实际落地方案。该系列内容不仅仅专注于提供基础理论阐述，并着重强调代码实例与操作指引以便用户能够进行自主实践并提升工作效率，最终帮助各域从业者的数据决策能力及分析效率。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">&gt;</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">chain</span><span class="p">.</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">16</span><span class="p">:</span><span class="mi">26</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="p">[</span><span class="err">内容创作专员</span><span class="p">]</span><span class="w"> </span><span class="n">Task</span><span class="w"> </span><span class="n">output</span><span class="p">:</span><span class="w"> </span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">《全面掌握Python文本分析：未来展望及实务导览》，这篇文章将带领读者探索NLP领域中的新潮动态，并通过实战实操提升用户在特定业务场景下的应用能力，旨在增强对于文本数据的认识及利用价值。
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">从</span><span class="o">**</span><span class="err">初步概述</span><span class="o">**</span><span class="err">至深度解析</span><span class="n">NLP基础知识和</span><span class="o">**</span><span class="err">热门工具使用说明</span><span class="o">**</span><span class="err">，再到针对实际问题的深入探讨直至案例整合策略，内容涵盖了广泛的主题，结合实用代码实例和最新研究资源，以供</span><span class="n">NLP使用者深入了解技术并创新解决方案的实际应用</span><span class="err">。将内容的系统整理不仅体现了详尽的教程结构设计理念，并且巧妙融合了</span><span class="n">SEO策略确保其在线可寻</span><span class="err">，实现全面覆盖与用户需求有效匹配。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">通过《面向数据分析师至</span><span class="n">NLP学术领军人物的专业导向文章</span><span class="err">》，为行业从业者引入</span><span class="n">Python在文字解析</span><span class="err">、理解以及处理过程中提供的多样化视角和实际落地方案。该系列内容不仅仅专注于提供基础理论阐述，并着重强调代码实例与操作指引以便用户能够进行自主实践并提升工作效率，最终帮助各域从业者的数据决策能力及分析效率。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">16</span><span class="p">:</span><span class="mi">26</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">Working</span><span class="w"> </span><span class="n">Agent</span><span class="p">:</span><span class="w"> </span><span class="err">内容编辑专员</span><span class="w">
</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">16</span><span class="p">:</span><span class="mi">26</span><span class="p">][</span><span class="n">INFO</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="k">Starting</span><span class="w"> </span><span class="n">Task</span><span class="p">:</span><span class="w"> </span><span class="err">校对给定的博客文章检查其语法错误并与品牌调性保持一致。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">&gt;</span><span class="w"> </span><span class="n">Entering</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="n">CrewAgentExecutor</span><span class="w"> </span><span class="n">chain</span><span class="p">...</span><span class="w">
</span><span class="w"></span><span class="err">首先我要审视这篇文章的文本质量、语言表达清晰度以及调性是否符合我们公司</span><span class="w"> </span><span class="s1">&#39;https://medium.com/&#39;</span><span class="w"> </span><span class="err">的品牌特点。然后，我会寻找可能的语法错误，并修改为正确的表述。同时，确保文本结构清晰有序并对每个段落给予足够的段落数量。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="n">Final</span><span class="w"> </span><span class="n">Answer</span><span class="p">:</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">**全面掌握Python文本分析：未来展望及实务导览**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">这篇文章将带领读者以前瞻性的视野探索自然语言处理（</span><span class="n">NLP</span><span class="err">）新动态和挑战所在，并通过实战导向的内容实操提升在特定业务场景下的技术能力，旨在加深大家对文本数据的认识与价值深度汲取。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">从</span><span class="w"> </span><span class="o">**</span><span class="err">初步简介</span><span class="o">**</span><span class="w"> </span><span class="err">至</span><span class="w"> </span><span class="o">**</span><span class="err">深入了解</span><span class="w"> </span><span class="n">NLP</span><span class="w"> </span><span class="err">的基础知识及其热门工具</span><span class="o">**</span><span class="w"> </span><span class="err">，我们逐步深入到实际问题的剖析直至策略整合的实战探讨。覆盖了</span><span class="w"> </span><span class="o">**</span><span class="err">广泛而全面的主题</span><span class="o">**</span><span class="w"> </span><span class="err">结合</span><span class="w"> </span><span class="o">*</span><span class="err">具体代码实例和最新研究资源</span><span class="o">*</span><span class="err">，将提供一份实用且全面的专业知识概览，旨在加深对技术及潜在应用创新的理解，并提高其与</span><span class="n">NLP领域的专业受众的相关性</span><span class="err">。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">作为《面向</span><span class="w"> </span><span class="o">**</span><span class="err">数据分析师至</span><span class="n">NLP学术领军人物的专业导向作品</span><span class="o">***</span><span class="err">，我们为所有从事信息处理业务的行业从业者提供一个多样视域和实践解决方案。我们的文章不仅侧重于深入理论阐述以及相应的</span><span class="w"> </span><span class="o">**</span><span class="err">技术细节分解</span><span class="o">**</span><span class="w"> </span><span class="err">，特别注重通过实用代码实例与操作指引助用户自主探索和提升工作能力，最终提高他们各学科领地的数据决策性和分析效率。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">我们坚信这番准备发布的内容将以一种专业且充满实用性的方式吸引对</span><span class="n">Python文本算法研究及其应用有深入理解的需求群体</span><span class="err">，为未来技术发展和解决实际问题提供有力支持。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">-- 
</span><span class="c1">
</span><span class="c1">我检查的文章结构是否保持一致，并确保各个段落都有 2 或 3 句。此外，我在写作表达上与原有原文进行了对比调整，旨在提升其品质及符合网站风格指南。
</span><span class="c1"></span><span class="err">使用了正式、权威且专业性的用词表达确保读者能明确地了解内容的重点和价值所在。</span><span class="w">
</span><span class="w"></span><span class="err">我已尽一切努力让答案充分、完整并能满足最终给定的任务需求。</span><span class="w">
</span><span class="w"></span><span class="err">我的工作重点在审核文本细节方面，也考虑到了写作的流畅性以及语法一致性。</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">&gt; Finished chain.
</span><span class="c1"></span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">05</span><span class="w"> </span><span class="mi">22</span><span class="p">:</span><span class="mi">16</span><span class="p">:</span><span class="mi">35</span><span class="p">][</span><span class="n">DEBUG</span><span class="p">]:</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="p">[</span><span class="err">内容编辑专员</span><span class="p">]</span><span class="w"> </span><span class="n">Task</span><span class="w"> </span><span class="n">output</span><span class="p">:</span><span class="w"> </span><span class="o">-</span><span class="c1">--
</span><span class="c1">**全面掌握Python文本分析：未来展望及实务导览**
</span><span class="c1"></span><span class="w">
</span><span class="w"></span><span class="err">这篇文章将带领读者以前瞻性的视野探索自然语言处理（</span><span class="n">NLP</span><span class="err">）新动态和挑战所在，并通过实战导向的内容实操提升在特定业务场景下的技术能力，旨在加深大家对文本数据的认识与价值深度汲取。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">从</span><span class="w"> </span><span class="o">**</span><span class="err">初步简介</span><span class="o">**</span><span class="w"> </span><span class="err">至</span><span class="w"> </span><span class="o">**</span><span class="err">深入了解</span><span class="w"> </span><span class="n">NLP</span><span class="w"> </span><span class="err">的基础知识及其热门工具</span><span class="o">**</span><span class="w"> </span><span class="err">，我们逐步深入到实际问题的剖析直至策略整合的实战探讨。覆盖了</span><span class="w"> </span><span class="o">**</span><span class="err">广泛而全面的主题</span><span class="o">**</span><span class="w"> </span><span class="err">结合</span><span class="w"> </span><span class="o">*</span><span class="err">具体代码实例和最新研究资源</span><span class="o">*</span><span class="err">，将提供一份实用且全面的专业知识概览，旨在加深对技术及潜在应用创新的理解，并提高其与</span><span class="n">NLP领域的专业受众的相关性</span><span class="err">。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">作为《面向</span><span class="w"> </span><span class="o">**</span><span class="err">数据分析师至</span><span class="n">NLP学术领军人物的专业导向作品</span><span class="o">***</span><span class="err">，我们为所有从事信息处理业务的行业从业者提供一个多样视域和实践解决方案。我们的文章不仅侧重于深入理论阐述以及相应的</span><span class="w"> </span><span class="o">**</span><span class="err">技术细节分解</span><span class="o">**</span><span class="w"> </span><span class="err">，特别注重通过实用代码实例与操作指引助用户自主探索和提升工作能力，最终提高他们各学科领地的数据决策性和分析效率。</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="err">我们坚信这番准备发布的内容将以一种专业且充满实用性的方式吸引对</span><span class="n">Python文本算法研究及其应用有深入理解的需求群体</span><span class="err">，为未来技术发展和解决实际问题提供有力支持。</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">-- 
</span><span class="c1">
</span><span class="c1">我检查的文章结构是否保持一致，并确保各个段落都有 2 或 3 句。此外，我在写作表达上与原有原文进行了对比调整，旨在提升其品质及符合网站风格指南。
</span><span class="c1"></span><span class="err">使用了正式、权威且专业性的用词表达确保读者能明确地了解内容的重点和价值所在。</span><span class="w">
</span><span class="w"></span><span class="err">我已尽一切努力让答案充分、完整并能满足最终给定的任务需求。</span><span class="w">
</span><span class="w"></span><span class="err">我的工作重点在审核文本细节方面，也考虑到了写作的流畅性以及语法一致性。</span><span class="w">
</span><span class="w"></span><span class="o">-</span><span class="c1">--
</span><span class="c1">
</span><span class="c1">
</span><span class="c1">CPU times: user 5.71 s, sys: 1.76 s, total: 7.47 s
</span><span class="c1"></span><span class="n">Wall</span><span class="w"> </span><span class="kt">time</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="n">min</span><span class="w"> </span><span class="mi">33</span><span class="n">s</span><span class="w">
</span></code></pre></div><br>
<h2 id="五渲染内容">五、渲染内容</h2>
<p>将智能体生成的 Markdown 内容渲染出来，一起看看 AI 产出的最终成果。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Markdown</span><span class="p">,</span><span class="n">display</span>
<span class="n">display</span><span class="p">(</span><span class="n">Markdown</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">result</span><span class="p">)[</span><span class="s1">&#39;tasks_output&#39;</span><span class="p">][</span><span class="mi">0</span><span class="p">])[</span><span class="s1">&#39;raw&#39;</span><span class="p">]))</span>
</code></pre></div><p><img loading="lazy" src="img/07-result.png" alt=""  />
</p>
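<p>上面取结果时嵌套了两层 <code>dict()</code>，可以封装成一个小函数统一处理。下面是一个极简示意（假设 <code>result</code> 可被转为包含 <code>tasks_output</code> 列表的字典，且每个任务输出含 <code>raw</code> 字段，与上文 CrewAI 的输出结构一致）：</p>

```python
def get_raw_output(result, idx=0):
    """从 CrewAI 的运行结果中取出第 idx 个任务的原始文本。

    假设 result 可被 dict() 转为字典，含 'tasks_output' 列表，
    列表中每项也可转为含 'raw' 键的字典（与上文示例一致）。
    """
    task = dict(result)['tasks_output'][idx]
    return dict(task)['raw']
```

<p>之后即可用 <code>display(Markdown(get_raw_output(result)))</code> 渲染任意一个任务的输出。</p>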
<br>
<p>生成的内容比较一般，看来暂时还无法躺平。虽然智能体做不了太难的事情，但我感觉让它做数据标注、信息提取，应该问题不大，大家可以再试试。希望通过本文的实战案例，让大家快速熟悉并上手 <em><strong>Ollama</strong></em> 和 <em><strong>CrewAI</strong></em> 框架，力争让大家都能在本地搭建自己的多智能体自动化工具。</p>
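<p>顺着“让智能体做数据标注”这个思路，可以先把标注任务写成固定格式的提示语，再交给本地模型判断。下面是一个极简示意（其中 <code>ollama.chat</code> 的调用方式与前文一致；模型名 <code>qwen2</code> 仅为假设，可换成任意本地已下载的模型）：</p>

```python
def build_label_prompt(text, labels=('正面', '负面', '中性')):
    """构造情感标注提示语：要求模型只返回标签本身，便于后续解析。"""
    label_str = '、'.join(labels)
    return f'请判断下面这条评论的情感类别，只返回 {label_str} 中的一个词。\n评论：{text}'

# 假设本地已启动 Ollama 服务，实际调用大致如下（示意，未在此运行）：
# import ollama
# resp = ollama.chat(model='qwen2',
#                    messages=[{'role': 'user',
#                               'content': build_label_prompt('物流很快，质量也不错')}])
# print(resp['message']['content'])
```

<p>把提示语模板固定下来、约束模型只输出标签，后续就能直接用字符串匹配汇总标注结果。</p>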
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型API将文本数据转化为结构化数据</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/"><strong>实验 | 使用本地大模型从论文PDF中提取结构化信息</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></p>
</li>
</ul>
<br>
<br>
<h2 id="相关内容-1">相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-08-02-automating-grounded-theory-development-in-qualitative-research-with-large-language-models/">arXiv2024 | 使用大语言模型自动进行定性研究中的扎根理论开发</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用本地大模型DIY制作单词书教案PDF</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/">实验 | 使用本地大模型从论文PDF中提取结构化信息</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></p>
</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>LLM数据标注：是否胜过人类？</title>
      <link>https://textdata.cn/blog/2024-08-04-label-text-data-with-large-language-model/</link>
      <pubDate>Sun, 04 Aug 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-04-label-text-data-with-large-language-model/</guid>
<description>数据科学家花费 80% 以上的时间来准备数据，这其中主要是数据清洗、数据标注。随着 GPT-4 等大型语言模型 (LLM) 的兴起，现在我们可以更高效地完成这些准备工作。在本文中，我们将探讨如何使用 LLM 进行数据标注，以提高文本标注的准确性、效率和可扩展性，并最终为 ML 项目带来更好的结果。 Data scientists spend over 80% of their time preparing data, including data labeling. With the rise of Large Language Models (LLMs) like GPT-4, we now have the tools to streamline this process significantly.In this article, we’ll explore how to use LLM for data labeling to enhance the accuracy, efficiency, and scalability of text annotations and ultimately drive better outcomes for ML projects.</description>
<content:encoded><![CDATA[<p>数据科学家花费 80% 以上的时间来准备数据，这其中主要是数据清洗、数据标注。随着 GPT-4 等大型语言模型 (LLM) 的兴起，现在我们可以更高效地完成这些准备工作。在本文中，我们将探讨如何使用 LLM 进行数据标注，以提高文本标注的准确性、效率和可扩展性，并最终为 ML 项目带来更好的结果。</p>
<p>近期LLM推文</p>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-08-02-automating-grounded-theory-development-in-qualitative-research-with-large-language-models/">arXiv2024 | 使用大语言模型自动进行定性研究中的扎根理论开发</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用本地大模型DIY制作单词书教案PDF</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/">实验 | 使用本地大模型从论文PDF中提取结构化信息</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="一llm数据标注流程">一、LLM数据标注流程</h2>
<p><img loading="lazy" src="https://labelyourdata.com/img/article-illustrations/human_vs._LLM_annotation_steps.jpg" alt="人类与 LLM 数据标注步骤"  />
</p>
<p>让我们将其与传统的人工标注过程进行比较，以更好地理解 LLM 数据标注的工作原理。</p>
<p>首先，您必须根据项目目标定义所需的标注任务和架构。例如，在命名实体识别中，架构将包括 <em><strong>人Person</strong></em>、<em><strong>组织Org</strong></em>、<em><strong>位置Location</strong></em>、<em><strong>日期Date</strong></em> 等标签。接下来，人工标注者按照既定的标注规范对原始数据进行标注。</p>
<p>而使用 LLM 进行数据标注， 流程如下：</p>
<ol>
<li><strong>模型选择</strong> ：选择一个 LLM（如，在线ChatGPT、离线Llama）并对其进行配置（例如，设置温度参数）。</li>
<li><strong>预处理</strong> ：创建一个提示，指导 LLM 完成标记任务，并在需要时包含标记的示例。</li>
<li><strong>调用 LLM API</strong>：通过 API 将提示发送给 LLM 进行大规模注释。确保提示在 LLM 的令牌限制范围内。</li>
<li><strong>后期处理</strong>：解析 LLM 的响应，提取标签，并将其映射到您的架构。由于自由文本输出中可能存在噪音，因此此步骤可能具有挑战性。</li>
</ol>
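<p>上面第 4 步的“后期处理”可以写成一个小函数。下面是一个示意性实现（函数名 <code>parse_label_response</code> 为本文自拟，假设模型按提示以 JSON 格式作答）：</p>

```python
import json
import re

FENCE = "`" * 3  # 三个反引号，即 Markdown 代码块围栏

def parse_label_response(text):
    """从 LLM 的自由文本响应中提取 JSON 标签。
    优先解析代码块围栏包裹的 JSON；失败则回退到首尾花括号之间的片段。"""
    m = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", text, re.S)
    candidate = m.group(1) if m else None
    if candidate is None:
        # 回退策略：取第一个 { 到最后一个 } 之间的内容
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            candidate = text[start:end + 1]
    if candidate is None:
        return None
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

resp = f'好的，结果如下：\n{FENCE}json\n{{"label": "Person", "span": "马云"}}\n{FENCE}'
print(parse_label_response(resp))  # {'label': 'Person', 'span': '马云'}
```

<p>由于模型输出格式不稳定，这类解析函数最好准备多种回退策略，并记录解析失败的样本以便人工复核。</p>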
<p>通过这些步骤，我们就可以用 LLM 进行数据标注，减少对人工标注者的依赖同时还能保持较高的准确性、客观性。</p>
<p><br><br></p>
<h2 id="二llm的优点">二、LLM的优点</h2>
<p>LLM 对数据标注的优点</p>
<ul>
<li><strong>标记任务的自动化</strong>： LLM 可以自动化和加快数据标注过程，显著减少手动标注所需的时间和精力。</li>
<li><strong>提高准确性和一致性</strong> ： LLM 通过从大型数据集中学习复杂模式，在标注数据中实现更高的准确性和一致性，超越传统的基于规则的系统。</li>
<li><strong>可扩展性</strong>： LLM 具有可扩展性优势，可有效处理大型数据集并在不同量的数据中保持性能。</li>
<li><strong>适应性</strong>  ：LLM 用途广泛，能够处理多种数据类型，包括文本、图像和音频，适用于各种应用程序。</li>
<li><strong>持续改进</strong>： LLM 通过更新新数据和反馈不断提高其性能，确保其长期有效性。</li>
</ul>
<p><br><br></p>
<h2 id="三常见的llm">三、常见的LLM</h2>
<p>市面上的大模型有很多， 但大邓用过的且觉得不错的，推荐如下。</p>
<ul>
<li><strong>OpenAI GPT-4</strong>（商业）：以其先进的语言理解和生成能力而闻名，使其对于各种数据标注任务非常有效。</li>
<li><strong>Meta的LLaMA</strong>（开源）：最新的LLaMA 3.1 405B表现超过GPT-4商业版。 <strong>可本地离线部署， 数据安全性高</strong>。</li>
<li><strong>阿里的Qwen</strong>（开源）：中文的开源大模型， 表现超过GPT3.5； <strong>可本地离线部署， 数据安全性高</strong>。</li>
</ul>
<p><br><br></p>
<h2 id="四llm数据标注任务类型">四、LLM数据标注任务类型</h2>
<p>LLM 仍在发展，但大量研究表明这些模型对于自动化数据标注非常有用。</p>
<p>研究发现，使用 LLM（特别是 Flan-UL2 和 Mistral-7B）有助于<a href="https://arxiv.org/html/2403.03334v1">生成用于 YouTube 评论立场分类的弱标签</a>：LLM 在判断立场方面实现了高精度，再结合数据编程模型中的其他弱信号，产生了稳健的最终立场标签，大大提高了标注过程的整体质量和效率。另一项<a href="https://arxiv.org/html/2405.06093v1">研究</a>分别使用人工标注数据和 LLM 标注数据微调模型，发现用 LLM 标注数据微调的模型性能接近用人工标注数据微调的模型。这种方法在保持高准确度的同时显著减少了对人工标注的依赖，证明了 LLM 能够有效自动化和简化标注工作流程的潜力。</p>
<br>
<p>大型语言模型 (LLM) 在处理自动数据标注方面用途广泛。其先进的语言处理能力使它们能够在 LLM 数据注释中执行一些关键任务：</p>
<ul>
<li><strong>命名实体识别 (NER)</strong>： LLM 可识别和标记文本数据中的人员、组织、地点、日期等的名称。这对于从大型数据集中提取特定实体至关重要。</li>
<li><strong>情感分析</strong> ：LLM 分析文本数据中的情绪，将其归类为积极、消极或中性。这对于理解文本中的观点和态度很有用。</li>
<li><strong>意图检测</strong>： LLM 确定文本背后的意图，将其分为问题、请求或命令等类别。这对于自然语言理解 (NLU) 系统至关重要。</li>
<li><strong>词性 (POS) 标记</strong>： LLM 为句子中的单词分配语法标记，指示其句法角色，例如名词、动词或形容词。这对于解析和句法分析至关重要。</li>
<li><strong>语义角色标注 (SRL)</strong>： LLM 识别实体相对于句子中主要动词所扮演的角色，例如施事者或受事者。这有助于理解句子结构和含义。</li>
<li><strong>主题分类</strong>： LLM 根据内容将文本数据分类到预定义的主题中。这有助于文档分类和内容推荐。</li>
<li><strong>数据提取</strong>： LLM 提取关键数据点，例如事件、参与者、时间和地点。它们还检测和标记时间表达，例如日期和持续时间。此功能对于信息检索、事件跟踪和处理与时间相关的数据至关重要。</li>
</ul>
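<p>以上任务中，LLM 返回的标签文本往往不统一（大小写、标点、中英文混杂），需要归一化后才能映射到固定模式。下面是一个把自由文本情感标签归一化到固定类别的示意函数（映射表是假设的示例，实际使用时需按模型输出情况扩充）：</p>

```python
def normalize_sentiment(raw):
    """将 LLM 返回的情感标签归一化为 positive/negative/neutral 三类。
    映射表为示意，未覆盖的标签统一归为 unknown。"""
    mapping = {
        "positive": "positive", "积极": "positive", "正面": "positive",
        "negative": "negative", "消极": "negative", "负面": "negative",
        "neutral": "neutral", "中性": "neutral", "中立": "neutral",
    }
    # 去掉首尾空白与常见标点，再统一为小写后查表
    key = raw.strip().strip("。.!！").lower()
    return mapping.get(key, "unknown")

print(normalize_sentiment("积极。"))    # positive
print(normalize_sentiment("Negative"))  # negative
```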
<p><br><br></p>
<h2 id="五llm数据标注的最佳实践原则">五、LLM数据标注的最佳实践原则</h2>
<p><img loading="lazy" src="https://labelyourdata.com/img/article-illustrations/human-LLM_annotation_process.jpg" alt="Human-LLM 数据标注流程"  />
</p>
<p>Human-LLM 数据标注流程</p>
<p>为了充分利用 LLM 进行数据标注，请遵循以下可提高性能和准确性的最佳实践：</p>
<h3 id="51-提示工程">5.1 提示工程</h3>
<p>选择正确的提示对于提高 LLM 标签至关重要。平衡描述性说明和清晰度。使用：</p>
<ul>
<li><strong>零样本提示</strong>：提供简单的、针对特定任务的说明和示例。</li>
<li><strong>少量提示</strong>：将人类指令与标记示例相结合，以提高注释准确性。</li>
</ul>
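<p>少样本提示的拼接可以用一个简单函数示意（示例文本与标签均为演示用的假设数据）：</p>

```python
def build_few_shot_prompt(instruction, examples, text):
    """拼接少样本提示：任务说明 + 若干已标注示例 + 待标注文本。
    标签格式可按自己的标注规范调整。"""
    lines = [instruction, ""]
    for sample, label in examples:
        lines.append(f"文本：{sample}")
        lines.append(f"标签：{label}")
        lines.append("")
    lines.append(f"文本：{text}")
    lines.append("标签：")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "请判断下面文本的情感类别（positive/negative/neutral）。",
    [("物流很快，质量很好", "positive"), ("客服态度太差了", "negative")],
    "包装一般，但还能接受",
)
print(prompt)
```

<p>把示例留空（<code>examples=[]</code>）即退化为零样本提示，便于对比两种提示方式的标注效果。</p>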
<br>
<h3 id="52-模型选择和微调">5.2 模型选择和微调</h3>
<p>为您的任务选择合适的 LLM ， 如果条件允许建议使用微调后的LLM ， 可确保更好的性能并减少偏见。</p>
<ul>
<li><strong>模型选择</strong>：根据任务需求选择合适的LLM。</li>
<li><strong>LLM 微调</strong>：选择正确的LLM微调方法，使用特定领域的数据训练模型以获得更好的结果。</li>
</ul>
<br>
<h3 id="53-工具集成">5.3 工具集成</h3>
<p>将 LLM 与现有的数据注释工具和平台相结合，以简化工作流程。</p>
<ul>
<li><strong>无缝集成</strong>：确保与当前注释工具的兼容性。</li>
<li><strong>工作流自动化</strong>：自动化标注过程的部分内容以提高效率。</li>
<li><strong>数据管理</strong>：使用集成平台更有效地处理数据并保持一致性。</li>
</ul>
<br>
<h3 id="54-人类监督">5.4 人类监督</h3>
<p>融入人类专业知识以增强LLM性能表现：</p>
<ul>
<li><strong>有人介入（human-in-the-loop）</strong>：将 LLM 预标注与人工细化相结合，以获得更高的准确性。</li>
<li><strong>反馈机制</strong>：使用人工和自动反馈循环不断提高模型性能。</li>
</ul>
<br>
<h3 id="55-模型参数优化">5.5 模型参数优化</h3>
<p>调整模型参数有助于优化LLM的输出质量和对特定任务的适应性。</p>
<ul>
<li><strong>温度设置：</strong>微调温度设置以控制输出的随机性，数值越大越随机。</li>
<li><strong>其他参数：</strong>调整其他相关参数以适合特定任务。</li>
</ul>
<br>
<h3 id="56-评估llm-标注表现">5.6 评估LLM 标注表现</h3>
<p>定期根据基准评估 LLM 标注表现：</p>
<ul>
<li><strong>综合评价：</strong>使用人工评审、“图灵测试”等方法检验标注结果的准确性和可靠性。</li>
<li><strong>特定任务指标：</strong>针对不同的应用场景应用适当的指标，确保标注多样化且可靠。</li>
</ul>
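<p>评估 LLM 标注与人工标注的一致性时，除了准确率，常用 Cohen's kappa 系数来扣除随机一致的影响。下面是一个纯 Python 的示意实现（示例标签为演示用数据）：</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """计算两组标注（如 LLM 与人工）之间的 Cohen's kappa 一致性系数。"""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # 观察一致率
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # 期望一致率：按两组标注各自的类别边际分布计算
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if pe == 1:
        return 1.0
    return (po - pe) / (1 - pe)

llm_labels   = ["pos", "neg", "neg", "pos", "neu", "pos"]
human_labels = ["pos", "neg", "pos", "pos", "neu", "pos"]
print(round(cohens_kappa(llm_labels, human_labels), 3))  # 0.714
```

<p>一般认为 kappa 大于 0.8 表示一致性很高；实际项目中也可以直接用 <code>sklearn.metrics.cohen_kappa_score</code>。</p>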
<p>通过遵循这些最佳实践，您可以最大限度地提高 LLM 数据标注的效率和准确性。</p>
<p><br><br></p>
<h2 id="六llm数据标注面临的挑战">六、LLM数据标注面临的挑战</h2>
<p><img loading="lazy" src="https://labelyourdata.com/img/article-illustrations/prompting_LLMs.jpg" alt="提示 LLM 进行情感分析"  />
</p>
<p>为了有效地使用 LLM 进行数据标注，解决固有的挑战至关重要：</p>
<ul>
<li><strong>准确性</strong>：确保高准确性至关重要。LLM 可以处理基本标注，但仍需要全面的质检（QA）来审查边缘情况，即上下文或含义模糊、复杂的样本，这类样本的准确标注更具挑战性。</li>
<li><strong>偏见与公平</strong>： LLM 可能会继承其训练数据中存在的偏见，这可能会导致标记数据产生不公平的结果。解决这些偏见对于确保标注过程公平公正至关重要。</li>
<li><strong>数据隐私</strong>：维护数据隐私和安全是 LLM 数据标注的重中之重。确保在整个数据标注过程中保护敏感信息对于遵守数据保护法规和与利益相关者建立信任至关重要。</li>
<li><strong>成本和资源管理</strong>：部署 LLM 进行数据标注可能需要大量资源，需要大量计算能力和相关成本。有效管理这些资源对于平衡性能和成本效益至关重要。</li>
<li><strong>文本数据限制</strong>：虽然 LLM 主要用于文本数据，但对于其他数据类型（例如图像或音频），其效率较低。此限制需要集成其他工具或模型来处理各种数据类型。</li>
<li><strong>持续维护</strong>： LLM 需要定期更新和重新训练，以保持高质量的标注。这种持续的维护可确保模型在出现新数据和新需求时保持最新和有效。</li>
<li><strong>过度自信</strong>： LLM 有时会以较高的确定性提供错误的标签，从而破坏标注数据的可靠性。实施不确定性估计和人工监督机制可以帮助缓解这一问题。</li>
</ul>
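<p>针对上面提到的“过度自信”与人工监督问题，常见做法是按置信度把标注结果分流：高置信度自动接受，低置信度送人工复核。以下是一个示意性分流函数（<code>confidence</code> 字段是假设的，需要由模型输出或不确定性估计提供）：</p>

```python
def route_for_review(records, threshold=0.8):
    """按置信度分流 LLM 标注结果：
    confidence >= threshold 的自动接受，其余送人工复核。"""
    auto, review = [], []
    for rec in records:
        (auto if rec["confidence"] >= threshold else review).append(rec)
    return auto, review

records = [
    {"text": "服务很好", "label": "positive", "confidence": 0.95},
    {"text": "还行吧", "label": "neutral", "confidence": 0.55},
]
auto, review = route_for_review(records)
print(len(auto), len(review))  # 1 1
```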
<p>克服这些挑战将有助于您的 LLM 数据标注系统保持公平、可靠和负责。</p>
<p><br><br></p>
<h2 id="七总结">七、总结</h2>
<p>我们可以期待下一代 LLM 为数据标注任务带来重大改进。增强的适应性将使未来的 LLM 能够处理更广泛的数据类型，包括文本、图像和音频。此外，即将到来的进步将侧重于减少 LLM 中的固有偏见。</p>
<p>LLM 在数据标注方面的潜在新应用将包括跨领域标注和实时数据注释。此外，个性化学习模型将变得更加普遍，使 LLM 能够适应特定的行业需求并为数据标注任务提供量身定制的解决方案。</p>
<p>让我们回顾一下使用 LLM 进行数据标注的要点：</p>
<ul>
<li>LLM 数据标注非常适合预算有限的项目和以一致性为关键的客观任务。但是，它可能不适合主观任务，因为对正确标签的看法可能会有很大差异。</li>
<li>严格评估您的 LLM 数据标注结果，检查是否存在偏见和其他问题，并结合您的项目的背景和影响，判断潜在错误是否可以接受。</li>
<li>避免完全依赖 LLM 取代人工标注者，因为这可能导致不准确。对于医疗保健等关键应用，可以用 LLM 数据标注来加快速度，但始终要请人工专家来验证和更正标签。</li>
</ul>
<p><br><br></p>
<h2 id="八qa">八、Q&amp;A</h2>
<h3 id="81-llm可以标注数据吗">8.1 LLM可以标注数据吗？</h3>
<p>是的，LLM可以利用其高级语言理解能力对文本进行分类和注释，从而标注数据。但是，通常需要人工监督来审查极端情况，并确保高准确性。</p>
<br>
<h3 id="82-如何选择正确的-llm-数据标注模型">8.2 如何选择正确的 LLM 数据标注模型？</h3>
<p>在选择用于数据标注的 LLM 时，请考虑任务的具体要求，例如数据类型、注释的复杂性以及所需的准确性。根据不同模型在类似任务上的表现、可扩展性以及与现有工作流程集成的难易程度来评估它们。</p>
<br>
<h3 id="83-如何应对-llm-数据标注中的偏见和数据隐私挑战">8.3 如何应对 LLM 数据标注中的偏见和数据隐私挑战？</h3>
<p>解决偏见问题需要定期评估 LLM 输出的公平性并实施偏见缓解策略。为了保护数据隐私，您的数据处理流程必须符合相关法规和最佳实践。使用匿名化技术和安全的数据存储解决方案在整个数据标记过程中保护敏感信息。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用本地大模型从论文PDF中提取结构化信息</title>
      <link>https://textdata.cn/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/</link>
      <pubDate>Sat, 03 Aug 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-03-literature-document-parsing-using-large-language-models-with-code/</guid>
      <description>非结构文本、图片、视频等数据是待挖掘的数据矿藏， 在经管、社科等研究领域中谁拥有了从非结构提取结构化信息的能力，谁就拥有科研上的数据优势。正则表达式是一种强大的文档解析工具，但它们常常难以应对现实世界文档的复杂性和多变性。而随着chatGPT这类LLM的出现，为我们提供了更强大、更灵活的方法来处理多种类型的文档结构和内容类型。For many years, regular expressions have been my go-to tool for parsing documents, and I am sure it has been the same for many other technical folks and industries.Even though regular expressions are powerful and successful in some case, they often struggle with the complexity and variability of real-world documents.Large language models on the other end provide a more powerful, and flexible approach to handle many types of document structures and content types.</description>
      <content:encoded><![CDATA[<p>非结构文本、图片、视频等数据是待挖掘的数据矿藏， 在经管、社科等研究领域中谁拥有了<em><strong>从非结构提取结构化信息的能力</strong></em>，谁就拥有科研上的数据优势。正则表达式是一种强大的文档解析工具，但它们常常难以应对现实世界文档的复杂性和多变性。而随着chatGPT这类LLM的出现，为我们提供了更强大、更灵活的方法来处理多种类型的文档结构和内容类型。</p>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">代码 | 使用本地大模型从文本中提取结构化信息</a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用本地大模型DIY制作单词书教案PDF</a></li>
</ul>
<p>为方便理解和实验，今天再新增一个案例，以论文处理场景为例。</p>
<p><br><br></p>
<h2 id="一任务">一、任务</h2>
<p>从海量的论文pdf文件中批量提取出</p>
<ul>
<li>论文标题</li>
<li>出版年份</li>
<li>作者</li>
<li>作者联系方式</li>
<li>摘要</li>
<li>摘要总结</li>
</ul>
<br>
<h3 id="11-为何选择llm而不是正则表达式">1.1 为何选择LLM，而不是正则表达式</h3>
<p>在灵活性、上下文理解能力、维护和可扩展性三方面， 我们对比一下LLM和正则表达式</p>
<table>
<thead>
<tr>
<th>方面</th>
<th>LLM</th>
<th>正则表达式</th>
</tr>
</thead>
<tbody>
<tr>
<td>灵活性</td>
<td>能够自动理解和适应各种文档结构，并且无论位于文档的什么位置，都能够识别相关信息。</td>
<td>需要每个文档结构都有特定的模式，当给定的文档偏离预期的格式时就会失败。</td>
</tr>
<tr>
<td>上下文理解</td>
<td>对每个文档的含义有细致的理解，从而可以更准确地提取相关信息。</td>
<td>无需理解上下文或含义即可匹配模式。</td>
</tr>
<tr>
<td>维护和可扩展性</td>
<td>可以轻松适应新的文档类型，只需在初始提示中进行最少的更改，从而使其更具可扩展性。</td>
<td>需要随着文档格式的变化而不断更新。添加对新类型信息的支持需要编写一个全新的正则表达式。</td>
</tr>
</tbody>
</table>
<p>综上， 选择LLM更适合做「从论文PDF中提取信息」这一任务。</p>
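<p>表中正则表达式的脆弱性可以用一个小例子说明（示意性代码，年份出现的格式纯属假设）：</p>

```python
import re

# 假设年份总是出现在括号中，如 "(2017)"
pattern = re.compile(r"\((19|20)\d{2}\)")

ok = "Vaswani et al. (2017) Attention Is All You Need"
bad = "Attention Is All You Need, 2017, NeurIPS"  # 格式稍有变化就匹配失败

print(bool(pattern.search(ok)))   # True
print(bool(pattern.search(bad)))  # False
```

<p>每遇到一种新格式就要再写一条规则，而 LLM 只需在提示中说明“提取出版年份”即可。</p>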
<br>
<h3 id="12-工作流程">1.2 工作流程</h3>
<p>为了方便实验，让我们以论文处理的场景为例，下图是使用LLM批量提取论文中元信息的工作流程。</p>
<p><img loading="lazy" src="img/00-document-parsing.png" alt=""  />
</p>
<p>工作流程总体上有三个主要组成部分：输入、处理和输出。</p>
<ul>
<li>首先，提交文件（在本例中为PDF格式的科研论文）进行处理。</li>
<li>处理组件的第一个模块从每个 PDF 中提取原始数据，并将其与包含大型语言模型指令的提示相结合，以有效地提取数据。</li>
<li>然后，大型语言模型使用提示来提取所有元数据。</li>
<li>对于每个PDF，最终结果以JSON格式保存，可用于进一步分析。</li>
</ul>
<br>
<br>
<h2 id="二准备工作">二、准备工作</h2>
<h3 id="21-安装ollama">2.1 安装ollama</h3>
<p>点击前往网站 <a href="https://ollama.com/">https://ollama.com/</a> ，下载ollama软件，支持win、Mac、linux</p>
<p><img loading="lazy" src="img/02-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="22-下载llm">2.2 下载LLM</h3>
<p>ollama软件目前支持多种大模型， 如阿里的（qwen、qwen2）、meta的(llama3、llama3.1)，  本文选择最近新出的模型 llama3.1</p>
<p><img loading="lazy" src="img/03-ollama-model.png" alt=""  />
</p>
<br>
<p>以llama3.1为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就试着安装，不合适不能用再删除即可。</p>
<p><img loading="lazy" src="img/04-ollama-llama3.png" alt=""  />
</p>
<br>
<p>打开电脑命令行cmd(mac是terminal)，保持联网状态，执行模型下载(安装)命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3.1
</code></pre></div><p>等待 <strong>llama3.1:8b</strong> 下载完成。</p>
<br>
<h3 id="23-安装python包">2.3 安装python包</h3>
<p>在python中调用ollama服务，需要ollama包。</p>
<p>打开电脑命令行cmd(mac是terminal)，保持联网状态，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
</code></pre></div><br>
<h3 id="24-启动ollama服务">2.4 启动ollama服务</h3>
<p>在Python中调用本地ollama服务，需要先启动本地ollama服务， 打开电脑命令行cmd(mac是terminal), 执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2024/08/03 14:52:24 routes.go:1011: INFO server config env=&#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&#34;
time=2024-08-03T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&#34;total blobs: 18&#34;
time=2024-08-03T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&#34;total unused blobs removed: 0&#34;
time=2024-08-03T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&#34;Listening on 127.0.0.1:11434 (version 0.1.44)&#34;
time=2024-08-03T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&#34;extracting embedded files&#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-08-03T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&#34;Dynamic LLM libraries [metal]&#34;
time=2024-08-03T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&#34;inference compute&#34; id=0 library=metal compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。</p>
<br>
<br>
<h2 id="三实验">三、实验</h2>
<h3 id="31-代码结构">3.1 代码结构</h3>
<p>点击下载本文 <a href="project.zip">实验代码</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">project
   |
  - Extract_Metadata_With_Large_Language_Models.ipynb
  - prompts
       |--- scientific_papers_prompt.txt
  - data
      |--- 1706.03762v7.pdf
      |--- 2301.09056v1.pdf
  - extracted_metadata/
</code></pre></div><br>
<ul>
<li><em><strong>project文件夹</strong></em> 是根文件夹，包含 <em><strong>ipynb代码文件</strong></em>、 <em><strong>prompts文件夹</strong></em>、<em><strong>data文件夹</strong></em>、<em><strong>extracted_metadata文件夹</strong></em></li>
<li><em><strong>prompts文件夹</strong></em> 有txt文件格式的提示信息</li>
<li><em><strong>data文件夹</strong></em> 存储着实验论文pdf数据</li>
<li><em><strong>extracted_metadata文件夹</strong></em> 目前为空，将存储从论文pdf中提取的元信息，以 json 文件格式存储</li>
</ul>
<br>
<h3 id="32-提示工程">3.2 提示工程</h3>
<p>我们需要从论文pdf中提取</p>
<ul>
<li>论文标题</li>
<li>出版年份</li>
<li>作者</li>
<li>作者联系方式</li>
<li>摘要</li>
<li>摘要总结</li>
</ul>
<p>这是我设计的提示， 该提示存储在 <em><strong>prompts/scientific_papers_prompt.txt</strong></em> 中。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">科学研究论文</span><span class="err">：</span>
<span class="o">---</span> 
<span class="p">{</span><span class="n">document</span><span class="p">}</span> 
<span class="o">---</span>

<span class="n">您是分析科学研究论文的专家</span><span class="err">。</span> <span class="n">请仔细阅读上面提供的研究论文</span><span class="err">，</span><span class="n">并提取以下关键信息</span><span class="err">：</span>

<span class="n">从研究论文中提取以下六</span> <span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="n">个属性</span><span class="err">：</span>
<span class="o">-</span> <span class="n">论文标题</span><span class="err">：</span><span class="n">研究论文的全名</span>
<span class="o">-</span> <span class="n">出版年份</span><span class="err">：</span><span class="n">论文发表的年份</span>
<span class="o">-</span> <span class="n">作者</span><span class="err">：</span><span class="n">论文所有作者的全名</span>
<span class="o">-</span> <span class="n">作者联系方式</span><span class="err">：</span><span class="n">字典列表</span><span class="err">，</span><span class="n">其中每个字典包含每个作者的以下键</span><span class="err">：</span>
  <span class="o">-</span> <span class="n">姓名</span><span class="err">：</span><span class="n">作者的全名</span>
  <span class="o">-</span> <span class="n">机构</span><span class="err">：</span><span class="n">作者的机构隶属关系</span>
  <span class="o">-</span> <span class="n">电子邮件</span><span class="err">：</span><span class="n">作者的电子邮件地址</span><span class="err">（</span><span class="n">如果提供</span><span class="err">）</span>
<span class="o">-</span> <span class="n">摘要</span><span class="err">：</span><span class="n">论文摘要的全文</span>
<span class="o">-</span> <span class="n">摘要总结</span><span class="err">：</span><span class="n">用</span> <span class="mi">2</span><span class="o">-</span><span class="mi">3</span> <span class="n">句话简洁地总结摘要</span><span class="err">，</span><span class="n">突出重点</span>

<span class="n">指南</span><span class="err">：</span>
<span class="o">-</span> <span class="n">提取的信息应属实</span><span class="err">，</span><span class="n">并准确无误</span><span class="err">。</span>
<span class="o">-</span> <span class="n">除摘要外</span><span class="err">，</span><span class="n">应极其简洁</span><span class="err">，</span><span class="n">摘要应完整复制</span><span class="err">。</span>
<span class="o">-</span> <span class="n">提取的实体应该是独立的</span><span class="err">，</span><span class="n">并且不需要论文的其余部分就能轻松理解</span><span class="err">。</span>
<span class="o">-</span> <span class="n">如果论文中缺少任何属性</span><span class="err">，</span><span class="n">请将该字段留空</span><span class="err">，</span><span class="n">而不是猜测</span><span class="err">。</span>
<span class="o">-</span> <span class="n">对于摘要总结</span><span class="err">，</span><span class="n">重点介绍研究的主要目标</span><span class="err">、</span><span class="n">方法和主要发现</span><span class="err">。</span>
<span class="o">-</span> <span class="n">对于作者联系方式</span><span class="err">，</span><span class="n">请为每个作者创建一个条目</span><span class="err">，</span><span class="n">即使缺少一些信息</span><span class="err">。</span><span class="n">如果没有提供作者的电子邮件或机构</span><span class="err">，</span><span class="n">请在字典中将该字段留空</span><span class="err">。</span>

<span class="n">以</span> <span class="n">JSON</span> <span class="n">格式回答</span><span class="err">。</span> <span class="n">JSON</span> <span class="n">应包含</span> <span class="mi">6</span> <span class="n">个键</span><span class="err">：</span><span class="s2">&#34;PaperTitle&#34;</span><span class="p">,</span> <span class="s2">&#34;PublicationYear&#34;</span><span class="p">,</span> <span class="s2">&#34;Authors&#34;</span><span class="p">,</span> <span class="s2">&#34;AuthorContact&#34;</span><span class="p">,</span> <span class="s2">&#34;Abstract&#34;</span><span class="p">,</span> <span class="s2">&#34;SummaryAbstract&#34;</span><span class="err">。</span> <span class="s2">&#34;AuthorContact&#34;</span><span class="n">字段应该是字典列表格式</span><span class="err">。</span>
</code></pre></div><br>
<h3 id="32-提取信息">3.2 提取信息</h3>
<p>读取 <em><strong>data/1706.03762v7.pdf</strong></em>， 提取该论文首页中感兴趣的6个信息，如</p>
<p><img loading="lazy" src="img/6-paper.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>  
<span class="c1">#pip3 install cntext==2.1.7</span>

<span class="c1">#我们感兴趣的信息在论文的第一页，所以这里粗糙的选择前4000个字符。</span>
<span class="n">paper_content</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;data/1706.03762v7.pdf&#39;</span><span class="p">)[:</span><span class="mi">4000</span><span class="p">]</span>
<span class="n">prompt_content</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompts/scientific_papers_prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3.1:8b&#39;</span><span class="p">,</span> 
                       <span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
                           <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt_content</span><span class="p">},</span>
                           <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">paper_content</span><span class="p">}</span>
                       <span class="p">])</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;```</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">```&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 3.5 ms, sys: 2.13 ms, total: 5.63 ms
Wall time: 11.8 s


{&#39;PaperTitle&#39;: &#39;Attention Is All You Need&#39;,
 &#39;PublicationYear&#39;: 2017,
 &#39;Authors&#39;: [&#39;Ashish Vaswani&#39;,
  &#39;Noam Shazeer&#39;,
  &#39;Niki Parmar&#39;,
  &#39;Jakob Uszkoreit&#39;,
  &#39;Llion Jones&#39;,
  &#39;Aidan N. Gomez&#39;,
  &#39;Łukasz Kaiser&#39;,
  &#39;Illia Polosukhin&#39;],
 &#39;AuthorContact&#39;: [{&#39;Name&#39;: &#39;Ashish Vaswani&#39;,
   &#39;Institution&#39;: &#39;Google Brain&#39;,
   &#39;Email&#39;: &#39;avaswani@google.com&#39;},
  {&#39;Name&#39;: &#39;Noam Shazeer&#39;,
   &#39;Institution&#39;: &#39;Google Brain&#39;,
   &#39;Email&#39;: &#39;noam@google.com&#39;},
  {&#39;Name&#39;: &#39;Niki Parmar&#39;,
   &#39;Institution&#39;: &#39;Google Research&#39;,
   &#39;Email&#39;: &#39;nikip@google.com&#39;},
  {&#39;Name&#39;: &#39;Jakob Uszkoreit&#39;,
   &#39;Institution&#39;: &#39;Google Research&#39;,
   &#39;Email&#39;: &#39;usz@google.com&#39;},
  {&#39;Name&#39;: &#39;Llion Jones&#39;,
   &#39;Institution&#39;: &#39;Google Research&#39;,
   &#39;Email&#39;: &#39;llion@google.com&#39;},
  {&#39;Name&#39;: &#39;Aidan N. Gomez&#39;,
   &#39;Institution&#39;: &#39;University of Toronto&#39;,
   &#39;Email&#39;: &#39;aidan@cs.toronto.edu&#39;},
  {&#39;Name&#39;: &#39;Łukasz Kaiser&#39;,
   &#39;Institution&#39;: &#39;Google Brain&#39;,
   &#39;Email&#39;: &#39;lukaszkaiser@google.com&#39;},
  {&#39;Name&#39;: &#39;Illia Polosukhin&#39;,
   &#39;Institution&#39;: &#39;&#39;,
   &#39;Email&#39;: &#39;illia.polosukhin@gmail.com&#39;}],
 &#39;Abstract&#39;: &#39;The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.&#39;,
 &#39;SummaryAbstract&#39;: &#39;本文提出了一种新的Transformer模型，基于注意力机制，抛弃了递归和卷积等复杂方法。该模型在机器翻译任务上表现出优异的效果，并且可以更好地并行化和训练。&#39;}
</code></pre></div><p>从运行结果看， 摘要<em><strong>Abstract</strong></em> 的提取不够准确，有一定的遗漏。</p>
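<p>另外，上面代码用 <code>eval</code> 解析模型返回的内容，既有安全隐患，遇到不合法内容也容易直接抛错。下面是一个改用 <code>json.loads</code> 的替代写法示意（假设模型按提示返回代码块围栏包裹的合法 JSON）：</p>

```python
import json

FENCE = "`" * 3  # 三个反引号，即 Markdown 代码块围栏

def safe_parse(result):
    """用 json.loads 替代 eval 解析 LLM 返回的 JSON，
    解析失败时返回 None 而不是抛出异常。"""
    if FENCE in result:
        # 取出围栏之间的内容，并去掉可能的语言标记（如 json）
        result = result.split(FENCE)[1]
        if result.startswith("json"):
            result = result[4:]
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        return None

demo = f'{FENCE}json\n{{"PaperTitle": "Attention Is All You Need", "PublicationYear": 2017}}\n{FENCE}'
print(safe_parse(demo)["PaperTitle"])  # Attention Is All You Need
```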
<br>
<h3 id="33-封装成函数extract_info">3.3 封装成函数extract_info</h3>
<p>实验成功，我们将其封装为函数<em><strong>extract_info</strong></em>。因为LLM返回内容的格式存在不确定性，为了让函数尽可能成功地运行出结果，这里设置了异常重试机制。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>  


<span class="k">def</span> <span class="nf">extract_info</span><span class="p">(</span><span class="n">paper_content</span><span class="p">,</span> <span class="n">prompt_content</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
                <span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3.1:8b&#39;</span><span class="p">,</span>
                <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt_content</span><span class="p">},</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">paper_content</span><span class="p">}</span>
                <span class="p">]</span>
            <span class="p">)</span>

            <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
            <span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;```</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">```&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
            <span class="k">return</span> <span class="n">result</span>
        
        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">attempt</span> <span class="o">&lt;</span> <span class="n">max_retries</span><span class="p">:</span>
                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;An error occurred: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">. Retrying (</span><span class="si">{</span><span class="n">attempt</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">)...&#34;</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">raise</span> <span class="n">e</span>


<span class="c1">#我们感兴趣的信息在论文的第一页，所以这里粗糙的选择前4000个字符。</span>
<span class="n">paper_content</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;data/1706.03762v7.pdf&#39;</span><span class="p">)[:</span><span class="mi">4000</span><span class="p">]</span>
<span class="n">prompt_content</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompts/scientific_papers_prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">extract_info</span><span class="p">(</span><span class="n">paper_content</span><span class="p">,</span> <span class="n">prompt_content</span><span class="p">)</span>
<span class="n">result</span>
</code></pre></div><p>运行结果与之前无异，为节约版面空间，这里就不展示result了。</p>
<br>
<h3 id="34-批量提取">3.4 批量提取</h3>
<p>假设data文件夹内有成百上千篇论文PDF(实际上data中只有两篇)，对data文件夹进行批量信息提取，结果以jsonl格式存储到extracted_metadata文件夹。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">jsonlines</span>

<span class="c1">#当前代码所在的代码文件与data文件夹处于同一个文件夹内</span>
<span class="c1">#获取data内所有pdf的路径</span>
<span class="n">pdf_files</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s1">&#39;data/</span><span class="si">{</span><span class="n">file</span><span class="si">}</span><span class="s1">&#39;</span> <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;data&#39;</span><span class="p">)</span> <span class="k">if</span> <span class="s1">&#39;.pdf&#39;</span> <span class="ow">in</span> <span class="n">file</span><span class="p">]</span>
<span class="n">prompt_content</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompts/scientific_papers_prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="k">for</span> <span class="n">pdf_file</span> <span class="ow">in</span> <span class="n">pdf_files</span><span class="p">:</span>
    <span class="n">paper_content</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="n">pdf_file</span><span class="p">)[:</span><span class="mi">4000</span><span class="p">]</span>
    <span class="n">dict_data</span> <span class="o">=</span> <span class="n">extract_info</span><span class="p">(</span><span class="n">paper_content</span><span class="p">,</span> <span class="n">prompt_content</span><span class="p">)</span>
    <span class="n">jsonf</span> <span class="o">=</span> <span class="n">pdf_file</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;data&#39;</span><span class="p">,</span> <span class="s1">&#39;extracted_metadata&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;pdf&#39;</span><span class="p">,</span> <span class="s1">&#39;jsonl&#39;</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">jsonlines</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">jsonf</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">jf</span><span class="p">:</span>
        <span class="n">jf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">dict_data</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 919 ms, sys: 14.8 ms, total: 933 ms
Wall time: 24.6 s
</code></pre></div><p><img loading="lazy" src="img/05-2result-json.png" alt=""  />
</p>
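<p>批量抽取得到的是一个个 .jsonl 文件。若希望把所有抽取结果合并为一张 csv 表，可参考下面的示意代码（假设 jsonl 文件位于 extracted_metadata 文件夹、每行为一个 JSON 对象；文件夹名与字段以实际情况为准）。</p>

```python
import glob
import json

import pandas as pd

# 收集 extracted_metadata 文件夹内所有 .jsonl 文件中的记录
records = []
for jsonf in glob.glob('extracted_metadata/*.jsonl'):
    with open(jsonf, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:                      # 跳过空行
                records.append(json.loads(line))

# 合并为 DataFrame 并导出 csv
result_df = pd.DataFrame(records)
result_df.to_csv('extracted_metadata.csv', index=False, encoding='utf-8')
print(len(result_df))
```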
<p><br><br></p>
<h2 id="四讨论">四、讨论</h2>
<p>本文简要概述了 LLM 在从复杂文档中提取元数据方面的应用，提取的 json 数据可以存储在非关系数据库中以供进一步分析。</p>
<p>LLM 和 Regex 在内容提取方面各有优缺点，应根据用例明智地应用每种方法。希望本简短教程能帮助您获得新技能。</p>
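<p>作为对比，下面给出一个极简的 Regex 抽取示意（文本与字段格式均为演示假设）：正则适合格式高度固定的文本，版式多变时 LLM 更有优势。</p>

```python
import re

# 演示文本与字段格式均为假设；实际发票、论文的版式会复杂得多
text = "Title: Attention Is All You Need\nDate: 2017-06-12\nAuthors: Vaswani et al."

# 正则依赖固定的字段标记，标记一变就需要重写模式
title = re.search(r'Title:\s*(.+)', text).group(1)
date = re.search(r'Date:\s*(\d{4}-\d{2}-\d{2})', text).group(1)

print(title)  # Attention Is All You Need
print(date)   # 2017-06-12
```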
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型将文本数据转化为结构化数据</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别和分值</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/">实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</a></li>
</ul>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>arXiv2024 | 使用大语言模型自动进行定性研究中的扎根理论开发</title>
      <link>https://textdata.cn/blog/2024-08-02-automating-grounded-theory-development-in-qualitative-research-with-large-language-models/</link>
      <pubDate>Fri, 02 Aug 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-08-02-automating-grounded-theory-development-in-qualitative-research-with-large-language-models/</guid>
      <description>在当今的学术界，定性研究因其深入挖掘现象背后的原因和逻辑而备受重视。然而，定性数据的分析往往耗时且成本高昂。现在，随着chatGPT这类大语言模型的问世，这一局面可能即将改变。AcademiaOS是一个创新的开源平台，它利用大型语言模型（LLMs）的能力，自动化地进行扎根理论的开发，为定性研究带来了新的视角。AcademiaOS is a first attempt to automate grounded theory development in qualitative research with large language models. Using recent large language models’ language understanding, generation, and reasoning capabilities, AcademiaOS codes curated qualitative raw data such as interview transcripts and develops themes and dimensions to further develop a grounded theoretical model, affording novel insights. A user study (n=19) suggests that the system finds acceptance in the academic community and exhibits the potential to augment humans in qualitative research. AcademiaOS has been made open-source for others to build upon and adapt to their use cases.</description>
      <content:encoded><![CDATA[<p>扎根理论（Grounded Theory, GT）是由社会学家 Barney Glaser 和 Anselm Strauss 在 1967 年提出的定性研究方法。它强调从数据中产生概念，并通过不断比较数据中的实例来发展这些概念，最终形成一个理论框架。研究过程包括开放式编码、轴心编码和选择性编码等阶段，这些阶段帮助研究者逐步提炼数据并构建理论。</p>
<p>以访谈类数据为例，一个研究一般有几十份访谈，转录和编码一次典型的访谈需要几个小时，而这仅仅是一个开始，研究人员试图理解原始数据并将其转化为有用的东西，以获得洞察力和知识，并发展出可以描述模式和现象的理论。受限于研究者时间、金钱的约束，只能在有限的数据量基础上，利用研究者的智慧进行挖掘和洞察。从认识论角度，扎根理论是一种归纳法，可供归纳的一手原始数据越多，后期定性研究中的理论开发就会越扎实，也更容易出现新的、有趣的、有重量的发现。<strong>随着chatGPT这类大语言模型LLM的出现，扎根理论的约束条件有望被打破，我们可以借助大语言模型，对更大体量的一手数据，进行更高效的定性研究</strong>。</p>
<p>大邓之前进行过LLM的实验， 确信稍微更改下Prompt即可大幅度提高编码阶段的效率。LLM与扎根的结合，是顺理成章的。</p>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/">实验 | 使用本地大模型DIY制作单词书教案PDF</a></li>
</ul>
<br>
<p>以下内容摘自这篇arXiv2024， 并进行了翻译。</p>
<blockquote>
<p>Übellacker, Thomas. &ldquo;AcademiaOS: Automating Grounded Theory Development in Qualitative Research with Large Language Models.&rdquo; <em>arXiv preprint arXiv:2403.08844</em> (2024).</p>
</blockquote>
<p><strong>摘要</strong>:  <a href="https://academia-os.org/">AcademiaOS</a> 是首次尝试使用大型语言模型自动开发定性研究中的扎根理论。利用最新大型语言模型的语言理解、生成和推理能力，AcademiaOS 对精选的定性原始数据（如访谈记录）进行编码，并开发主题和维度以进一步开发扎根理论模型，从而提供新颖的见解。一项用户研究（n=19）表明，该系统在学术界得到了认可，并展现出在定性研究中增强人类能力的潜力。AcademiaOS 已开源，供其他人在此基础上构建并适应他们的用例。</p>
<p><img loading="lazy" src="img/03-cover-screen.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="一扎根理论">一、扎根理论</h2>
<p>研究人员通常遵循既定的编码实践来管理大量非结构化文本源。编码通常涉及系统地生成代码本（Weston 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib39">2001</a>) ) 来编码转录。另一种流行的方法是 「Gioia 方法」（Gioia 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib19">2013</a>) )，研究人员直接从源文档中提取新兴模式和概念，然后按照以下步骤进行汇总和解释。然后，这些开发的代码可以进一步用于定性数据分析和理论开发。 从数据中开发理论模型的概念称为<strong>扎根理论开发</strong>（Chun Tie 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib11">2019</a>) )。</p>
<br>
<h3 id="11-gioa的扎根理论开发">1.1 Gioia的扎根理论开发</h3>
<p>Gioia 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib19">2013</a>)定义了一种透明的流程，用于分析定性数据以从访谈中开发理论模型。他们的流程旨在让研究人员从原始定性数据转向越来越抽象的概念类别，从初始编码开始，研究人员对数据中的相关概念进行编码和下划线，从而得到一个广泛的一阶概念列表，这些概念仍然以源文档的语言陈述。然后，他们使用这些一阶概念来生成一个更抽象的二阶主题列表，这些主题试图用更学术的语言来概括一阶代码的概念。最后，他们将二阶主题聚合成更抽象的“聚合维度”。然后，这些维度被用作开发理论的基础。Gioia等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib19">2013</a>)提到了理解这些概念之间的动态关系的重要性，但尚未提供获得这些关系的具体方法。他们认为，通过遵循这种“Gioia 方法”，研究人员已经足够熟悉基础文献，可以理解这些关系。</p>
<br>
<h3 id="12-eisenhardt的扎根理论开发">1.2 Eisenhardt的扎根理论开发</h3>
<p>扎根理论发展的另一种方法是Eisenhardt（<a href="https://arxiv.org/html/2403.08844v1#bib.bib13">1989</a>)方法，侧重于从案例研究中构建模型。这种方法从案例内分析开始，以熟悉数据并生成初步理论。从那里开始一个高度迭代的过程，Eisenhardt，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib13">1989</a>)称之为“塑造假设”，反复比较数据和开发的结构，并验证开发的结构之间出现的关系是否与数据中的证据相符。他们将案例研究视为实验的复制，要么加强假设，要么削弱假设。</p>
<br>
<h3 id="13-自动化">1.3 自动化</h3>
<p>基于现有文献，很明显，Gioia 等人（<a href="https://arxiv.org/html/2403.08844v1#bib.bib19">2013</a>)和Eisenhardt (<a href="https://arxiv.org/html/2403.08844v1#bib.bib13">1989</a>)为扎根理论的发展提供了一个框架。定性研究任务（包括数据收集和分析）既耗时又昂贵，并且限制了单个研究团队可研究的经验数据。 Kindsiko和 Poltimäe，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib22">2019</a>）支持这一观点，指出实证研究中的样本量取决于资金和研究团队的规模。然而，Bowen（<a href="https://arxiv.org/html/2403.08844v1#bib.bib6">2008</a>)概述了样本量如何影响研究有效性，并建议通过饱和度来限制样本量，即当更多访谈、案例研究或其他样本无法增加重要的新信息时，停止添加这些样本。<strong>现在，我们如何通过增加样本量来增加研究严谨性，同时保持较低的人工投入？答案可能在计算自动化中找到</strong>。</p>
<p>在定量研究中，数据准备和理论开发的自动化是一个被积极研究的课题，其名称包括 “数据挖掘”或“机器学习”，计算机程序从观察中学习以开发数学模型，从而使它们能够以实证主义范式估计未来的情况。然而，定性研究问题伴随着结构化程度较低或可编码性较差的信息，并且依赖于研究人员的知识和解释。同时，随着大型语言模型 (LLM) 的兴起，我们可以使用技术平台，将对文本数据的计算理解和推理范式转变为接近人类的水平，并结合广泛的一般知识。这个新技术平台提供了一个大规模模拟明确定义的研究过程的机会。对于单个研究人员来说，编写 100 份访谈记录之类的任务非常耗时。假设通过适当的设置，LLM 可以在几分钟内并行处理所有记录。组织理论领域的研究人员可能会考虑使用两三个案例研究来开发理论模型。当在案例研究中寻找实证证据是一个自动化、可并行的过程时，使用 20 - 30 个不同案例研究的障碍就会大大减少，从而为更多具有统计相关性的定性研究提供机会。</p>
<p>因此，利用 LLM 实现定性研究过程部分自动化的潜力值得探索。本文探讨了以下研究问题：“如何有效地设计和实施基础开源平台，以利用大型语言模型来自动化扎根理论开发？”为此， <a href="https://github.com/thomasuebi/academia-os">AcademiaOS</a>被提出并实施为一个开源平台，用于自动化或增强扎根理论开发任务，例如编码、维度聚合和理论开发。AcademiaOS 为科学界提供了一种进行定性研究的新方法，该方法透明、可访问且可扩展（通过其开源特性），并且通过同时并行分析多个定性来源的成本效益来提供更广泛的证据。该系统可能会对社会科学产生深远影响，特别是在组织理论领域，但也会对定性数据相关的其他学科产生深远影响。</p>
<p><img loading="lazy" src="img/01-academiaos-user-interface.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二早期研究">二、早期研究</h2>
<p>已经有人尝试过自动化定性分析。Berente等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib3">2019</a>)致力于开发一种计算密集型扎根理论发展的理论过程，提出了一种理论计算方法，以自动化扎根理论发展的以下四个步骤：（1）采样和数据收集，（2）同步分析，（3）词汇框架，（4）使用基于人工智能的工具进行历时分析。他们将计算过程描述为围绕预定义但动态的词汇展开，而不是同步“编码”新兴概念。他们建议使用分类法来挖掘概念。</p>
<p>Marathe 和 Toyama（<a href="https://arxiv.org/html/2403.08844v1#bib.bib28">2018</a>)讨论了基于预定义的人工注释代码本自动对访谈进行编码的可能性。Lennon等人也实施了类似的方法（<a href="https://arxiv.org/html/2403.08844v1#bib.bib25">2021</a>)，根据他们自己的分析，其准确度达到了人类水平。Rietz 和 Maedche（<a href="https://arxiv.org/html/2403.08844v1#bib.bib34">2021</a>)提出了一种半自动化监督机器学习解决方案，该解决方案从人类注释者那里学习编码规则并将其应用于更广泛的数据集。此外，上述研究所采用的机器学习算法并未考虑到LLM的出现。商业平台ATLAS.ti（<a href="https://arxiv.org/html/2403.08844v1#bib.bib1">2023</a>)于 2023 年初宣布了其自动编码功能的测试版本，将定性文献分成段落，并使用 OpenAI 的 LLM 逐一进行编码。其他商业平台（如 elicit.org）也纷纷出现，主要使用 LLM 来自动化文献审查流程。不过，学界还需要更多地了解研究人员如何在这些平台上使用这些新的 AI 功能。此外，这些应用程序仅自动化了定性研究过程的一小部分，尚未深入到自动化扎根理论开发领域。这引出了一个问题：<strong>扎根理论开发是否可以通过 LLM 实现自动化</strong>。</p>
<p><br><br></p>
<h2 id="三大语言模型">三、大语言模型</h2>
<p>大型语言模型 (LLM) 是一种基于转换器模型的新技术平台，使用自我监督在大型数据集上进行预训练(做完形填空题)，这一过程可以理解为机器将语料中任意位置的单词盖住，让机器预测盖住的单词。通过这样的训练， 在数十亿个参数中编码一般和可转移的知识（Roberts 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib35">2020</a>) ）。这些预先训练的基础模型通常会进行微调以遵循指令（Ouyang 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib30">2022</a>) )，返回结构化输出，或具有对话性（如 ChatGPT 所示）。虽然 BERT 等较旧的模型通常被视为 LLM，但在本文中，该术语专门用于性能与 GPT-3 基础模型相似或更好的模型。随着 2022 年底 ChatGPT 的发布，LLM 已得到普及和大规模采用。它们已被应用于整个行业的流程自动化（Wulf 和 Meierhofer，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib40">2023</a>) , 第4页)。</p>
<p>与 LLM 的推理交互通常包括自然语言<strong>提示（Prompt，即输入）</strong>和<strong>完成（Completion，即响应）</strong>。在本文使用的 OpenAI 对话模型（GPT-3.5 及更新版本）中，推理提示可能包含多个<strong>消息（Message）</strong>：设置框架的通用系统消息，以及用户和助手消息的历史记录（请参阅附录 1-11 中的示例）。</p>
<p>无需进行微调，LLM 就能够从推理提示中的信息中学习和概括（Brown 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib7">2020</a>) )。一次性或少量学习是指在提示中传递样本，而零次学习是指不提供样本，但让模型完成明确的指令。这种推理与常见的特定于任务的微调形成对比，通常称为“<strong>情境学习</strong>(in-context learning)”（Dong et al.，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib12">2022</a>) )。</p>
<p>尽管经过了预先训练，LLM 在其参数中存储了大量隐性知识，但这些知识的深度和时效性仍然有限，需要昂贵的训练才能更新。因此，使用信息检索系统, 通常称为“检索增强生成”（Retrieval-Augmented Generation, RAG）的架构来增强 LLM 推理已被证明可以减少幻觉并提高事实性和可解释性（Lewis 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib27">2020</a>) )。作为通用模型，LLM 可以测量两个文本字符串之间的语义相似度。它们的相似度可以通过在其 LLM 内部向量表示上使用余弦相似度来高效地计算。RAG 使用这种直接的信息检索方式来连接检索到的相关文本以进行上下文学习（Lewis 等人，（<a href="https://arxiv.org/html/2403.08844v1#bib.bib27">2020</a>) )。通过从原始输入文档中检索信息来增强 LLM 推理可能有助于实现理论开发的自动化。</p>
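<p>上文提到的余弦相似度可以示意如下（向量为演示用的假设数据，实际场景应替换为 embedding 接口返回的文本向量）。</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """余弦相似度：两个向量夹角的余弦值，取值范围 [-1, 1]"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 演示用的低维向量；实际的 LLM 嵌入向量通常有数百上千维
v1 = [0.1, 0.9, 0.2]
v2 = [0.2, 0.8, 0.1]
print(round(cosine_similarity(v1, v2), 4))
```

RAG 正是利用这种相似度，从候选文本中检索与查询语义最接近的片段，再拼入提示供模型参考。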
<p><br><br></p>
<h2 id="四方法">四、方法</h2>
<p>为了探索目前使用 LLM 实现扎根理论开发自动化的可能性，本文提出、开发并测试了一款通过人工监督协调 LLM 推理的软件。 <strong>AcademiaOS 是一个供定性研究人员自动化扎根理论开发的平台</strong>。 该平台引导用户完成预定义的流程，虽然大多数数据分析和理论开发部分都是自动化的，但用户拥有监督和控制权。为了确保用户隐私和高可维护性，让未来的潜在开发人员和开源贡献者不必担心前端后端交互，大多数计算都在浏览器中本地执行，直接使用外部 API（例如 OpenAI 开发人员平台）进行 LLM 推理。</p>
<p><img loading="lazy" src="img/02-high-level-process-for-grounded-theory-develop-with-academiaos.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="五使用者评价">五、使用者评价</h2>
<p>已开展一项探索性定性调查，以评估用户与 AcademiaOS 的互动情况并指导未来的开发。通过便利抽样选择了具有定性研究背景的研究人员、专业人士和学生进行此次评估。参与者 (n=19) 是根据他们与平台的目标用户群的相关性而精心挑选的，因此被认为可以为评估 AcademiaOS 提供最具信息性和相关性的数据，并向他们提供了 Qualtrics 平台上的一项调查的链接。</p>
<p>参与者被要求反思他们当前的定性研究方法。他们报告使用了访谈、观察、调查和焦点小组等一手资料来源，以及案例研究、报告、荟萃分析、历史数据和专家意见等二手数据来源（附录 13）。</p>
<p>调查的第二部分旨在了解参与者如何看待平台的初始交互和功能探索。参与者普遍认为该平台“有点容易”学习，但存在一些差异（见附录 18）</p>
<p>研究参与者对编码过程普遍表达出“有些满意”到“非常满意”的评价，只有一个“非常不满意”的异常值（见附录 27）</p>
<p>当被问及 AcademiaOS 是否会影响他们的定性研究过程时，大多数参与者回答“可能”到“肯定”（附录 35），并提到了加快他们的研究过程（“快得多”、“它将加快研究速度”、“它将使编码和理论生成更快”）、充当灵感工具（“多次草稿迭代以启发/简化手动过程”、“我会用它来快速制作理论原型 [&hellip;]”、“[&hellip;] 比较并可能找到我以前错过的东西。”）和作为一般的研究支持（“它将敦促许多科学家提高他们的吞吐量 [&hellip;] 潜力以减轻人类的信息检索和保留”、“[&hellip;] 它将帮助我更容易地链接概念”、“让我更容易地进行研究，特别是在我无法集中注意力的时间”）（见附录 36）</p>
<table>
<thead>
<tr>
<th><strong>方面</strong></th>
<th><strong>主要发现</strong></th>
<th><strong>隐含/担忧</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>当前研究方法</td>
<td>使用多种一手资料和次要资料来源；采用各种数据收集和分析方法，包括 NLP 技术。</td>
<td>不断发展融合定性和定量元素的研究方法；需要先进的分析工具。</td>
</tr>
<tr>
<td>研究中的人工智能工具</td>
<td>使用 ChatGPT、PyTorch 等多种 AI 工具来完成头脑风暴和编码等任务；担心可靠性。</td>
<td>人工智能在研究中的重要性，以及对人工智能工具的准确性和可靠性的需求。</td>
</tr>
<tr>
<td>初次互动/探索</td>
<td>易用性参差不齐；改进 UI 和指导的建议；编码和理论开发功能的挑战。</td>
<td>需要更直观的用户界面和全面的用户指导。</td>
</tr>
<tr>
<td>可用性和满意度</td>
<td>对编码过程总体满意；多语言文档和内容变化带来的挑战。</td>
<td>改进文档检索和多样化内容编码的重要性。</td>
</tr>
<tr>
<td>理论发展</td>
<td>对理论发展感到满意，但担心研究问题的复杂性和相关性。</td>
<td>需要更简单、更有针对性的理论发展模型。</td>
</tr>
<tr>
<td>对研究的影响</td>
<td>对研究效率产生积极影响；对伦理影响、质量、偏见以及人工智能取代人类的担忧。</td>
<td>在人工智能实用性和道德考虑之间取得平衡；解决质量和偏见问题。</td>
</tr>
<tr>
<td>未来使用和建议</td>
<td>对 AcademiaOS 的未来感到兴奋；愿意继续使用和推荐该平台。</td>
<td>该平台有被更广泛采用和持续发展的潜力。</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="六局限性">六、局限性</h2>
<p>虽然AcademiaOS引入了一种自动化的扎根理论开发的新方法，但这项工作还存在几个局限性。</p>
<p><strong>首先，由于依赖大型语言模型（LLM），该系统继承了一些LLM的常见的局限性</strong>。Chen等人（2023）发现LLM在事实性问题的回答准确性方面比常见的信息检索系统表现较差，尤其是在少量示例的上下文中。不过，理论开发用例本身并不是一个要求极高准确度或信息量的知识生成任务，只要这些指标与连贯性、相关性、有用性和有效性一同具备即可，而这些都是Chen等人（2023）指出LLM表现良好的方面。由于LLM的输出开放性，有时会出现超出预期范围的完成情况，例如不正确的MermaidJS可视化脚本语法或错误的JSON字段。这只能通过编写更严格的提示来部分缓解（比如指定输出格式或给出具体示例）。Kocoń等人（2023）发现最先进的AI解决方案通常在常见的自然语言处理任务上胜过当前的LLM，这意味着在AcademiaOS使用LLM的某些功能（例如编码过程）上，专门化的模型也能表现得更好。但不同技术之间的基准比较不在本研究范围内。像GPT-4这样的模型存在的固有偏见（Bubeck等人，2023，第86-89页）可能对敏感话题构成挑战，例如处理受保护的属性。然而，鉴于扎根理论发展的理念是将任何假设都建立在精心挑选的数据源上，无论是人类还是机器推理，都几乎没有空间进行带有偏见的解读。</p>
<p><strong>其次，缺乏能让研究者身临其境、深入沉浸在研究环境中的场景信息</strong>。LLM只能部分地通过其广泛的通用知识来弥补这一点，这可能导致所开发理论中的误解或过度泛化。因此，定性研究可能会发展成为人类与机器推理的共同努力。Jiang等人（2021）研究了定性研究中的人机交互。他们指出了另一个可能的局限性：研究者可能不愿让AI消除他们研究中的“不确定性”。他们认为，研究者重视处理定性数据时的低效环节，例如，访谈编码中的错误会带来更高的偶发性和新的视角。自动化可能会阻碍这一过程。然而，调查参与者报告称期望AcademiaOS能帮助他们获得更多的意外成果（参见附录43）。Bouschery等人（2023）在与学术研究者采用相似方法的产品创新团队中探索了相同方面，并发现这些团队在与AI合作时可以从更大范围的问题和解决方案中受益。</p>
<p><strong>第三，存在数据隐私问题</strong>。AcademiaOS目前利用OpenAI开发者平台进行LLM推理。因此，出于伦理和法律原因，不应与外部实体共享的敏感数据无法用提议的系统处理。但是，为了确保数据隐私，系统可以被修改为在自托管的LLM上运行（例如Llama2或Mistral 7B实例），从而确保对数据的完全控制。</p>
<p><strong>最后， AcademiaOS所起的作用更多的是增强(Augment)，而非自动(Automate)</strong>。AcademiaOS是一种辅助研究人员理解数据并建模有趣模式的工具，但掌握这个工具的始终是研究者的思想， 思想是不能被AI自动化的。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 聚焦美股企业社会责任CSR Wire网站新闻数据集(1999-2024)</title>
      <link>https://textdata.cn/blog/2024-07-19-csrwise-dataset/</link>
      <pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-19-csrwise-dataset/</guid>
      <description>CSRWire（CSRwire）是一个成立于1999年的数字媒体平台，专注于提供有关企业社会责任（CSR）和可持续性的最新新闻、观点和报告。CSRWire是3BL网络的一部分，致力于帮助组织创建和分享与关键利益相关者（包括投资者、消费者、评级机构、非政府组织等）的可持续性和影响力内容。</description>
      <content:encoded><![CDATA[<blockquote>
<p>作者:  陈世强, 澳门大学</p>
</blockquote>
<p>CSRWire（CSRwire）是一个成立于1999年的数字媒体平台，专注于提供有关企业社会责任（CSR）和可持续性的最新新闻、观点和报告。CSRWire是3BL网络的一部分，致力于帮助组织创建和分享与关键利益相关者（包括投资者、消费者、评级机构、非政府组织等）的可持续性和影响力内容。</p>
<p><br><br></p>
<h2 id="一csrwire">一、CSRwire</h2>
<h3 id="11-数据集概况">1.1 数据集概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集: CSRwire
数据源:  https://www.csrwire.com/
记录条数:  43391条
所含字段: news_type, year, news_title, subtitle, news_published_date,
          news_author, news_content, company_name, company_info, link, image_src
覆盖日期: 1999-12-10 ~ 2024-01-26
覆盖市场: 美股
下载链接: https://pan.baidu.com/s/1Pp4qDMbdPZ-UyXn5cnnYDw?pwd=ayvu
</code></pre></div><p><br><br></p>
<h2 id="二实验">二、实验</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_stata</span><span class="p">(</span><span class="s1">&#39;CSR_newswire.dta&#39;</span><span class="p">)</span>
<span class="c1">#df = pd.read_csv(&#39;CSR_newswire.csv.gz&#39;)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-所含字段">2.2 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - news_type           #分类变量，用于标识新闻的类型或类别
 - year                #表示新闻发布或报道的年份
 - news_title          #新闻的标题
 - subtitle            #新闻的子标题或副标题
 - news_published_date #日期变量，记录新闻发布的确切日期
 - news_author         #字符串变量，包含撰写或发布新闻的作者姓名
 - news_content        #文本变量，包含新闻的完整内容或正文
 - company_name        #字符串变量，标识与新闻相关的公司或组织的名称，用于关联新闻与特定公司，便于分析特定公司的新闻报道和公关活动
 - company_info        #提供关于公司的背景信息，如公司简介、业务范围等
 - link                #指向新闻原始网页或文章的URL链接
 - image_src           #新闻配图的URL链接或文件路径
</code></pre></div><br>
<h3 id="23-覆盖日期">2.3 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;起: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;止: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_published_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">起: 1999-12-10
止: 2024-01-26
</code></pre></div><br>
<h3 id="24-新闻类型">2.4 新闻类型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;news_type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">news_type
Philanthropy               8998
Environmental Resources    6732
Sustainability             6363
Employee Engagement        5724
Diversity and Inclusion    3487
Research                   2954
Awards and Rankings        2883
Health and Wellness        1971
Finance                    1626
Technology                 1421
Education                  1232
                            157
Name: count, dtype: int64
</code></pre></div><p><br><br></p>
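<p>注意上面的输出中有 157 条记录的 news_type 为空白。分析前可先把这类空白值统一替换为缺失值，示意如下（数据为演示用的小样本）。</p>

```python
import numpy as np
import pandas as pd

# 演示用的小样本；实际使用时替换为 CSRwire 数据中的 df['news_type']
s = pd.Series(['Philanthropy', '  ', '', 'Finance'])

# 去掉首尾空白后，把空字符串统一替换为缺失值 NaN
cleaned = s.str.strip().replace('', np.nan)
print(cleaned.isna().sum())  # 2
```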
<h2 id="三相关文献">三、相关文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Moss, A., Naughton, J. P., &amp; Wang, C. (2024). The irrelevance of environmental, social, and governance disclosure to retail investors. Management Science, 70(4), 2626-2644.
[2]Assaf, C., Benlemlih, M., El Ouadghiri, I., &amp; Peillex, J. (2023). Does policy uncertainty affect non‐financial disclosure? Evidence from climate change‐related information. International Journal of Finance &amp; Economics.
[3]Anantharaman, D., Gao, F., &amp; Manchiraju, H. (2022). Does social responsibility begin at home? The relation between firms’ pension policies and corporate social responsibility (CSR) activities. Review of Accounting Studies, 27(1), 76-121.
[4]Dang, A., &amp; Nguyen, T. (2021). Valuation effect of emotionality in corporate philanthropy. Journal of Business Ethics, 173, 47-67.
[5]Benlemlih, M., Ge, J., &amp; Zhao, S. (2021). Undervaluation and non‐financial information: Evidence from voluntary disclosure of CSR news. Journal of Business Finance &amp; Accounting, 48(5-6), 785-814.
[6]Cho, S. Y., Kang, P. K., Lee, C., &amp; Park, C. (2020). Financial reporting conservatism and voluntary CSR disclosure. Accounting Horizons, 34(2), 63-82.
[7]Griffin, P. A., &amp; Sun, Y. (2013). Going green: Market reaction to CSRwire news releases. Journal of Accounting and Public Policy, 32(2), 93-113.
</code></pre></div><p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集(英文) | CBS News新闻数据集(1998 ~ 2024)</title>
      <link>https://textdata.cn/blog/2024-07-13-cbs-news-dataset/</link>
      <pubDate>Sat, 13 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-13-cbs-news-dataset/</guid>
      <description>新闻数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<h2 id="一cbs-news概况">一、CBS News概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名称: CBS News
数据来源: https://www.cbsnews.com/
覆盖日期: 1998-04-16 ~ 2024-06-30
所含字段:  date, title, content, author_link, publisher, link
记录条数: 190483
文件格式: csv
文件大小: 1475 M
</code></pre></div><p><img loading="lazy" src="img/cbs-main.jpg" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;CBS-News.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;起: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;止: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">起:  1998-04-16
止:  2024-06-30
</code></pre></div><br>
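<p>上面 <code>pd.to_datetime</code> 中的 <code>errors='coerce'</code> 参数会把无法解析的日期转为缺失值 NaT，而不是直接报错，可用下面的小例子验证（数据为演示假设）。</p>

```python
import pandas as pd

# 'not a date' 无法解析，errors='coerce' 会将其置为 NaT
dates = pd.to_datetime(pd.Series(['1998-04-16', 'not a date', '2024-06-30']),
                       errors='coerce')
print(dates.isna().sum())  # 1
```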
<h3 id="23-所含字段">2.3 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">date #日期
title #标题
content #新闻内容
author_link  #作者主页链接
publisher #出版社
link  #文章链接
</code></pre></div><br>
<h3 id="24-发文量统计">2.4 发文量统计</h3>
<p>按月度统计CBS News的发文量。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">month_volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">month_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="n">month_volumes</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">date</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">month_df</span><span class="p">)))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">month_volumes</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
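<p>上面按月分组统计的写法，也可以用 <code>resample</code> 一步完成，效果等同于 <code>groupby(pd.Grouper(key='date', freq='M'))</code>（下面的日期数据为演示假设）。</p>

```python
import pandas as pd

# 演示数据：几条带日期的记录；实际使用时替换为 CBS News 的 df
demo = pd.DataFrame({'date': pd.to_datetime(
    ['2024-01-05', '2024-01-20', '2024-02-11', '2024-03-02', '2024-03-30'])})

# 以日期为索引，按月末('M')重采样并统计每月记录条数
data = (demo.set_index('date')
            .resample('M')
            .size()
            .reset_index(name='count'))
print(data)
```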
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">date_breaks</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m&#39;</span><span class="p">)</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> 
                                                          <span class="n">end</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">(),</span> 
                                                          <span class="n">freq</span> <span class="o">=</span> <span class="s1">&#39;12M&#39;</span><span class="p">)]</span>

<span class="n">date_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">date_breaks</span><span class="p">]</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;CBS News月度发文量(1998.4 ~ 2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;月度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="n">date_breaks</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">date_labels</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三说明">三、说明</h2>
<p>我们都知道六度分隔理论(通过任意六个人的关系链，我们能联系到世界上任意一个人)。 类比到爬虫场景，采用广度递归并将最大采集深度设为 7，理论上意味着通过点击 7 次链接即可触达任意一个页面。 <a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">ChinaDaily</a>、 <a href="https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/">UsaToday</a>、 <a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">Entrepreneur</a> 与 CBS News 均采用 scrapy 广度递归采集，最大深度为 7。</p>
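<p>「广度递归 + 最大深度」的思路可以用一个最小示例勾勒(以内存中的玩具链接图代替真实网站，站点结构为假设；Scrapy 实际是通过 <code>DEPTH_LIMIT</code> 设置限制深度)：</p>

```python
from collections import deque

def bfs_crawl(graph, start, max_depth=7):
    """广度优先遍历链接图; 达到 max_depth 的页面不再向外扩展"""
    seen = {start: 0}          # 页面 -> 首次触达时的深度
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if seen[url] >= max_depth:
            continue
        for nxt in graph.get(url, []):
            if nxt not in seen:
                seen[nxt] = seen[url] + 1
                queue.append(nxt)
    return seen

# 玩具链接图(假设): 首页 -> 列表页 -> 文章页
site = {'home': ['list1', 'list2'],
        'list1': ['a1', 'a2'],
        'list2': ['a3'],
        'a3': ['deep']}

depths = bfs_crawl(site, 'home', max_depth=2)
# 'deep' 距首页 3 次点击, 超过最大深度 2, 不会被触达
```

<p>广度优先保证浅层(离首页近)的页面先被采集， 这也是为什么最大深度 7 能近似覆盖全站。</p>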
<p>但从月度统计中可以看出， CBS News 有很多月份的发文量周期性地接近 0。 网站本身一般不会出现这种周期性断档， 大概率说明采集过程遇到了问题。</p>
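<p>排查这类断档时， 可以直接筛出发文量接近 0 的月份再逐月核对。 下面用假设的小样本演示思路(阈值 10 为示意值)：</p>

```python
import pandas as pd

# 示意数据(假设): 偶数月发文量接近 0, 模拟采集断档
data = pd.DataFrame({
    'date': pd.date_range('2023-01-31', periods=6, freq='M'),
    'count': [500, 2, 480, 0, 510, 1],
})

# 筛出可疑月份, 便于回头核对原始采集日志
low_months = data[data['count'] < 10]
print(low_months['date'].dt.strftime('%Y-%m').tolist())
# ['2023-02', '2023-04', '2023-06']
```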
<p><br><br></p>
<h2 id="四获取数据">四、获取数据</h2>
<p>虽然数据采集出现了问题，但该 csv 数据结构整洁、体量较大， 特别适合各位拿来练习 Python 文本分析。</p>
<blockquote>
<p>CBS News链接: <a href="https://pan.baidu.com/s/1DlCo3PRnzcG1iZ_7V7PVlg?pwd=i4rr">https://pan.baidu.com/s/1DlCo3PRnzcG1iZ_7V7PVlg?pwd=i4rr</a> 提取码: i4rr</p>
</blockquote>
<br>
<h3 id="注意">注意</h3>
<p>如Excel打开csv乱码， 请百度搜「在 Excel 中正确打开 CSV UTF-8 文件」</p>
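<p>也可以在导出时直接使用带 BOM 的 <code>utf-8-sig</code> 编码， Excel 双击打开即可正确识别(示例中的数据与文件路径均为假设)：</p>

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'title': ['示例标题'], 'content': ['示例内容']})

path = os.path.join(tempfile.gettempdir(), 'sample_for_excel.csv')
# utf-8-sig 会在文件开头写入 BOM(EF BB BF), Excel 据此识别为 UTF-8
df.to_csv(path, index=False, encoding='utf-8-sig')

with open(path, 'rb') as f:
    bom = f.read(3)
print(bom == b'\xef\xbb\xbf')  # True, 说明 BOM 已写入
```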
<br>
<br>
<h2 id="五相关内容">五、相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/"><strong>数据集(英文) | USA Today新闻数据集(2012~2024)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集(中英) | ChinaDaily新闻数据集(2008 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家Entrepreneur杂志数据集(1996 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></li>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></li>
</ul>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | ChinaDaily 新闻数据集(2008 ~ 2024)</title>
      <link>https://textdata.cn/blog/2024-07-12-china-daily-dataset/</link>
      <pubDate>Fri, 12 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-12-china-daily-dataset/</guid>
      <description>新闻数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<h2 id="一china-daily概况">一、「China Daily」概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源: chinadaily.com
覆盖日期: 2008-10-24 ~ 2024-06-29
所含字段:  date, title, content, source, link, img, lang
记录条数: 847854
     - 英文  697241
     - 中文  150613  
  
文件格式: csv
文件大小: 2648M

本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/china-daily-main.jpg" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;ChinaDaily.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;起: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;止: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">起:  2008-10-24
止:  2024-06-29
</code></pre></div><br>
<h3 id="23-所含字段">2.3 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">date #日期
title #标题
content #新闻内容
source  #来源
link  #新闻链接
img  #新闻首图链接
lang #语言chinese、english
</code></pre></div><br>
<h3 id="24-语言">2.4 语言</h3>
<p>China Daily 是双语网站， 数据集中大多为英文新闻，也含少量中文内容。 中英文新闻的记录数如下:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;lang&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">lang
english    697241
chinese    150613
</code></pre></div><br>
<h3 id="25-月度发文量">2.5 月度发文量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">months</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">month_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="c1">#print(date)</span>
    <span class="n">months</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">date</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">month_df</span><span class="p">)))</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">months</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">date_breaks</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m&#39;</span><span class="p">)</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> 
                                                          <span class="n">end</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">(),</span> 
                                                          <span class="n">freq</span> <span class="o">=</span> <span class="s1">&#39;12M&#39;</span><span class="p">)]</span>

<span class="n">date_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">date_breaks</span><span class="p">]</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;China Daily月度发文量(2008.10 ~ 2024.06)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;月度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="n">date_breaks</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">date_labels</span><span class="p">)</span>
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/03-plot.png" alt=""  />
</p>
<br>
<h3 id="注意">注意</h3>
<p>如Excel打开csv乱码， 请百度搜【在 Excel 中正确打开 CSV UTF-8 文件】</p>
<!--所采集的数据并非China Daily全部内容，但采用广度递归采集， 最大深度为7，能看做是对ChinaDaily进行的大规模抽样。-->
<p><br><br></p>
<h2 id="三数据用途">三、数据用途</h2>
<p>新闻数据集可提取丰富的指标，包括但不限于 <strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、<strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>。此外，还可训练词向量、开发新的概念词典。数据带有时间字段，参照前述指标，按主体、日期计算，即可构造面板数据，进而构建新的指标指数。因此该数据集在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
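<p>以媒体关注度、EPU 一类指标为例， 核心做法是统计含特定关键词的文章在当月文章中的占比。 下面用假设的小样本勾勒思路(关键词与数据均为示意)：</p>

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20',
                            '2023-02-11', '2023-02-25']),
    'content': ['economic policy uncertainty rises',
                'uncertainty weighs on markets',
                'policy uncertainty persists',
                'sports news roundup'],
})

# 标记命中关键词的文章
df['hit'] = df['content'].str.contains('uncertainty').astype(int)

# 按月聚合: 提及文章数 / 当月总文章数
monthly = df.groupby(pd.Grouper(key='date', freq='M'))['hit'].agg(['sum', 'count'])
monthly['index'] = monthly['sum'] / monthly['count']
print(monthly['index'].tolist())  # [1.0, 0.5]
```

<p>把「月」换成「主体 × 月」分组， 即可得到面板数据。</p>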
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/"><strong>数据集(英文) | USA Today新闻数据集(2012~2024)</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家Entrepreneur杂志数据集(1996 ~ 2024)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 企业家 Entrepreneur 杂志数据集(1996 ~ 2024)</title>
      <link>https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/</link>
      <pubDate>Fri, 12 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-12-entrepreneur-dataset/</guid>
      <description>新闻数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
<content:encoded><![CDATA[<h2 id="一enterpreneur概况">一、Entrepreneur概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名称: 企业家杂志
数据来源: https://www.entrepreneur.com/
覆盖日期: 1996-01-01 ~ 2024-06-28
所含字段:  date, title, content, link
记录条数: 95813
文件格式: csv
文件大小: 1418 M
本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/entrepreneur-elon-musk.jpeg" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;Entrepreneur.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-覆盖日期">2.2 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;起: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;止: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">起:  1996-01-01
止:  2024-06-28
</code></pre></div><br>
<h3 id="23-所含字段">2.3 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">date #日期
title #标题
content #新闻内容
link  #新闻链接
</code></pre></div><br>
<h3 id="24-发文量统计">2.4 发文量统计</h3>
<p>下面按月统计企业家杂志的发文量。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">month_volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">month_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="n">month_volumes</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">date</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">month_df</span><span class="p">)))</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">month_volumes</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">date_labels</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1996</span><span class="p">,</span> <span class="mi">2025</span><span class="p">,</span> <span class="mi">2</span><span class="p">)]</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;企业家Entrepreneur杂志(1996.1-2024.6.28)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;月度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">date_labels</span><span class="p">,</span> <span class="n">breaks</span><span class="o">=</span><span class="n">date_labels</span><span class="p">)</span>
   
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/03-plot.png" alt=""  />
</p>
<br>
<h3 id="注意">注意</h3>
<p>如Excel打开csv乱码， 请百度搜「在 Excel 中正确打开 CSV UTF-8 文件」</p>
<p><br><br></p>
<h2 id="三数据用途">三、数据用途</h2>
<p>企业家杂志数据集最贴近的领域是与企业家相关的创新创业研究， 可通过文本考察全球企业家的创新创业活动。</p>
<p>当然也可将该数据集看作新闻数据集， 从中提取丰富的指标，包括但不限于 <strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、<strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>。此外，还可训练词向量、开发新的概念词典。数据带有时间字段，参照前述指标，按主体、日期计算，即可构造面板数据，进而构建新的指标指数。因此在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/"><strong>数据集(英文) | USA Today新闻数据集(2012~2024)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集(中英) | ChinaDaily新闻数据集(2008 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></li>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用Ollama本地大模型DIY制作单词书教案PDF</title>
      <link>https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/</link>
      <pubDate>Wed, 10 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/</guid>
      <description>&lt;h2 id=&#34;一任务描述&#34;&gt;一、任务描述&lt;/h2&gt;
&lt;p&gt;前几天分享了 &lt;a href=&#34;https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/&#34;&gt;实验 | 使用本地大模型从文本中提取结构化信息&lt;/a&gt; ，今天实验一个成功率更高的使用场景，生成单词书教案PDF。&lt;br&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;假设你是英语老师，你希望在单词书中增加历史文化方面的信息， 市面上的单词书并不能很好的满足你的需要。针对这一需求， 我们可以利用大模型，定制你的单词书教案。例如单词 &lt;em&gt;&lt;strong&gt;abandon&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-pixyII.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-pdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二ollama介绍&#34;&gt;二、Ollama介绍&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://ollama.ai/&#34;&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/a&gt;是一款开源应用程序，可让您在 macOS、Linux 和 Windows 上通过命令行界面本地运行、创建和共享大型语言模型。&lt;/p&gt;
&lt;p&gt;Ollama 可以直接从其模型库中获取各种 LLM：一条命令即可下载，下载后再执行一条命令即可开始使用。这对主要在终端中工作的用户非常方便：遇到问题时，不必切换到浏览器窗口即可获得答案。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;21-特点和优点&#34;&gt;2.1 特点和优点&lt;/h3&gt;
&lt;p&gt;以下是 Ollama 值得放入工具箱的原因：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;简单&lt;/strong&gt; ：Ollama 提供简单的设置过程。您无需拥有机器学习博士学位即可启动和运行它。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;成本效益&lt;/strong&gt; ：在本地运行模型意味着您无需支付云成本。您的钱包会感谢您。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;隐私&lt;/strong&gt; ：使用 Ollama，所有数据处理都在您的本地机器上进行。这对于用户隐私来说是一个巨大的胜利。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;多功能性&lt;/strong&gt; ：Ollama 不只是为 Python 爱好者准备的。它的灵活性使其可以用于各种应用程序，包括 Web 开发。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-使用-ollama-进行-llm-选择&#34;&gt;2.2 使用 Ollama 进行 LLM 选择&lt;/h3&gt;
&lt;p&gt;在有经费、网络通畅、且不担心数据泄露的条件下，若追求最佳性能，可考虑使用 OpenAI 的 GPT-4，或稍便宜的 GPT-3.5。&lt;/p&gt;
&lt;p&gt;但本文要做的是 &lt;strong&gt;本地部署&lt;/strong&gt;，因此我们将使用 Meta Llama 3，它是目前最强大的开源 LLM 之一。Meta Llama 3 是 Meta 开发的最新模型系列，提供 8B 和 70B 两种参数规模（预训练或指令微调版本）。Llama 3 指令微调模型针对对话/聊天用例做了微调和优化，在常见基准测试中胜过许多开源聊天模型。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-llama3-performance.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-llama3-performance.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h2 id=&#34;二准备工作&#34;&gt;三、准备工作&lt;/h2&gt;
&lt;h3 id=&#34;21-安装ollama&#34;&gt;3.1 安装Ollama&lt;/h3&gt;
&lt;p&gt;前往官网 &lt;a href=&#34;https://ollama.com/&#34;&gt;https://ollama.com/&lt;/a&gt; 下载 Ollama 安装包，支持 Windows、macOS、Linux。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-ollama-gui.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-下载llm&#34;&gt;3.2 下载LLM&lt;/h3&gt;
&lt;p&gt;Ollama 目前支持多种大模型，如阿里的 qwen、qwen2，Meta 的 llama3 等。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-ollama-model.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;以llama3为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就试着安装，不合适不能用再删除即可。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-ollama-llama3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;打开命令行 cmd（Mac 为 terminal），确保网络连接正常，执行模型下载（安装）命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama pull llama3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;等待 &lt;strong&gt;llama3:8b&lt;/strong&gt; 下载完成。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-安装python包&#34;&gt;3.3 安装Python包&lt;/h3&gt;
&lt;p&gt;在python中调用ollama服务，需要ollama包。&lt;/p&gt;
&lt;p&gt;打开命令行 cmd（Mac 为 terminal），确保网络连接正常，执行安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install ollama
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-启动ollama服务&#34;&gt;3.4 启动ollama服务&lt;/h3&gt;
&lt;p&gt;在 Python 中调用本地 ollama 服务，需要先启动它：打开命令行 cmd（Mac 为 terminal），执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2024/06/14 14:52:24 routes.go:1011: INFO server config env=&amp;#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&amp;#34;total blobs: 18&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&amp;#34;total unused blobs removed: 0&amp;#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&amp;#34;Listening on 127.0.0.1:11434 (version 0.1.44)&amp;#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&amp;#34;extracting embedded files&amp;#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&amp;#34;Dynamic LLM libraries [metal]&amp;#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&amp;#34;inference compute&amp;#34; id=0 library=metal compute=&amp;#34;&amp;#34; driver=0.0 name=&amp;#34;&amp;#34; total=&amp;#34;72.0 GiB&amp;#34; available=&amp;#34;72.0 GiB&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验&#34;&gt;四、实验&lt;/h2&gt;
&lt;h3 id=&#34;31-代码结构&#34;&gt;4.1 代码结构&lt;/h3&gt;
&lt;p&gt;点击下载&lt;a href=&#34;project.zip&#34;&gt;本文代码&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;project
  - 代码.ipynb #代码
  - prompt.txt #提示模板
  - words.csv  #准备的单词列表
  - word-dictionary.csv  #生成的单词书
  - Your-Diy-Dictionary.md #生成的带主题样式的单词书
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-设计提示&#34;&gt;4.2 设计提示&lt;/h3&gt;
&lt;p&gt;需要根据单词，生成单词、音标、语义、例句、历史文化、相关单词等信息， 提示如下，&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;单词：
--- 
{word} 
---
   
你是一名中英文双语教育专家，拥有帮助将中文视为母语的用户理解和记忆英语单词的专长，请根据用户提供的英语单词{word}完成任务。
                
# {word}
markdown一级标题#
[美音]美国音标，斜体加粗
                
## 语义
- 系统地分析用户提供的单词，并以简单易懂的方式解答；
                
## 例句
- 为该单词提供至少 3 个不同场景下的使用方法和例句。并且附上中文翻译，以帮助用户更深入地理解单词意义。其中英文例句加粗斜体！
                
## 历史文化
- 详细介绍单词的造词来源和发展历史，以及在欧美文化中的内涵
                
## 相关单词
- 列出单词对应的名词、单复数、动词、不同时态、形容词、副词等的变形以及对应的中文翻译。
               
## 词组搭配
- 列出单词对应的固定搭配、组词以及对应的中文翻译。


注意: 如非特别说明尽量用中文，结果返回markdown格式; 均为二级标题##， 无序列表用-而不是*。
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;该提示已存储到 &lt;em&gt;&lt;strong&gt;prompt.txt&lt;/strong&gt;&lt;/em&gt; 内。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-小实验&#34;&gt;4.3 小实验&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取提示&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prompt.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;diy_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;llama3:8b&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;system&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
        &lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

    &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;


&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;diy_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;march&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# March


[美音] /mɑːrtʃ/


## 语义
March 是指第三个月份，但它也可以用于其他场景：
- 在军事或政治上，March 可以表示进军、推动或实施某些措施。
- 在生活中，March 可以表示开始新的项目或计划。

## 例句
* **_The company will march into the new market next quarter._** - 公司将在下一个季度进入新市场。
* **_She&amp;#39;s been marching towards her goals for years, and now she&amp;#39;s finally achieved them._** - 她多年来一直朝着目标努力，现在终于实现了。
* **_The company will march into bankruptcy if they don&amp;#39;t receive new funding._** - 如果他们不能获得新的资金，公司将面临破产。

## 历史文化
March 是英语中的一个月份词语，源于古罗马语言。古罗马人将一年分为 12 个月，每个月份都有特定的名称和特征。 March 就是指春季的开端，是一月到三月的最后一个月份。

## 相关单词
- Noun: march, marches
- Verb: to march, marched, marching
- Adjective: march-like, martial
- Idiom: take a step forward (向前进步), take the initiative (采取主动)

## 词组搭配
- &amp;#34;take a step forward&amp;#34; (向前进步)
- &amp;#34;march towards&amp;#34; (朝着目标努力)
- &amp;#34;march into&amp;#34; (进入某个领域或状态)


Note: As a Chinese-English bilingual expert, I will provide the pronunciation in the American English accent and use markdown formatting.


CPU times: user 2.97 ms, sys: 2.83 ms, total: 5.8 ms
Wall time: 7.61 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-读取词表&#34;&gt;4.4 读取词表&lt;/h3&gt;
&lt;p&gt;假设你需要背 &lt;a href=&#34;words.csv&#34;&gt;&lt;em&gt;&lt;strong&gt;words.csv&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;中的单词，&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;words.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;35-批量生成&#34;&gt;4.5 批量生成&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;csv&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取提示&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prompt.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;diy_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;llama3:8b&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;system&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
        &lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

    &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;
  


&lt;span class=&#34;c1&#34;&gt;#读取词表&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;words.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Dictionary&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;diy_dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#保存成csv和md&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word-dictionary.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Your-Diy-Dictionary.md&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;mdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Dictionary&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;有些小失望， 如音标有的是 &lt;code&gt;[美音]&lt;/code&gt;，另一些是 &lt;code&gt;**美音**&lt;/code&gt;， 格式还不够统一。&lt;/p&gt;
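针对这种格式不统一的问题，可以在生成后做一步简单的后处理。下面是一个最小示意（normalize_phonetic 是本文示范用的假设函数，并非本文代码的一部分），用正则把几种常见写法统一为 [美音]：

```python
import re

def normalize_phonetic(text):
    # 把模型输出中不统一的音标标记（如 **美音**、**[美音]**）统一为 [美音]
    return re.sub(r'\*\*\[?美音\]?\*\*', '[美音]', text)

print(normalize_phonetic('**美音** /mɑːrtʃ/'))  # 输出: [美音] /mɑːrtʃ/
```

批量生成后，对整列执行 df['Dictionary'].apply(normalize_phonetic) 即可统一格式（假设 Dictionary 列已生成）。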
&lt;br&gt;
&lt;h3 id=&#34;36-生成单词书&#34;&gt;4.6 生成单词书&lt;/h3&gt;
&lt;h4 id=&#34;361-选择主题&#34;&gt;4.6.1 选择主题&lt;/h4&gt;
&lt;p&gt;打开 &lt;strong&gt;Typora&lt;/strong&gt;(一种markdown软件)， 选择一种自己喜欢的 &lt;strong&gt;主题Theme&lt;/strong&gt; ，&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-pixyII.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-hara.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-seniva.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;362-导出pdf&#34;&gt;4.6.2 导出PDF&lt;/h4&gt;
&lt;p&gt;依次&lt;strong&gt;文件&amp;ndash;&amp;gt;导出&amp;ndash;&amp;gt;PDF或HTML&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-pdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四讨论&#34;&gt;五、讨论&lt;/h2&gt;
&lt;p&gt;本文展示了如何利用 Ollama 制作单词书教案，各位可以结合自身学习工作的需要，开发更多应用场景。如果这份利用 Ollama 自制教案的方法对你有帮助，欢迎转发分享给你的朋友。点击下载&lt;a href=&#34;project.zip&#34;&gt;本文代码&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-06-16-scrapegraph-ai/&#34;&gt;网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库 cntext 使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/&#34;&gt;实验 | 使用本地大模型从文本中提取结构化信息&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一任务描述">一、任务描述</h2>
<p>前几天分享了 <a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a> ，今天实验一个成功率更高的使用场景，生成单词书教案PDF。<br></p>
<br>
<p>假设你是英语老师，希望在单词书中增加历史文化方面的信息，而市面上的单词书并不能很好地满足这一需要。针对这一需求，我们可以利用大模型，定制你的单词书教案。例如单词 <em><strong>abandon</strong></em></p>
<p><img loading="lazy" src="img/07-pixyII.png" alt=""  />
</p>
<p><img loading="lazy" src="img/07-pdf.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二ollama介绍">二、Ollama介绍</h2>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，可让您在 macOS、Linux 和 Windows 上通过命令行界面本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其模型库中获取各种 LLM：一条命令即可下载，下载后再执行一条命令即可开始使用。这对主要在终端中工作的用户非常方便：遇到问题时，不必切换到浏览器窗口即可获得答案。</p>
<br>
<h3 id="21-特点和优点">2.1 特点和优点</h3>
<p>以下是 Ollama 值得放入工具箱的原因：</p>
<ul>
<li><strong>简单</strong> ：Ollama 提供简单的设置过程。您无需拥有机器学习博士学位即可启动和运行它。</li>
<li><strong>成本效益</strong> ：在本地运行模型意味着您无需支付云成本。您的钱包会感谢您。</li>
<li><strong>隐私</strong> ：使用 Ollama，所有数据处理都在您的本地机器上进行。这对于用户隐私来说是一个巨大的胜利。</li>
<li><strong>多功能性</strong> ：Ollama 不只是为 Python 爱好者准备的。它的灵活性使其可以用于各种应用程序，包括 Web 开发。</li>
</ul>
<br>
<h3 id="22-使用-ollama-进行-llm-选择">2.2 使用 Ollama 进行 LLM 选择</h3>
<p>在有经费、网络通畅、且不担心数据泄露的条件下，若追求最佳性能，可考虑使用 OpenAI 的 GPT-4，或稍便宜的 GPT-3.5。</p>
<p>但本文要做的是 <strong>本地部署</strong>，因此我们将使用 Meta Llama 3，它是目前最强大的开源 LLM 之一。Meta Llama 3 是 Meta 开发的最新模型系列，提供 8B 和 70B 两种参数规模（预训练或指令微调版本）。Llama 3 指令微调模型针对对话/聊天用例做了微调和优化，在常见基准测试中胜过许多开源聊天模型。</p>
<p><img loading="lazy" src="img/01-llama3-performance.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-llama3-performance.png" alt=""  />
</p>
<h2 id="二准备工作">三、准备工作</h2>
<h3 id="21-安装ollama">3.1 安装Ollama</h3>
<p>前往官网 <a href="https://ollama.com/">https://ollama.com/</a> 下载 Ollama 安装包，支持 Windows、macOS、Linux。</p>
<p><img loading="lazy" src="img/03-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="22-下载llm">3.2 下载LLM</h3>
<p>Ollama 目前支持多种大模型，如阿里的 qwen、qwen2，Meta 的 llama3 等。</p>
<p><img loading="lazy" src="img/04-ollama-model.png" alt=""  />
</p>
<br>
<p>以llama3为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就试着安装，不合适不能用再删除即可。</p>
<p><img loading="lazy" src="img/05-ollama-llama3.png" alt=""  />
</p>
<br>
<p>打开命令行 cmd（Mac 为 terminal），确保网络连接正常，执行模型下载（安装）命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3
</code></pre></div><p>等待 <strong>llama3:8b</strong> 下载完成。</p>
<br>
<h3 id="23-安装python包">3.3 安装Python包</h3>
<p>在python中调用ollama服务，需要ollama包。</p>
<p>打开命令行 cmd（Mac 为 terminal），确保网络连接正常，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
</code></pre></div><br>
<h3 id="24-启动ollama服务">3.4 启动ollama服务</h3>
<p>在 Python 中调用本地 ollama 服务，需要先启动它：打开命令行 cmd（Mac 为 terminal），执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2024/06/14 14:52:24 routes.go:1011: INFO server config env=&#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&#34;total blobs: 18&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&#34;total unused blobs removed: 0&#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&#34;Listening on 127.0.0.1:11434 (version 0.1.44)&#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&#34;extracting embedded files&#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&#34;Dynamic LLM libraries [metal]&#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&#34;inference compute&#34; id=0 library=metal compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。</p>
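在 Python 脚本里，也可以先检测本地 ollama 服务是否已开启，再发起请求。下面是一个简单的示意（11434 是 ollama 的默认端口；端口检测只能说明该端口有服务在监听，并不保证一定是 ollama）：

```python
import socket

def service_listening(host='127.0.0.1', port=11434, timeout=1.0):
    # 尝试与指定端口建立 TCP 连接，成功说明有服务在监听
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print('本地 11434 端口是否有服务:', service_listening())
```

若返回 False，先到命令行执行 ollama serve 再运行后续代码。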
<p><br><br></p>
<h2 id="三实验">四、实验</h2>
<h3 id="31-代码结构">4.1 代码结构</h3>
<p>点击下载<a href="project.zip">本文代码</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">project
  - 代码.ipynb #代码
  - prompt.txt #提示模板
  - words.csv  #准备的单词列表
  - word-dictionary.csv  #生成的单词书
  - Your-Diy-Dictionary.md #生成的带主题样式的单词书
</code></pre></div><br>
<h3 id="32-设计提示">4.2 设计提示</h3>
<p>需要根据单词，生成单词、音标、语义、例句、历史文化、相关单词等信息， 提示如下，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">单词：
--- 
{word} 
---
   
你是一名中英文双语教育专家，拥有帮助将中文视为母语的用户理解和记忆英语单词的专长，请根据用户提供的英语单词{word}完成任务。
                
# {word}
markdown一级标题#
[美音]美国音标，斜体加粗
                
## 语义
- 系统地分析用户提供的单词，并以简单易懂的方式解答；
                
## 例句
- 为该单词提供至少 3 个不同场景下的使用方法和例句。并且附上中文翻译，以帮助用户更深入地理解单词意义。其中英文例句加粗斜体！
                
## 历史文化
- 详细介绍单词的造词来源和发展历史，以及在欧美文化中的内涵
                
## 相关单词
- 列出单词对应的名词、单复数、动词、不同时态、形容词、副词等的变形以及对应的中文翻译。
               
## 词组搭配
- 列出单词对应的固定搭配、组词以及对应的中文翻译。


注意: 如非特别说明尽量用中文，结果返回markdown格式; 均为二级标题##， 无序列表用-而不是*。
</code></pre></div><p>该提示已存储到 <em><strong>prompt.txt</strong></em> 内。</p>
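顺带一提，提示里的 {word} 占位符也可以在发送给模型之前直接用 Python 填充，而不是依赖模型自行对应。一个最小示意如下（这里用一小段内嵌字符串代替 prompt.txt 的完整内容）：

```python
# 简化的模板，代替 prompt.txt 的完整内容
template = '单词：\n---\n{word}\n---\n请根据用户提供的英语单词{word}完成任务。'

def fill_prompt(template, word):
    # 用 str.replace 填充占位符；模板中若含其他花括号，比 str.format 更稳妥
    return template.replace('{word}', word)

print(fill_prompt(template, 'march'))
```

填充后的字符串可直接作为 system 消息传入 ollama.chat。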
<br>
<h3 id="33-小实验">4.3 小实验</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">ollama</span>

<span class="c1">#读取提示</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">diy_dictionary</span><span class="p">(</span><span class="n">word</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3:8b&#39;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
          <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
          <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">word</span><span class="p">},</span>
        <span class="p">])</span>

    <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">result</span>


<span class="nb">print</span><span class="p">(</span><span class="n">diy_dictionary</span><span class="p">(</span><span class="n">word</span> <span class="o">=</span> <span class="s1">&#39;march&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># March


[美音] /mɑːrtʃ/


## 语义
March 是指第三个月份，但它也可以用于其他场景：
- 在军事或政治上，March 可以表示进军、推动或实施某些措施。
- 在生活中，March 可以表示开始新的项目或计划。

## 例句
* **_The company will march into the new market next quarter._** - 公司将在下一个季度进入新市场。
* **_She&#39;s been marching towards her goals for years, and now she&#39;s finally achieved them._** - 她多年来一直朝着目标努力，现在终于实现了。
* **_The company will march into bankruptcy if they don&#39;t receive new funding._** - 如果他们不能获得新的资金，公司将面临破产。

## 历史文化
March 是英语中的一个月份词语，源于古罗马语言。古罗马人将一年分为 12 个月，每个月份都有特定的名称和特征。 March 就是指春季的开端，是一月到三月的最后一个月份。

## 相关单词
- Noun: march, marches
- Verb: to march, marched, marching
- Adjective: march-like, martial
- Idiom: take a step forward (向前进步), take the initiative (采取主动)

## 词组搭配
- &#34;take a step forward&#34; (向前进步)
- &#34;march towards&#34; (朝着目标努力)
- &#34;march into&#34; (进入某个领域或状态)


Note: As a Chinese-English bilingual expert, I will provide the pronunciation in the American English accent and use markdown formatting.


CPU times: user 2.97 ms, sys: 2.83 ms, total: 5.8 ms
Wall time: 7.61 s
</code></pre></div><br>
<h3 id="34-读取词表">3.4 读取词表</h3>
<p>假设你需要背 <a href="words.csv"><em><strong>words.csv</strong></em></a> 中的单词：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;words.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="35-批量生成">3.5 批量生成</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#读取提示</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">diy_dictionary</span><span class="p">(</span><span class="n">word</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3:8b&#39;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
          <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
          <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">word</span><span class="p">},</span>
        <span class="p">])</span>

    <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">result</span>
  


<span class="c1">#读取词表</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;words.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;Word&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">diy_dictionary</span><span class="p">)</span>


<span class="c1">#保存成csv和md</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;word-dictionary.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;Your-Diy-Dictionary.md&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">mdf</span><span class="p">:</span>
    <span class="n">mdf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">&#39;&lt;br&gt;&lt;br&gt;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]))</span>

<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<p>有些小失望， 如音标有的是 <code>[美音]</code>，另一些是 <code>**美音**</code>， 格式还不够统一。</p>
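<p>格式不统一的问题可以在生成之后用正则做一次后处理。下面是一个最小示意（假设只出现 <code>[美音]</code> 和 <code>**美音**</code> 这两种写法，函数名为笔者自拟）：</p>

```python
import re

def normalize_accent_marker(text: str) -> str:
    """把 **美音** 统一替换为 [美音]。仅覆盖本文观察到的两种写法（假设）。"""
    return re.sub(r'\*\*\s*美音\s*\*\*', '[美音]', text)

print(normalize_accent_marker('**美音** /mɑːrtʃ/'))   # → [美音] /mɑːrtʃ/
```

对 <code>df['Dictionary']</code> 整列应用该函数（<code>df['Dictionary'].apply(normalize_accent_marker)</code>），即可在导出 markdown 前统一音标格式。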
<br>
<h3 id="36-生成单词书">3.6 生成单词书</h3>
<h4 id="361-选择主题">3.6.1 选择主题</h4>
<p>打开 <strong>Typora</strong>（一款 Markdown 编辑器），选择一种自己喜欢的<strong>主题 Theme</strong>：</p>
<p><img loading="lazy" src="img/07-pixyII.png" alt=""  />
</p>
<p><img loading="lazy" src="img/07-hara.png" alt=""  />
</p>
<p><img loading="lazy" src="img/07-seniva.png" alt=""  />
</p>
<br>
<h4 id="362-导出pdf">3.6.2 导出pdf</h4>
<p>依次<strong>文件&ndash;&gt;导出&ndash;&gt;PDF或HTML</strong></p>
<p><img loading="lazy" src="img/07-pdf.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四讨论">四、讨论</h2>
<p>在本文中，我们展示了如何利用 ollama 自制单词书教案，各位可结合自身学习、工作需要，开发更多应用场景。如果这份利用 ollama 自制的教案对你有帮助，欢迎转发分享给你的朋友。点击下载<a href="project.zip">本文代码</a>。</p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>Open Sanctions | 使用该网站可查询被制裁的个人、企业组织等制裁清单</title>
      <link>https://textdata.cn/blog/2024-07-08-open-sanctions-dataset/</link>
      <pubDate>Mon, 08 Jul 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-07-08-open-sanctions-dataset/</guid>
      <description>&lt;p&gt;收集被制裁数据是一项劳动密集型过程，包括数据清理和质量保证。这给所有用户带来了不必要的重复工作，无论他们是金融科技/监管科技技术专家、调查记者、学者还是其他人。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;声明: 大邓本人十分热爱「种花家」， 并不认可其他国家对「种花家」相关实体的制裁。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一关于&#34;&gt;一、关于&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://www.opensanctions.org/&#34;&gt;OpenSanctions&lt;/a&gt; 是一个包含具有政治、犯罪或经济利益的个人和公司的国际数据库。该数据库将制裁名单、政治公众人物数据库和其他与公众利益相关的人员信息整合成一个易于使用的数据集。这样可以轻松实现以下操作：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;交叉检查数据库&lt;/strong&gt;是否存在利益冲突和非法活动的迹象。&lt;/li&gt;
&lt;li&gt;在国际交易中&lt;strong&gt;筛选潜在客户和合作伙伴。&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;追踪政治冲突&lt;/strong&gt;并比较国家制裁政策。&lt;/li&gt;
&lt;li&gt;将制裁和关注人员图表整合到现有数据产品中。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-research.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-dataset.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二有何不同&#34;&gt;二、有何不同？&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;全面覆盖&lt;/strong&gt;：OpenSanctions将来自&lt;a href=&#34;https://www.opensanctions.org/datasets/&#34;&gt;数百个数据源&lt;/a&gt;和世界各地的数据整合成一个包含制裁、&lt;a href=&#34;https://www.opensanctions.org/pep/&#34;&gt;政治公众人物&lt;/a&gt;和犯罪相关实体的单一数据集。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;关注数据质量&lt;/strong&gt;：OpenSanctions的数据集经过仔细清理，包括&lt;a href=&#34;https://www.opensanctions.org/articles/2021-11-11-deduplication/&#34;&gt;跨列表实体去重&lt;/a&gt;的人机协同流程，以及数千个手工制作的&lt;em&gt;数据补丁&lt;/em&gt;，以一致的方式构造识别信息，如出生日期、国家、地址或税务标识符。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;为每个人提供批量数据&lt;/strong&gt;：OpenSanctions使原始数据易于访问，支持需要访问完整档案（而不是逐个实体的 API）的用例，甚至使OpenSanctions的客户能够在自己的基础设施内&lt;a href=&#34;https://www.opensanctions.org/docs/self-hosted/&#34;&gt;自行托管我们的 API 服务器。&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;可审计和开源&lt;/strong&gt;：任何人都可以通过浏览&lt;a href=&#34;https://github.com/opensanctions&#34;&gt;源代码&lt;/a&gt;来验证 OpenSanctions 的工作原理、指出问题、提出修改和改进建议。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三团队资金来源&#34;&gt;三、团队&amp;amp;资金来源&lt;/h2&gt;
&lt;p&gt;OpenSanctions 的开发和维护由一家营利性实体（OpenSanctions Datenbanken GmbH）协调，该实体提供&lt;a href=&#34;https://www.opensanctions.org/licensing/&#34;&gt;批量数据订阅&lt;/a&gt;和数据&lt;a href=&#34;https://www.opensanctions.org/api/&#34;&gt;API 访问&lt;/a&gt;。其目标是实现财务可持续性，使我们能够持续保持数据的可用性和可靠性。&lt;/p&gt;
&lt;p&gt;2017 年至 2019 年，爬虫的维护由&lt;a href=&#34;https://occrp.org/&#34;&gt;有组织犯罪和腐败报道项目&lt;/a&gt;的&lt;a href=&#34;https://sunu.in/&#34;&gt;Tarashish Mishra&lt;/a&gt;负责。您可以&lt;a href=&#34;https://github.com/opensanctions/opensanctions/graphs/contributors&#34;&gt;在 Github 上&lt;/a&gt;看到贡献爬虫的人员列表。我们还要感谢&lt;a href=&#34;https://marcdacosta.com/&#34;&gt;Marc da Costa&lt;/a&gt;、&lt;a href=&#34;https://twitter.com/mrpaulmay&#34;&gt;Paul May&lt;/a&gt;和&lt;a href=&#34;https://twitter.com/tmtm&#34;&gt;Tony Bowden&lt;/a&gt;为该项目提供的不懈建议。&lt;/p&gt;
&lt;p&gt;从 2021 年 9 月到 2022 年 2 月，该项目获得了德国联邦教育和研究部 (Bundesministerium für Bildung und Forschung, BMBF) 的资助，资助编号为&lt;code&gt;01IS21S48&lt;/code&gt;。本出版物内容的全部责任仍由其作者承担。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四演示&#34;&gt;四、演示&lt;/h2&gt;
&lt;p&gt;咱们种花家的华为公司，在国际上，其实主要是被老美制裁的。在 &lt;code&gt;https://www.opensanctions.org/search/&lt;/code&gt; 搜一下&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-huawei.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<p>收集被制裁数据是一项劳动密集型过程，包括数据清理和质量保证。这给所有用户带来了不必要的重复工作，无论他们是金融科技/监管科技技术专家、调查记者、学者还是其他人。</p>
<blockquote>
<p>声明: 大邓本人十分热爱「种花家」， 并不认可其他国家对「种花家」相关实体的制裁。</p>
</blockquote>
<p><br><br></p>
<h2 id="一关于">一、关于</h2>
<p><a href="https://www.opensanctions.org/">OpenSanctions</a> 是一个包含具有政治、犯罪或经济利益的个人和公司的国际数据库。该数据库将制裁名单、政治公众人物数据库和其他与公众利益相关的人员信息整合成一个易于使用的数据集。这样可以轻松实现以下操作：</p>
<ul>
<li><strong>交叉检查数据库</strong>是否存在利益冲突和非法活动的迹象。</li>
<li>在国际交易中<strong>筛选潜在客户和合作伙伴。</strong></li>
<li><strong>追踪政治冲突</strong>并比较国家制裁政策。</li>
<li>将制裁和关注人员图表整合到现有数据产品中。</li>
</ul>
<p><img loading="lazy" src="img/01-research.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-dataset.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二有何不同">二、有何不同？</h2>
<ul>
<li><strong>全面覆盖</strong>：OpenSanctions将来自<a href="https://www.opensanctions.org/datasets/">数百个数据源</a>和世界各地的数据整合成一个包含制裁、<a href="https://www.opensanctions.org/pep/">政治公众人物</a>和犯罪相关实体的单一数据集。</li>
<li><strong>关注数据质量</strong>：OpenSanctions的数据集经过仔细清理，包括<a href="https://www.opensanctions.org/articles/2021-11-11-deduplication/">跨列表实体去重</a>的人机协同流程，以及数千个手工制作的<em>数据补丁</em>，以一致的方式构造识别信息，如出生日期、国家、地址或税务标识符。</li>
<li><strong>为每个人提供批量数据</strong>：OpenSanctions使原始数据易于访问，支持需要访问完整档案（而不是逐个实体的 API）的用例，甚至使OpenSanctions的客户能够在自己的基础设施内<a href="https://www.opensanctions.org/docs/self-hosted/">自行托管我们的 API 服务器。</a></li>
<li><strong>可审计和开源</strong>：任何人都可以通过浏览<a href="https://github.com/opensanctions">源代码</a>来验证 OpenSanctions 的工作原理、指出问题、提出修改和改进建议。</li>
</ul>
<p><br><br></p>
<h2 id="三团队资金来源">三、团队&amp;资金来源</h2>
<p>OpenSanctions 的开发和维护由一家营利性实体（OpenSanctions Datenbanken GmbH）协调，该实体提供<a href="https://www.opensanctions.org/licensing/">批量数据订阅</a>和数据<a href="https://www.opensanctions.org/api/">API 访问</a>。其目标是实现财务可持续性，使我们能够持续保持数据的可用性和可靠性。</p>
<p>2017 年至 2019 年，爬虫的维护由<a href="https://occrp.org/">有组织犯罪和腐败报道项目</a>的<a href="https://sunu.in/">Tarashish Mishra</a>负责。您可以<a href="https://github.com/opensanctions/opensanctions/graphs/contributors">在 Github 上</a>看到贡献爬虫的人员列表。我们还要感谢<a href="https://marcdacosta.com/">Marc da Costa</a>、<a href="https://twitter.com/mrpaulmay">Paul May</a>和<a href="https://twitter.com/tmtm">Tony Bowden</a>为该项目提供的不懈建议。</p>
<p>从 2021 年 9 月到 2022 年 2 月，该项目获得了德国联邦教育和研究部 (Bundesministerium für Bildung und Forschung, BMBF) 的资助，资助编号为<code>01IS21S48</code>。本出版物内容的全部责任仍由其作者承担。</p>
<p><br><br></p>
<h2 id="四演示">四、演示</h2>
<p>咱们种花家的华为公司，在国际上，其实主要是被老美制裁的。在 <code>https://www.opensanctions.org/search/</code> 搜一下</p>
<p><img loading="lazy" src="img/03-huawei.png" alt=""  />
</p>
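<p>除了网页搜索，OpenSanctions 也提供批量数据下载。下面用两条笔者虚构的、FollowTheMoney 风格的实体记录，演示拿到批量数据后如何按名称关键词筛选（示例字段形态仅为示意，实际请以官方数据为准）：</p>

```python
# 示例实体为虚构数据，字段形态参考 FollowTheMoney 风格，仅演示筛选思路
entities = [
    {"id": "ent-1", "schema": "Company",
     "properties": {"name": ["Huawei Technologies Co., Ltd."], "country": ["cn"]}},
    {"id": "ent-2", "schema": "Person",
     "properties": {"name": ["John Doe"], "country": ["us"]}},
]

def search_by_name(entities, keyword):
    """按名称关键词（不区分大小写）筛选实体"""
    kw = keyword.lower()
    return [e for e in entities
            if any(kw in name.lower() for name in e["properties"].get("name", []))]

hits = search_by_name(entities, "huawei")
print([e["id"] for e in hits])   # → ['ent-1']
```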
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>科学上网违法吗</title>
      <link>https://textdata.cn/blog/2024-06-30-law-about-vpn/</link>
      <pubDate>Sun, 30 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-30-law-about-vpn/</guid>
      <description>&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;关于国际联网的规定&#34;&gt;关于国际联网的规定&lt;/h2&gt;
&lt;p&gt;根据《中华人民共和国计算机信息网络国际联网管理暂行规定》，计算机信息网络直接进行国际联网，必须使用邮电部国家公用电信网提供的国际出入口信道。任何单位和个人不得自行建立或者使用其他信道进行国际联网。违反这一规定，由公安机关责令停止联网，给予警告，可以并处15000元以下的罚款。因此，如果传播的外网内容是通过非法渠道（如VPN）获取的，那么这种行为本身就是违法的。&lt;/p&gt;
&lt;p&gt;国务院在1996年发布该《暂行规定》，1997年修改。这个规定在过去20多年中“备而无用”：具有法律效力，但长期未见实际执法。监管自2017年起开始收紧VPN市场：2017年1月22日，工业和信息化部公布了《关于清理规范互联网网络接入服务市场的通知》（下称《通知》），决定从当日起至2018年3月31日，在全国范围内清查网络基础设施和IP地址、宽带等网络接入资源。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“可能只是下载了一个VPN软件帮你连到国外，其实都是违法”。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<br>
<br>
<h2 id="关于国际联网的规定">关于国际联网的规定</h2>
<p>根据《中华人民共和国计算机信息网络国际联网管理暂行规定》，计算机信息网络直接进行国际联网，必须使用邮电部国家公用电信网提供的国际出入口信道。任何单位和个人不得自行建立或者使用其他信道进行国际联网。违反这一规定，由公安机关责令停止联网，给予警告，可以并处15000元以下的罚款。因此，如果传播的外网内容是通过非法渠道（如VPN）获取的，那么这种行为本身就是违法的。</p>
<p>国务院在1996年发布该《暂行规定》，1997年修改。这个规定在过去20多年中“备而无用”：具有法律效力，但长期未见实际执法。监管自2017年起开始收紧VPN市场：2017年1月22日，工业和信息化部公布了《关于清理规范互联网网络接入服务市场的通知》（下称《通知》），决定从当日起至2018年3月31日，在全国范围内清查网络基础设施和IP地址、宽带等网络接入资源。</p>
<blockquote>
<p>“可能只是下载了一个VPN软件帮你连到国外，其实都是违法”。</p>
</blockquote>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>转载 | 人生认知有层次</title>
      <link>https://textdata.cn/blog/2024-06-30-think-different/</link>
      <pubDate>Sun, 30 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-30-think-different/</guid>
      <description>我相信大多数人树立「自己理想」的画面大概率都是站在五星红旗下，立下我要做共产主义接班人。但随着年龄增长，国人新的理想没有树立起来，而旧的理想也逐渐淡去。于是可怜可悲的人生开始了，渐渐地我们变成现实的人。但现实就一定能过好这一生吗？</description>
      <content:encoded><![CDATA[<blockquote>
<p>我相信大多数人树立「自己理想」的画面大概率都是站在五星红旗下，立下我要做共产主义接班人。但随着年龄增长，国人新的理想没有树立起来，而旧的理想也逐渐淡去。于是可怜可悲的人生开始了，渐渐地我们变成现实的人。但现实就一定能过好这一生吗？</p>
<p>今天又一次读到谢春霖《认知红利》中认知的六个层次，兼具理论性和可操作性。我觉得参照书中逻辑来指导（优化）人生系统，可操作性很高，遂简单整理，分享给大家。</p>
</blockquote>
<br>
<h2 id="问题">问题</h2>
<p>假设X拥有「某品牌运动鞋」的品牌店， 门店在上海闹市区经营多年， 店里有一批员工，X每周都会来店里了解经营情况。但近来，X发现</p>
<ul>
<li>生意越来越差</li>
<li>X发现有些鞋子的进价比淘宝上的零售价还高</li>
<li>很多客人来店里逛一圈，最后竟然都到网上下单。</li>
<li>此外， 由于生意越来越差， 店员也开始变得消极， 客人进了店， 店员都不太愿意搭理&hellip;&hellip;</li>
<li>X看到这个情况非常生气，但刚准备发火， 其中某个店员竟然向X提出辞呈。</li>
<li>紧随其后，各种糟糕的事情发生， 房租变贵、滞销导致库存增加、城市中逛街的人流量变小。</li>
<li>&hellip;&hellip;</li>
</ul>
<p>店铺开始亏损，而X之前投入的大量装修成本和库存，如果现在关门，X的损失将非常大， 这个时候怎么办？</p>
<p>假设你是X， 你会怎么办?</p>
<br>
<br>
<h2 id="认知层次">认知层次</h2>
<p>谢春霖的《认知红利》提出了人类认知的六个层次，从低到高，依次是环境、行为、能力、BVR(价值观体系)、身份、精神。一般而言，认知层次越高，解决起来越容易越有效。针对X遇到的问题，处于不同层次，解决办法和效果可能是如下</p>
<table>
<thead>
<tr>
<th>层次</th>
<th>思维</th>
<th>类似问题</th>
<th>理论基础</th>
<th>措施</th>
<th>不足</th>
</tr>
</thead>
<tbody>
<tr>
<td>环境</td>
<td>错在外界</td>
<td>工作不顺， 领导是白痴<br>工作十年，没有晋升， 因为公司有办公室政治。<br>自己命不好</td>
<td>都是环境的错，改变环境，就能改变现有的处境</td>
<td></td>
<td>环境很难改变，或者改变的很慢，所以效果很差。</td>
</tr>
<tr>
<td>行为</td>
<td>错在自己</td>
<td>收入太低，因为不够努力<br>买不起房子， 因为不够努力<br>创业失败, 因为不够努力</td>
<td>爱拼才会赢</td>
<td>店铺营业时间从8小时改为24小时,<br>店员三班倒， 闲暇时间打电话找客户。</td>
<td>努力，是成功的必要不充分条件。有时候有效，有时候无效。</td>
</tr>
<tr>
<td>能力</td>
<td>方法(思路)比问题多</td>
<td>线下门店生意不好，可能是因为经营模式老旧,需要学习新的商业模式<br>和男朋友关系处的不好， 可能是沟通能力有问题， 需要专门去学《关键对话》等书</td>
<td>一定有人遇到过类似的问题，且已经有更好的解决办法。</td>
<td>将处境拆解成团队管理、营销方式、商业模式等不同的小问题。</td>
<td>选择不同的问题， 走向也将不同。 一旦选错， 只会离正确越来越远。如何选择， 是个更大的问题！</td>
</tr>
<tr>
<td>BVR</td>
<td>价值观(什么是最重要的)</td>
<td>我只想过不差钱的人生，为此我要学习经商专业，不浪费时间，做最有效率的事情。</td>
<td>Believe信念，相信什么是对的<br>Value价值观，A和B哪个更重要<br>Rule,做事的原则<br></td>
<td>团队管理、营销方式、商业模式哪个是最关键的问题，彼此之间有什么关系。是否遗漏了未知，但能改天换地因素。<br>淘宝的出现，导致交易结构发生变化(省去中间商赚差价)。客户因为淘宝便宜，而最终在淘宝下单。但线下店最大的优势是体验丰富，可开展多种体验活动，如全城跑不死大赛，让喜欢慢跑的人加入。</td>
<td>人生赢家。依然会面临选择，如年薪百万(无风险)和经营店铺收入百万(有风险)，如何选择？</td>
</tr>
<tr>
<td>身份</td>
<td>自己想成为什么样的人</td>
<td>成为心血管医生，造福这类疾病的患者。<br>我要当核物理学家，因为我觉得这很酷。</td>
<td>角色还是身份<br/>工地上，同样的搬砖的工作， 有的人认为我是搬砖的;也有人认为自己应该成为改变城市天际线的画家。<br>角色是被动，是别人给自己的<br>身份是自己主要选择的，是自己想成为的。</td>
<td>告诉自己， 我要成为自己做主的老板，而不是被别人定义自己。</td>
<td>世间高人，几乎无瑕疵<br></td>
</tr>
<tr>
<td>精神</td>
<td>人活着就是为了改变世界</td>
<td>为天地立心<br>为生民立命<br>为往圣继绝学<br>为万世开太平。</td>
<td>人与世界的关系；<br>人生使命；</td>
<td>做对世界、对社会有用的人。我为人人，追求大我。在成就他人的同时，成就自我。</td>
<td>认知拉满，人生无价，人生无憾。</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="如何成为时代佼佼者">如何成为时代佼佼者？</h2>
<p>是否需要一级级打怪， 从低级到高级？</p>
<p>No！你可以直接让自己站在最高层次，从高到低做好顶层设计，从精神层开始，从上往下规划。</p>
<p><img loading="lazy" src="img/life-path.png" alt=""  />
</p>
<table>
<thead>
<tr>
<th>理解层次</th>
<th>思考内容</th>
</tr>
</thead>
<tbody>
<tr>
<td>精神</td>
<td>我的人生使命是什么？世界因为我变得有什么不一样？</td>
</tr>
<tr>
<td>身份</td>
<td>为了实现这个使命？五年后，我会变成一个什么样的人？</td>
</tr>
<tr>
<td>BVR(价值观体系)</td>
<td>一套什么样的信念、价值观、原则能帮助我达到这个身份?<br>什么是最重要的?<br>我应该坚持什么，放弃什么？</td>
</tr>
<tr>
<td>能力</td>
<td>为了实现这个身份和这套BVR价值观体系， 我应该学什么知识技能？<br>掌握什么方法套路？<br>什么可以做？什么不可以做？</td>
</tr>
<tr>
<td>行为</td>
<td>具体怎么做？第一步是什么？今年的计划具体怎么安排</td>
</tr>
<tr>
<td>环境</td>
<td>哪些人和资源可以帮助我实现目标？我如何去使用身边的资源。</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="非同凡想">非同凡想</h2>
<p>1997年，美国苹果公司创始人史蒂夫·乔布斯在苹果公司广告《非同凡想》（原名 Think Different，也被译为《致疯狂的人》）中发表了这段讲话。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Here&#39;s to the crazy ones.
向那些疯狂的人致敬。

The misfits.
致特立独行者。

The rebels.
致桀骜不驯者。

The troublemakers.
致惹是生非者。

The round pegs in the square holes.
这些人是方孔中的圆钉。

The ones who see things differently.
他们以不同的角度看世界。

They&#39;re not fond of rules,and they have no respect for the status quo.
他们拒绝墨守成规，也不安于现状。

You can quote them,disagree with them, glorify or vilify them.
你可以引用他们，反对他们，赞扬他们或贬低他们。

About the only thing that you can&#39;t do is ignore them.
但你唯独就是不能漠视他们。

Because they change things.
因为他们改变了世界。

They invent. They imagine. They heal.
他们发明创造，发挥想象，治愈世界。

They explore. They create. They inspire.
他们探索未知，创造奇迹，激发灵感。

They push the human race forward.
他们推动人类不断前进。

Maybe they have to be crazy.
也许有时候他们必须疯狂。

How else can you stare at an empty canvas and see a work of art?
否则你能只盯着空空如也的画布就创造出艺术作品吗？

Or sit in silence and hear a song that&#39;s never been written?
否则你能只静静坐着就唱出一首没有写出来的歌曲吗？

Or gaze at a red planet and see a laboratory on wheels?
否则你能只凝视火星时就想到移动的太空实验室吗？

We make tools for these kinds of people.
我们为这些人创造工具。

While some may see them as the crazy ones, we see genius.
有些人可能视他们为疯子，我们则视他们为天才。

Because the people who are crazy enough to think that they can are the ones who do.
因为只有疯狂到认为自己能改变世界的人，才能真正改变世界。
</code></pre></div><p><br><br></p>
<h2 id="声明">声明</h2>
<p>侵删， 微信372335839</p>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集(英文) | USA Today新闻数据集(2012~2024)</title>
      <link>https://textdata.cn/blog/2024-06-22-usa_today_daily-news-dataset/</link>
      <pubDate>Sat, 22 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-22-usa_today_daily-news-dataset/</guid>
      <description>媒体数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<p>今日分享一个数据集「<a href="https://www.usatoday.com/">今日美国USA Today</a>」，该网站在国内可合法访问（参见<a href="2024-06-30-law-about-vpn">科学上网违法吗</a>），只是访问速度比较慢。</p>
<p><br><br></p>
<h2 id="一usa-today数据集">一、USA Today数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集:  USA Today
数据源:  https://www.usatoday.com/
记录数:  532628
覆盖日期: 2001-02-21 ~ 2024-06-30
数据格式: CSV
数据体积: 3422 M
所含字段: date、title、content、author_link、publisher、link
本文声明: 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/01-usa-today.jpg" alt=""  />
</p>
<br>
<h3 id="12-数据用途">1.2 数据用途</h3>
<p>可提取丰富的指标，包括但不限于<strong>经济政策不确定性指数</strong>、<strong>环境政策不确定性</strong>、<strong>媒体关注度指数</strong>、<strong>文本相似度</strong>、<strong>情感分析</strong>。此外，可训练词向量，开发新的概念词典。数据带有时间字段，参照前述指标，按主体、日期、指标进行计算，可构造面板数据，构建新的指标指数。因此在经济学、管理学、新闻传播学、公共管理、社会学等领域均有较高的研究价值。</p>
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><p><br><br></p>
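<p>以「媒体关注度指数」为例，其最简做法通常是按月统计标题（或正文）中出现特定关键词的报道数。下面用几条笔者虚构的记录演示思路，字段名沿用本数据集的 date、title：</p>

```python
import pandas as pd

# 虚构的新闻记录，仅演示指标构建思路
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11"]),
    "title": ["Economic policy uncertainty rises",
              "Sports roundup",
              "Policy uncertainty eases"],
})

# 标记标题是否命中关键词（不区分大小写）
df["hit"] = df["title"].str.contains("uncertainty", case=False).astype(int)

# 按月汇总命中次数，得到最简版的月度关注度序列
monthly = df.groupby(df["date"].dt.to_period("M"))["hit"].sum()
print(monthly.tolist())   # → [1, 1]
```

实际研究中通常还需对分母（当月总发文量）做标准化，此处从略。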
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;USA_Today.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="22-所含字段">2.2 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;date&#39;, &#39;title&#39;, &#39;content&#39;, &#39;author_link&#39;, &#39;publisher&#39;, &#39;link&#39;], dtype=&#39;object&#39;)
</code></pre></div><br>
<h3 id="23-查看记录数">2.3 查看记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">print(&#39;记录数：&#39;, len(df))
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">记录数： 532628
</code></pre></div><br>
<h3 id="24-覆盖日期">2.4 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;起:  &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;止:  &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">起:   2001-02-21 21:01:00
止:   2024-06-30 10:55:00
</code></pre></div><br>
<h3 id="25-数据体积">2.5 数据体积</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">size</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">memory_usage</span><span class="p">(</span><span class="n">deep</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">/</span><span class="mi">1024</span><span class="o">/</span><span class="mi">1024</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;数据体积 </span><span class="si">{</span><span class="n">size</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1"> M&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据体积 3422 M
</code></pre></div><br>
<h3 id="26-发文量统计">2.6 发文量统计</h3>
<p>按月度，统计发文量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">months</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">month_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="c1">#print(date)</span>
    <span class="n">months</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">date</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">month_df</span><span class="p">)))</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">months</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">date_labels</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2001</span><span class="p">,</span> <span class="mi">2025</span><span class="p">)]</span>


<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;USA Today月度发文量(2001.02 ~2024.06)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;月度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;发文量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">date_labels</span><span class="p">),</span> <span class="n">labels</span><span class="o">=</span><span class="n">date_labels</span><span class="p">)</span>  <span class="c1"># breaks须为日期时间，labels为对应文本</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-plot.png" alt=""  />
</p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-china-daily-dataset/">数据集(中英) | ChinaDaily新闻数据集(2008 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2024-07-12-entrepreneur-dataset/">数据集 | 企业家Entrepreneur杂志数据集(1996 ~ 2024)</a></li>
<li><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></li>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 如何用Python计算知识宽度(赫芬达尔—赫希曼指数)</title>
      <link>https://textdata.cn/blog/2024-06-20-using-python-to-caculate-herfindahl-hirschman-index/</link>
      <pubDate>Thu, 20 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-20-using-python-to-caculate-herfindahl-hirschman-index/</guid>
      <description>赫芬达尔-赫希曼指数(Herfindahl-Hirschman Index)作为一种衡量市场集中度的经济指标，通常用于分析产业或市场中企业份额的分布情况。近年来有学者使用HHI算法测量专利所涉领域的集中程度，反映专利的知识宽度。我们是否可能利用HHI来量化某个语料库中不同词汇的使用频率分布，以此来分析个人、群体或时代的语言风格、词汇丰富度，或是语言标准化与变化的趋势？如果词汇分布非常均匀，表明语言使用中的词汇多样性高，HHI值就会较低；反之，如果少数词汇占据了大部分文本空间，表明词汇使用集中，HHI值则较高。</description>
      <content:encoded><![CDATA[<h2 id="一相关概念">一、相关概念</h2>
<h3 id="11-赫芬达尔-赫希曼指数">1.1 赫芬达尔-赫希曼指数</h3>
<p><strong>赫芬达尔-赫希曼指数(Herfindahl-Hirschman Index)</strong>作为一种衡量市场集中度的经济指标，通常用于分析产业或市场中企业份额的分布情况。近年来有学者使用HHI算法测量专利所涉领域的集中程度，反映专利的知识宽度。</p>
<blockquote>
<p><strong>知识宽度</strong>是指在特定领域或跨领域中，个人或组织掌握的知识的多样性和广度。</p>
</blockquote>
<br>
<p>假设某行业有N家公司，第i家公司的市场份额为MS<sub>i</sub>，则该行业的HHI指数计算公式为：</p>
<p><img loading="lazy" src="img/01-hhi-algo.png" alt=""  />
</p>
<br>
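<p>上面的公式可以直接翻译成几行Python。下面是一个极简示意(其中 market_shares 为本文虚构的示例数据)：</p>

```python
# 示意：根据各公司市场份额计算HHI
def hhi(shares):
    """HHI = 各份额的平方和；份额越集中，HHI越大"""
    return sum(s ** 2 for s in shares)

# 假设某行业有三家公司，市场份额分别为50%、30%、20%
market_shares = [0.5, 0.3, 0.2]
print(hhi(market_shares))  # 约0.38

# 完全垄断(一家独占)时，HHI达到最大值1
print(hhi([1.0]))
```

<p>份额完全均分为N份时，HHI取最小值1/N。</p>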
<h3 id="12-专利ipc号">1.2 专利IPC号</h3>
<p>IPC号是国际专利分类体系（International Patent Classification, IPC）的缩写，它是一个用于将专利归类到特定技术领域的全球性标准。IPC系统由世界知识产权组织（WIPO）维护，旨在标准化专利文献的分类，以便于检索和分析。</p>
<p><img loading="lazy" src="img/02-ipc-class.png" alt=""  />
</p>
<p>IPC号的结构如下：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1. 部(Section): 用大写字母A到H表示，共分8个部，每个部覆盖特定的技术领域。
2. 大类(Class): 每个部下面进一步细分为大类，用两位数字表示。
3. 小类(Subclass): 大类下面再细分为小类，用大写字母表示。
4. 大组(Main Group) 和 小组(Sub-Group)：小类下面进一步细分，用斜杠（/）分隔的数字表示。
   - 大组：斜杠前面的两位数字。
   - 小组：斜杠后面的两位数字。
</code></pre></div><br>
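<p>按照上述结构，可以写一个简单函数把单个IPC号拆解成各级分类。以下为示意代码(函数 parse_ipc 为本文自拟，假设IPC号形如A01B01/00)：</p>

```python
# 示意：拆解形如 A01B01/00 的单个IPC号
def parse_ipc(ipc):
    main, sub_group = ipc.split('/')  # 斜杠分隔大组与小组
    return {
        '部': main[0],        # 大写字母A~H
        '大类': main[1:3],    # 两位数字
        '小类': main[3],      # 大写字母
        '大组': main[4:],     # 斜杠前的数字
        '小组': sub_group,    # 斜杠后的数字
    }

print(parse_ipc('A01B01/00'))
# {'部': 'A', '大类': '01', '小类': 'B', '大组': '01', '小组': '00'}
```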
<h3 id="13-专利与hhi">1.3 专利与HHI</h3>
<p>在创新领域，使用HHI计算专利质量时，最小粒度是大组(Group)。举例说明：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|专利序号|       IPC号         |
|  1   | A01B01/00;A01B01/01 |
|  2   | A01B01/00;A01C01/01 |
</code></pre></div><p>如果用HHI计算</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">#1种知识，份额1/1
patent1_HHI = (1/1)*(1/1)  = 1

#2种知识，份额各1/2
patent2_HHI = (1/2)*(1/2) + (1/2)*(1/2) = 1/2
</code></pre></div><p>从知识集中程度（HHI）来看，专利1的知识更聚焦。</p>
<p>衡量一个人的知识，有广度和深度两种不同的角度。在创新创业领域，习惯用专利的(1-HHI)来表示专利质量(广度)。</p>
<p><br><br></p>
<h2 id="二实验-衡量专利质量">二、实验: 衡量专利质量</h2>
<p>准备了五个专利：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#34;A01B01/00&#34;
&#34;A01B01/00;B01D01/01&#34;
&#34;A01B01/00;B01D01/01;C01B01/01&#34;
&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01&#34;
&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01;F01B01/01&#34;
</code></pre></div><p>从上到下，知识宽度越来越大， 集中程度(HHI)越来越小。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>

<span class="k">def</span> <span class="nf">ipc_hhi</span><span class="p">(</span><span class="n">ipc_text</span><span class="p">):</span>
  	<span class="c1">#ipc_text字符串，形如&#34;A01B01/00;B01D01/01;F01B01/01&#34;</span>
    <span class="n">ipc_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">group</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">ipc_text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)]</span>
    <span class="n">ipc_group_counts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">Counter</span><span class="p">(</span><span class="n">ipc_list</span><span class="p">)</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
    <span class="n">ipc_props</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ipc_group_counts</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">ipc_group_counts</span><span class="p">)</span>
    <span class="n">hhi_value</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">ipc_prop</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">ipc_prop</span> <span class="ow">in</span> <span class="n">ipc_props</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">hhi_value</span>
  
<span class="nb">print</span><span class="p">(</span><span class="n">ipc_hhi</span><span class="p">(</span><span class="s2">&#34;A01B01/00&#34;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ipc_hhi</span><span class="p">(</span><span class="s2">&#34;A01B01/00;B01D01/01&#34;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ipc_hhi</span><span class="p">(</span><span class="s2">&#34;A01B01/00;B01D01/01;C01B01/01&#34;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ipc_hhi</span><span class="p">(</span><span class="s2">&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01&#34;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ipc_hhi</span><span class="p">(</span><span class="s2">&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01;F01B01/01&#34;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1.0
0.5
0.3333333333333333
0.25
0.20000000000000004
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">print(1-ipc_hhi(&#34;A01B01/00&#34;))
print(1-ipc_hhi(&#34;A01B01/00;B01D01/01&#34;))
print(1-ipc_hhi(&#34;A01B01/00;B01D01/01;C01B01/01&#34;))
print(1-ipc_hhi(&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01&#34;))
print(1-ipc_hhi(&#34;A01B01/00;B01D01/01;C01B01/01;D01B01/01;F01B01/01&#34;))
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0.0
0.5
0.6666666666666667
0.75
0.7999999999999999
</code></pre></div><p>知识宽度(1-hhi)越来越大。</p>
<p><br><br></p>
<h2 id="三语言的hhi">三、语言的HHI</h2>
<h3 id="31-联想">3.1 联想</h3>
<blockquote>
<p>本节内容借助大模型生成，未查阅文献。笔者向通义千问提问：「赫芬达尔-赫希曼指数(Herfindahl-Hirschman Index)是否可以测量一个人用语(表达)的特质」。</p>
</blockquote>
<p>前人类比市场集中度，用HHI测量专利质量(知识宽度)。那么放到文本语言中，我们是否可以利用HHI量化某个语料库中不同词汇的使用频率分布，以此分析个人、群体或时代的语言风格、词汇丰富度，或是语言标准化与变化的趋势？</p>
<ul>
<li>如果词汇分布非常均匀，表明语言使用中的词汇多样性高，HHI值就会较低；</li>
<li>反之，如果少数词汇占据了大部分文本空间，表明词汇使用集中，HHI值则较高。</li>
</ul>
<p>HHI可以结合其他语言学指标一起使用，比如TTR（Type-Token Ratio，类型-标记比率）、Shannon entropy（香农熵）等，共同评估语言表达的复杂度和多样性。不过，这类研究的文献相对较少，因为语言学领域有自己一套成熟且专业的分析工具和方法，HHI更多地被视为跨学科应用的一个创新尝试。</p>
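<p>以TTR和香农熵为例，下面给出一个计算示意(为聚焦思路，这里直接使用分好词的词语列表；实际应用中可先用jieba.lcut分词)：</p>

```python
import math
from collections import Counter

def ttr(words):
    """类型-标记比率：不同词数 / 总词数，越接近1表示词汇越多样"""
    return len(set(words)) / len(words)

def shannon_entropy(words):
    """香农熵：词频分布越均匀，熵越大"""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

words = ['这场', '音乐会', '太', '嗨', '了']  # 假设已完成分词
print(ttr(words))              # 1.0，五个词互不重复
print(shannon_entropy(words))  # log2(5) ≈ 2.32
```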
<br>
<h3 id="32-词语的hhi">3.2 词语的HHI</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">jieba</span>

<span class="k">def</span> <span class="nf">word_hhi</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;计算文本词汇使用的HHI&#34;&#34;&#34;</span>
    <span class="n">word_counts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">Counter</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">))</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
    <span class="n">word_props</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">word_counts</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">word_counts</span><span class="p">)</span>
    <span class="n">hhi_value</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">w_prop</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">w_prop</span> <span class="ow">in</span> <span class="n">word_props</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">hhi_value</span>
  
<span class="n">personA</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会太嗨了&#39;</span>
<span class="n">personB</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会说出来令你不敢相信，主办方策划有方，群众激情满满，我印象深刻，体验感拉满&#39;</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A-hhi&#39;</span><span class="p">,</span> <span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B-hhi&#39;</span><span class="p">,</span> <span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="o">-</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="o">-</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">A-hhi 0.20000000000000004
B-hhi 0.07024793388429751

A词汇多样性 0.7999999999999999
B词汇多样性 0.9297520661157025
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext --upgrade</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">personA</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会太嗨了&#39;</span>
<span class="n">personB</span> <span class="o">=</span> <span class="s1">&#39;这场音乐会说出来令你不敢相信，主办方策划有方，群众激情满满，我印象深刻，体验感拉满&#39;</span>


<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A-hhi&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B-hhi&#39;</span><span class="p">,</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personA</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;B词汇多样性&#39;</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ct</span><span class="o">.</span><span class="n">word_hhi</span><span class="p">(</span><span class="n">personB</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">A-hhi 0.20000000000000004
B-hhi 0.07024793388429751

A词汇多样性 0.7999999999999999
B词汇多样性 0.9297520661157025
</code></pre></div><br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>MOR | 使用md&amp;a测量企业民族主义指标</title>
      <link>https://textdata.cn/blog/2024-06-17-firms-rhetorical-nationalism/</link>
      <pubDate>Tue, 18 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-17-firms-rhetorical-nationalism/</guid>
      <description>该论文开发了一种企业层面的修辞民族主义计算方法。回顾文献，开发了与企业相关的四维民族主义理论框架：民族自豪感、排外主义、主导议程（民族复兴）和企业角色。然后，使用文本分析方法计算了 2000 年至 2020 年中国上市公司的 41,000 多份年度报告，并为每个维度确定了一个词典。该论文数据集可公开访问：https://sites.google.com/view/firms-rhetorical-nationalism</description>
      <content:encoded><![CDATA[<h2 id="一文献">一、文献</h2>
<p>Yue, Lori Qingyuan, Jiexin Zheng, and Kaixian Mao. &ldquo;Firms’ Rhetorical Nationalism: Theory, Measurement, and Evidence from a Computational Analysis of Chinese Public Firms.&rdquo; <em>Management and Organization Review</em> 20, no. 2 (2024): 161-203.</p>
<h3 id="摘要">摘要</h3>
<p>本文建立了 <strong>企业民族主义</strong> 的理论框架和概念测量。我们首先回顾了相关文献，并建立了一个四维的企业民族主义理论框架：民族自豪感、排外主义、主导议程（民族复兴）和企业角色(在实现国家民族主义目标中的使命和角色)。我们使用基于机器学习的文本分析方法，对2000年至2020年中国上市公司的41,000多份年报进行分析，并为每个维度确定了一个词库。利用相关词汇的加权比例，我们建立了中国上市公司语言民族主义的测量指标，并首次提供了中国上市公司语言民族主义上升的实证证据。企业在语言上表现出的民族主义与其战略因素有关：国有企业、历史较长、规模较大、盈利能力较强、面向消费者、个人投资者较多、海外销售额较少的企业表现出的民族主义水平较高。这些在语言上表现出更多民族主义的企业，其未来的财务回报率也较高。</p>
<br>
<br>
<h2 id="二企业修辞民族主义">二、企业修辞民族主义</h2>
<h3 id="21-原文算法">2.1 原文算法</h3>
<p>依据大邓对论文的理解，复现企业修辞民族主义测量过程，大致可分为3个步骤：</p>
<ul>
<li>Step1 民族主义(四维度)理论基础</li>
<li>Step2 使用<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">md&amp;a语料训练words2vec</a>，扩充民族主义词典</li>
<li>Step3 使用民族主义词典，以tfidf方式计算民族主义指标</li>
</ul>
<br>
<h3 id="22-已有资源">2.2 已有资源</h3>
<p>大邓已有的数据或者工具</p>
<ol>
<li>已有<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">md&amp;a训练的word2vec模型</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.1.7</a>的sentiment函数，可实现词典的文本分析</li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">cntext2.1.7</a>内置了民族主义词典</li>
</ol>
<br>
<h3 id="23-注意">2.3 注意</h3>
<p>所以我们直接进行到Step3。为了降低本文的复现难度，没有使用tfidf方式测量。</p>
<blockquote>
<ul>
<li>
<p>常规文本分析默认词典中的所有词语权重均为1，</p>
</li>
<li>
<p>而tfidf认为词典中的词语是有差异的，带着不同的权重。</p>
</li>
</ul>
</blockquote>
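<p>两种口径的差异可以用一个极简示意来体会(词表与语料均为本文虚构，idf按log(N/df)计算，与原论文的具体加权方式未必一致)：</p>

```python
import math

# 虚构的小词表与三篇已分词的"文档"
dict_words = ['自主', '创新', '中国梦']
docs = [
    ['公司', '坚持', '自主', '创新'],
    ['自主', '研发', '投入', '增加'],
    ['实现', '中国梦'],
]

# 口径一：词典词等权重(权重均为1)，得分即命中次数
def plain_score(doc):
    return sum(1 for w in doc if w in dict_words)

# 口径二：按idf加权，语料中越罕见的词权重越高
N = len(docs)
idf = {w: math.log(N / sum(w in d for d in docs)) for w in dict_words}

def weighted_score(doc):
    return sum(idf[w] for w in doc if w in dict_words)

print(plain_score(docs[0]), weighted_score(docs[0]))  # 命中'自主''创新'
print(plain_score(docs[2]), weighted_score(docs[2]))  # 命中'中国梦'
```

<p>可以看到，等权重口径只看命中次数；而加权口径下，罕见词"中国梦"单次命中的贡献大于常见词"自主"。</p>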
<p><br><br></p>
<h2 id="三代码实现">三、代码实现</h2>
<h3 id="31-查看词典">3.1 查看词典</h3>
<p>大邓整理了论文中的词表，将其内置于cntext2.1.7</p>
<p><img loading="lazy" src="img/01-dict.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext==2.1.7</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">nationism_diction_info</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_RhetoricalNationalism.yaml&#39;</span><span class="p">)</span>
<span class="n">nationism_diction_info</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;Name&#39;: &#39;Firms Rhetorical Nationalism&#39;,
 
 &#39;Desc&#39;: &#39;企业修辞民族主义，含四个词表， 分别是民族自豪感、排外、民族复兴和企业角色。 https://sites.google.com/view/firms-rhetorical-nationalism/home&#39;,
 
 &#39;Refer&#39;: &#39;Yue, Lori Qingyuan, Jiexin Zheng, and Kaixian Mao. &#34;Firms’ Rhetorical Nationalism: Theory, Measurement, and Evidence from a Computational Analysis of Chinese Public Firms.&#34; Management and Organization Review 20, no. 2 (2024): 161-203.&#39;,
 
 &#39;Category&#39;: [&#39;民族自豪感&#39;, &#39;排外主义&#39;, &#39;民族复兴&#39;, &#39;企业角色&#39;],
 
 &#39;Dictionary&#39;: {
 		&#39;民族自豪感&#39;: [&#39;中华文化&#39;, &#39;瑰宝&#39;, &#39;源远流长&#39;,......, &#39;人民满意&#39;, &#39;纲领性文件&#39;, &#39;国民素质&#39;],
  
  	&#39;排外&#39;: [&#39;贸易战&#39;, &#39;争端&#39;, &#39;制裁&#39;,......, &#39;离岸&#39;, &#39;卡脖子&#39;, &#39;原油&#39;],
  
  	&#39;民族复兴&#39;: [&#39;中国梦&#39;, &#39;宏伟目标&#39;, &#39;共同富裕&#39;,......, &#39;新起点&#39;, &#39;新篇章&#39;],
 
 		&#39;企业角色&#39;: [&#39;自主&#39;, &#39;世界领先&#39;, &#39;独立自主&#39;, ......, &#39;产业报国&#39;,&#39;建功立业&#39;]}}
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">nationism_diction</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_RhetoricalNationalism.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span>
<span class="n">nationism_diction</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;民族自豪感&#39;: [&#39;中华文化&#39;, &#39;瑰宝&#39;, &#39;源远流长&#39;,......, &#39;人民满意&#39;, &#39;纲领性文件&#39;, &#39;国民素质&#39;],

&#39;排外&#39;: [&#39;贸易战&#39;, &#39;争端&#39;, &#39;制裁&#39;,......, &#39;离岸&#39;, &#39;卡脖子&#39;, &#39;原油&#39;],

&#39;民族复兴&#39;: [&#39;中国梦&#39;, &#39;宏伟目标&#39;, &#39;共同富裕&#39;,......, &#39;新起点&#39;, &#39;新篇章&#39;],

&#39;企业角色&#39;: [&#39;自主&#39;, &#39;世界领先&#39;, &#39;独立自主&#39;, ......, &#39;产业报国&#39;,&#39;建功立业&#39;]}
</code></pre></div><br>
<h3 id="32-小实验">3.2 小实验</h3>
<p>写代码，要先简单(抽象局部)后复杂(扩展到整体)。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#实验文本</span>
<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;某某公司高举产业报国精神， 独立自主创新， 经过多年发展，该公司在该领域处于世界领先&#39;</span>

<span class="c1">#民族主义词典</span>
<span class="n">nationism_diction</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_RhetoricalNationalism.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span>
<span class="n">nationism_diction</span>

<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">ct</span><span class="o">.</span><span class="n">sentiment</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> <span class="n">diction</span><span class="o">=</span><span class="n">nationism_diction</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.7

{&#39;民族自豪感_num&#39;: 0,
 &#39;排外主义_num&#39;: 0,
 &#39;民族复兴_num&#39;: 1,
 &#39;企业角色_num&#39;: 4,
 &#39;stopword_num&#39;: 8,
 &#39;word_num&#39;: 22,
 &#39;sentence_num&#39;: 1}
</code></pre></div><br>
<p>计算结果解读</p>
<ul>
<li><em><strong>民族自豪感_num</strong></em> 文本中民族自豪感词语出现总次数</li>
<li><em><strong>排外主义_num</strong></em> 文本中排外主义词语出现总次数</li>
<li><em><strong>民族复兴_num</strong></em> 文本中民族复兴词语出现总次数</li>
<li><em><strong>企业角色_num</strong></em> 文本中企业角色词语出现总次数</li>
<li><em><strong>stopword_num</strong></em> 文本中停用词词语出现总次数</li>
<li><em><strong>word_num</strong></em> 文本中词语总数</li>
<li><em><strong>sentence_num</strong></em> 文本中句子总数</li>
</ul>
<br>
<h3 id="33-读取mda">3.3 读取md&amp;a</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda01-23.csv.gz&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<h3 id="34-批量计算民族主义">3.4 批量计算民族主义</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandarallel</span> <span class="kn">import</span> <span class="n">pandarallel</span>
<span class="n">pandarallel</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">nationism_stats</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">sentiment</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> <span class="n">diction</span><span class="o">=</span><span class="n">nationism_diction</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">return_df</span><span class="o">=</span><span class="kc">False</span><span class="p">))</span>

<span class="c1">#统计词频</span>
<span class="c1">#并行运算</span>
<span class="n">stats_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">parallel_apply</span><span class="p">(</span><span class="n">nationism_stats</span><span class="p">)</span>

<span class="c1">#计算四个维度民族主义的指标</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;民族自豪感_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;排外主义&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;排外主义_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;民族复兴&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;民族复兴_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;企业角色&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;企业角色_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">stats_df</span><span class="p">[[</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">,</span> <span class="s1">&#39;排外主义&#39;</span><span class="p">,</span> <span class="s1">&#39;民族复兴&#39;</span><span class="p">,</span> <span class="s1">&#39;企业角色&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>


<span class="c1">#选择需要的字段显示和存储</span>
<span class="n">select_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">,</span> <span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;民族自豪感&#39;</span><span class="p">,</span> <span class="s1">&#39;排外主义&#39;</span><span class="p">,</span> <span class="s1">&#39;民族复兴&#39;</span><span class="p">,</span> <span class="s1">&#39;企业角色&#39;</span><span class="p">,</span> <span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">]</span>
<span class="n">stats_df</span><span class="p">[</span><span class="n">select_cols</span><span class="p">]</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;A股上市公司-修辞民族主义2001-2023.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">stats_df</span><span class="p">[</span><span class="n">select_cols</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/03-dff.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="35-可视化">3.5 可视化</h3>
<p>求得A股每年的均值</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">datas</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">stats_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;A股上市公司-修辞民族主义2001-2023.csv&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">stats_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">):</span>
    <span class="n">select_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">,</span> <span class="s1">&#39;排外主义&#39;</span><span class="p">,</span> <span class="s1">&#39;民族复兴&#39;</span><span class="p">,</span> <span class="s1">&#39;企业角色&#39;</span><span class="p">,</span> <span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">]</span>
    <span class="n">ys</span> <span class="o">=</span> <span class="n">year_df</span><span class="p">[</span><span class="n">select_cols</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">datas</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">year</span><span class="p">,</span> <span class="n">ys</span><span class="p">[</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">],</span> <span class="n">ys</span><span class="p">[</span><span class="s1">&#39;排外主义&#39;</span><span class="p">],</span> <span class="n">ys</span><span class="p">[</span><span class="s1">&#39;民族复兴&#39;</span><span class="p">],</span> <span class="n">ys</span><span class="p">[</span><span class="s1">&#39;企业角色&#39;</span><span class="p">],</span> <span class="n">ys</span><span class="p">[</span><span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">]))</span>

<span class="n">stats_df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">datas</span><span class="p">)</span>
<span class="n">stats_df2</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span>  <span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;民族自豪感&#39;</span><span class="p">,</span> <span class="s1">&#39;排外&#39;</span><span class="p">,</span> <span class="s1">&#39;民族复兴&#39;</span><span class="p">,</span> <span class="s1">&#39;企业角色&#39;</span><span class="p">,</span> <span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">]</span>
<span class="n">stats_df2</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
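上面的循环逐年手工汇总均值，其实可以用 pandas 的 <code>groupby().mean()</code> 一步完成。下面是一个极简示意（使用假设的小数据，列名与正文一致，仅取两列）：

```python
import pandas as pd

# 假设数据，列名与正文的 stats_df 一致（仅取两列示意）
stats_df = pd.DataFrame({
    'year':      [2001, 2001, 2002],
    '民族自豪感': [1.0, 3.0, 2.0],
    '排外':      [0.0, 2.0, 1.0],
})

# 与正文 for 循环等价：按 year 分组后对各指标列求均值
stats_df2 = (stats_df
             .groupby('year', as_index=False)[['民族自豪感', '排外']]
             .mean())
print(stats_df2)
```

两种写法结果一致，groupby 版本更短，也省去了手工拼 DataFrame 和重设列名的步骤。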
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">stats_df2</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">stats_df2</span><span class="p">[</span><span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;民族主义(汇总)&#39;</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;black&#39;</span> <span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">stats_df2</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">stats_df2</span><span class="p">[</span><span class="s1">&#39;民族复兴&#39;</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;民族复兴&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;blue&#39;</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-.&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">stats_df2</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">stats_df2</span><span class="p">[</span><span class="s1">&#39;企业角色&#39;</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;企业角色&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;red&#39;</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;:&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">stats_df2</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">stats_df2</span><span class="p">[</span><span class="s1">&#39;排外&#39;</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;排外&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;green&#39;</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;:&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">stats_df2</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">stats_df2</span><span class="p">[</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;民族自豪感&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;--&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;A股年报修辞民族主义年度趋势(2001-2023)&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper left&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/05-plot.png" alt=""  />
</p>
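按操作系统硬编码字体名（SimHei、Arial Unicode MS）在字体未安装时会退回默认字体，中文会显示为方框。一个更稳妥的写法是先检查字体是否真的可用，以下为示意代码（candidates 是假设的候选字体列表，可按需增删）：

```python
import matplotlib
from matplotlib import font_manager

# 候选中文字体，按优先级排列（假设的候选列表）
candidates = ['SimHei', 'Arial Unicode MS', 'Noto Sans CJK SC', 'sans-serif']

# 收集系统已安装的字体名集合
installed = {f.name for f in font_manager.fontManager.ttflist}

# 取第一个已安装的候选；'sans-serif' 是 matplotlib 的通用族名，作兜底总是合法
family = next(c for c in candidates if c in installed or c == 'sans-serif')

matplotlib.rc('font', family=family)  # 设置全局字体
```

这样无论在哪个系统上运行，都会落到一个实际存在的字体，而不是盲目相信操作系统类型。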
<br>
<p>论文中民族主义4个维度的可视化</p>
<p><img loading="lazy" src="img/06-plot.png" alt=""  />
</p>
<p>两幅图的走势是近似的：在2005年都飙升到一个新水平，之后稳步上升。</p>
<h2 id="四注意">四、注意</h2>
<p>两幅图的Y轴数值差异较大，原因有二：</p>
<ol>
<li>数据集略有差异，文本清洗方法也不完全相同。</li>
<li>计算词频的同时，论文考虑到词语权重差异，使用了TF-IDF。本文默认所有词语权重为1，只统计词频。</li>
</ol>
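词频与 TF-IDF 的差别可以用一个极简例子说明（toy 语料为假设数据）：

```python
import math

# 假设的极简语料：每个文档为分词后的词列表
docs = [
    ['民族', '复兴', '企业'],
    ['企业', '创新'],
    ['民族', '自豪', '企业'],
]

# 词频（本文做法）：所有词权重视为 1，直接计数
tf = [{w: d.count(w) for w in set(d)} for d in docs]

# TF-IDF（论文做法）：对在所有文档中普遍出现的词降权
n = len(docs)
df = {}
for d in docs:
    for w in set(d):
        df[w] = df.get(w, 0) + 1
idf = {w: math.log(n / c) for w, c in df.items()}
tfidf = [{w: cnt * idf[w] for w, cnt in doc.items()} for doc in tf]

# '企业' 出现在全部 3 个文档中，idf = log(3/3) = 0，被完全降权
print(tf[0]['企业'], tfidf[0]['企业'])  # 1 0.0
```

这也解释了为什么两种算法得到的数值量级不同：TF-IDF 会把高频通用词的贡献压低甚至压到 0。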
<br>
<p>论文作者公开了数据和代码资料， 可前往 <a href="https://sites.google.com/view/firms-rhetorical-nationalism/home">https://sites.google.com/view/firms-rhetorical-nationalism/home</a></p>
<p><br><br></p>
<h2 id="五获取资料">五、获取资料</h2>
<p>加微信372335839，备注[姓名-学校-专业]</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 100元 mda01-23.csv.gz
- 30元  &#34;A股上市公司-修辞民族主义2001-2023.csv&#34;
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</title>
      <link>https://textdata.cn/blog/2024-06-16-scrapegraph-ai/</link>
      <pubDate>Sun, 16 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-16-scrapegraph-ai/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;前几日分享了&lt;a href=&#34;https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/&#34;&gt;实验 | 使用本地大模型从文本中提取结构化信息&lt;/a&gt;, 今天再分享一个  &lt;a href=&#34;https://github.com/VinciGit00/Scrapegraph-ai&#34;&gt;ScrapeGraphAI库&lt;/a&gt;， 现在还不太好用，但未来写爬虫很可能会变得越来越容易。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;ScrapeGraphAI&lt;/strong&gt;&lt;/em&gt;是一个&lt;em&gt;网络爬虫&lt;/em&gt; Python 库，使用大型语言模型和直接图逻辑为网站和本地文档（XML，HTML，JSON 等）创建爬取管道。&lt;/p&gt;
&lt;p&gt;只需告诉库您想提取哪些信息，它将为您完成！&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/scrapegraphai_logo.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;scrapegraphai有以下几种爬取管道，可用于从网站（或本地文件）提取信息：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SmartScraperGraph&lt;/code&gt;: 单页爬虫，只需用户提示和输入源；&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SearchGraph&lt;/code&gt;: 多页爬虫，从搜索引擎的前 n 个搜索结果中提取信息；&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SpeechGraph&lt;/code&gt;: 单页爬虫，从网站提取信息并生成音频文件。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SmartScraperMultiGraph&lt;/code&gt;: 多页爬虫，给定一个提示和一组来源。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;可以通过 API 使用不同的 LLM，如 &lt;strong&gt;OpenAI&lt;/strong&gt;、&lt;strong&gt;Groq&lt;/strong&gt;、&lt;strong&gt;Azure&lt;/strong&gt; 和 &lt;strong&gt;Gemini&lt;/strong&gt;，或者使用 &lt;strong&gt;Ollama&lt;/strong&gt; 的本地模型。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二准备工作&#34;&gt;二、准备工作&lt;/h2&gt;
&lt;h3 id=&#34;121-安装ollama&#34;&gt;2.1 安装ollama&lt;/h3&gt;
&lt;p&gt;点击前往网站 &lt;a href=&#34;https://ollama.com/&#34;&gt;https://ollama.com/&lt;/a&gt; ，下载ollama软件，支持win、Mac、linux&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-ollama-gui.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-下载llm&#34;&gt;2.2 下载LLM&lt;/h3&gt;
&lt;p&gt;ollama软件目前支持多种大模型，如阿里的（qwen、qwen2）、meta的（llama3）等。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-ollama-model.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;以llama3为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就试着安装，不合适不能用再删除即可。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-ollama-llama3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;打开电脑命令行cmd(mac是terminal)，确保网络已连接，执行模型下载(安装)命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama pull llama3
ollama pull qwen2
ollama pull nomic-embed-text
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;等待 &lt;strong&gt;llama3、qwen2、nomic-embed-text&lt;/strong&gt; 下载完成。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-安装python包&#34;&gt;2.3 安装python包&lt;/h3&gt;
&lt;p&gt;在python中调用ollama服务，需要ollama包。&lt;/p&gt;
&lt;p&gt;打开电脑命令行cmd(mac是terminal)，确保网络已连接，执行安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install ollama
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-启动ollama服务&#34;&gt;2.4 启动ollama服务&lt;/h3&gt;
&lt;p&gt;在Python中调用本地ollama服务，需要先启动本地ollama服务， 打开电脑命令行cmd(mac是terminal), 执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2024/06/14 14:52:24 routes.go:1011: INFO server config env=&amp;#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&amp;#34;total blobs: 18&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&amp;#34;total unused blobs removed: 0&amp;#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&amp;#34;Listening on 127.0.0.1:11434 (version 0.1.44)&amp;#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&amp;#34;extracting embedded files&amp;#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&amp;#34;Dynamic LLM libraries [metal]&amp;#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&amp;#34;inference compute&amp;#34; id=0 library=metal compute=&amp;#34;&amp;#34; driver=0.0 name=&amp;#34;&amp;#34; total=&amp;#34;72.0 GiB&amp;#34; available=&amp;#34;72.0 GiB&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;25-安装scrapegraphai及playwright&#34;&gt;2.5 安装scrapegraphai及playwright&lt;/h3&gt;
&lt;p&gt;打开电脑命令行cmd(mac是terminal)，确保网络已连接，执行安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip install scrapegraphai
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;之后继续命令行cmd(mac是terminal)执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;playwright install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;等待安装完成后，进行实验&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验&#34;&gt;三、实验&lt;/h2&gt;
&lt;h3 id=&#34;31-案例1&#34;&gt;3.1 案例1&lt;/h3&gt;
&lt;p&gt;以我的博客 &lt;code&gt;https://textdata.cn/blog/&lt;/code&gt; 为例，假设我想获取&lt;code&gt;标题、日期、文章链接&lt;/code&gt;。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-blog-list.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;代码如下:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scrapegraphai.graphs&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SmartScraperGraph&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;graph_config&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;llm&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ollama/llama3&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;temperature&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;format&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;json&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# Ollama 需要显式指定格式&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;base_url&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;http://localhost:11434&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置 Ollama URL&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;embeddings&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ollama/nomic-embed-text&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;base_url&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;http://localhost:11434&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置 Ollama URL&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;verbose&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;smart_scraper_graph&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SmartScraperGraph&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;返回该网站所有文章的标题、日期、文章链接&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 也接受已下载的 HTML 代码的字符串&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#source=requests.get(&amp;#34;https://textdata.cn/blog/&amp;#34;).text,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;source&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://textdata.cn/blog/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;config&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;graph_config&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;smart_scraper_graph&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;run&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████| 1/1 [00:00&amp;lt;00:00, 825.81it/s]

{&amp;#39;articles&amp;#39;: 
		[{&amp;#39;title&amp;#39;: &amp;#39;LIST | 社科(经管)数据挖掘文献资料汇总&amp;#39;, 
			&amp;#39;date&amp;#39;: &amp;#39;2024-04-15&amp;#39;, 
			&amp;#39;link&amp;#39;: &amp;#39;https://textdata.cn/blog/management_python_course/&amp;#39;}, 
			
			{&amp;#39;title&amp;#39;: &amp;#39;LIST| 文本分析代码资料汇总&amp;#39;, 
			&amp;#39;date&amp;#39;: &amp;#39;2024-04-15&amp;#39;,
			&amp;#39;link&amp;#39;:&amp;#39;https://textdata.cn/blog/text_analysis_code_list_about_ms/&amp;#39;}, 
			
			{&amp;#39;title&amp;#39;: &amp;#39;实验 | 使用本地大模型从文本中提取结构化信息&amp;#39;, 
			&amp;#39;date&amp;#39;: &amp;#39;2024-06-14&amp;#39;, 
			&amp;#39;link&amp;#39;: &amp;#39;https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/&amp;#39;}, 
			
			{&amp;#39;title&amp;#39;: &amp;#39;2023 | 文本分析在经管研究中的应用&amp;#39;, 
			&amp;#39;date&amp;#39;: &amp;#39;2023-11-05&amp;#39;, 
			&amp;#39;link&amp;#39;: &amp;#39;https://textdata.cn/blog/2023-11-05-xjtu-text-mining-in-ms/&amp;#39;}, 
			
			{&amp;#39;title&amp;#39;: &amp;#39;经管类 | 含 经济日报/经济观察报/中国工业报/中国贸易报/中国消费者报 等 10+ 家媒体(2024.05)&amp;#39;, 
			&amp;#39;date&amp;#39;: &amp;#39;2024-06-12&amp;#39;, 
			&amp;#39;link&amp;#39;: &amp;#39;https://textdata.cn/blog/2024-06-12-national-level-economic-daily-news-dataset/&amp;#39;}]}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
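SmartScraperGraph 返回的是一个字典，articles 键下是文章列表，可以直接转成 DataFrame 再保存（下面用假设的示例数据代替真实返回值）：

```python
import pandas as pd

# 假设 result 为上文 SmartScraperGraph 返回的字典（此处为示例数据）
result = {'articles': [
    {'title': 'LIST | 社科(经管)数据挖掘文献资料汇总',
     'date': '2024-04-15',
     'link': 'https://textdata.cn/blog/management_python_course/'},
    {'title': '实验 | 使用本地大模型从文本中提取结构化信息',
     'date': '2024-06-14',
     'link': 'https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/'},
]}

# 字典列表可直接构造 DataFrame，便于后续保存为 csv
article_df = pd.DataFrame(result['articles'])
print(article_df.shape)  # (2, 3)
```

拿到 DataFrame 后即可 <code>article_df.to_csv('articles.csv', index=False)</code> 落盘。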
&lt;h3 id=&#34;32-案例2&#34;&gt;3.2 案例2&lt;/h3&gt;
&lt;p&gt;采集豆瓣读书 &lt;code&gt;https://book.douban.com/top250&lt;/code&gt; 中的 &lt;code&gt;名字、作者名、评分、书籍链接&lt;/code&gt; 等信息。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-books.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scrapegraphai.graphs&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SmartScraperGraph&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;graph_config&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;llm&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ollama/llama3&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;temperature&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;format&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;json&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# Ollama 需要显式指定格式&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;base_url&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;http://localhost:11434&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置 Ollama URL&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;embeddings&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ollama/nomic-embed-text&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;base_url&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;http://localhost:11434&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置 Ollama URL&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;verbose&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;smart_scraper_graph2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SmartScraperGraph&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;返回该页面所有书的名字、作者名、评分、书籍链接&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;source&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://book.douban.com/top250&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;config&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;graph_config&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;result2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;smart_scraper_graph2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;run&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|████████████████████████| 1/1 [00:00&amp;lt;00:00, 1474.79it/s]
{}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;采集失败，返回空。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;将大模型llama3改为qwen2&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;from scrapegraphai.graphs import SmartScraperGraph


graph_config2 = {
    &amp;#34;llm&amp;#34;: {
        &amp;#34;model&amp;#34;: &amp;#34;ollama/qwen2&amp;#34;,
        &amp;#34;temperature&amp;#34;: 0,
        &amp;#34;format&amp;#34;: &amp;#34;json&amp;#34;,  # Ollama 需要显式指定格式
        &amp;#34;base_url&amp;#34;: &amp;#34;http://localhost:11434&amp;#34;,  # 设置 Ollama URL
    },
    &amp;#34;embeddings&amp;#34;: {
        &amp;#34;model&amp;#34;: &amp;#34;ollama/nomic-embed-text&amp;#34;,
        &amp;#34;base_url&amp;#34;: &amp;#34;http://localhost:11434&amp;#34;,  # 设置 Ollama URL
    },
    &amp;#34;verbose&amp;#34;: True,
}


smart_scraper_graph3 = SmartScraperGraph(
    prompt=&amp;#34;返回该页面所有书的名字、作者名、评分、书籍链接&amp;#34;,
    source=&amp;#34;https://book.douban.com/top250&amp;#34;,
    config=graph_config2
)

result3 = smart_scraper_graph3.run()
print(result3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|████████████████████████| 1/1 [00:00&amp;lt;00:00, 1102.60it/s]
{&amp;#39;urls&amp;#39;: [&amp;#39;https://book.douban.com/subject/10554308/&amp;#39;, &amp;#39;https://book.douban.com/subject/1084336/&amp;#39;, &amp;#39;https://book.douban.com/subject/1084336/&amp;#39;, &amp;#39;https://book.douban.com/subject/1046209/&amp;#39;, &amp;#39;https://book.douban.com/subject/1046209/&amp;#39;, &amp;#39;https://book.douban.com/subject/1255625/&amp;#39;, &amp;#39;https://book.douban.com/subject/1255625/&amp;#39;, &amp;#39;https://book.douban.com/subject/1060068/&amp;#39;, &amp;#39;https://book.douban.com/subject/1060068/&amp;#39;, &amp;#39;https://book.douban.com/subject/1449351/&amp;#39;, &amp;#39;https://book.douban.com/subject/1449351/&amp;#39;, &amp;#39;https://book.douban.com/subject/20424526/&amp;#39;, &amp;#39;https://book.douban.com/subject/20424526/&amp;#39;, &amp;#39;https://book.douban.com/subject/29799269/&amp;#39;, &amp;#39;https://book.douban.com/subject/1034062/&amp;#39;, &amp;#39;https://book.douban.com/subject/1229240/&amp;#39;, &amp;#39;https://book.douban.com/subject/1237549/&amp;#39;, &amp;#39;https://book.douban.com/subject/1078958/&amp;#39;, &amp;#39;https://book.douban.com/subject/1076932/&amp;#39;, &amp;#39;https://book.douban.com/subject/1075440/&amp;#39;, &amp;#39;https://book.douban.com/subject/1076932/&amp;#39;, &amp;#39;https://book.douban.com/subject/1078958/&amp;#39;, &amp;#39;https://book.douban.com/subject/1076932/&amp;#39;, &amp;#39;https://book.douban.com/subject/1078958/&amp;#39;, &amp;#39;https://book.douban.com/subject/1076932/&amp;#39;, &amp;#39;https://book.douban.com/subject/1078958/&amp;#39;, &amp;#39;https://book.douban.com/subject/1076932/&amp;#39;], &amp;#39;images&amp;#39;: [&amp;#39;https://img1.doubanio.com/view/subject/s/public/s1078958.jpg&amp;#39;, &amp;#39;https://img1.doubanio.com/view/subject/s/public/s1076932.jpg&amp;#39;, &amp;#39;https://img1.doubanio.com/view/subject/s/public/s1447349.jpg&amp;#39;]}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;采集到一些信息，但没有书名、作者等信息。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;注意&#34;&gt;注意：&lt;/h3&gt;
&lt;p&gt;代码需要在 &lt;code&gt;.py&lt;/code&gt; 中运行，在 &lt;code&gt;.ipynb&lt;/code&gt; 中运行会报错。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四讨论&#34;&gt;四、讨论&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;ScrapeGraphAI&lt;/strong&gt;&lt;/em&gt; 是目前大邓已知的唯一的大模型爬虫库，目前采集数据的成功率还比较低，而且由于底层使用 playwright，访问速度较慢。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库 cntext 使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<blockquote>
<p>前几日分享了<a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/">实验 | 使用本地大模型从文本中提取结构化信息</a>, 今天再分享一个  <a href="https://github.com/VinciGit00/Scrapegraph-ai">ScrapeGraphAI库</a>， 现在还不太好用，但未来写爬虫很可能会变得越来越容易。</p>
</blockquote>
<br>
<h2 id="一介绍">一、介绍</h2>
<p><em><strong>ScrapeGraphAI</strong></em>是一个<em>网络爬虫</em> Python 库，使用大型语言模型和有向图逻辑为网站和本地文档（XML、HTML、JSON 等）创建爬取管道。</p>
<p>只需告诉库您想提取哪些信息，它将为您完成！</p>
<p><img loading="lazy" src="img/scrapegraphai_logo.png" alt=""  />
</p>
<p>scrapegraphai 有以下几种主要的爬取管道，可用于从网站（或本地文件）提取信息：</p>
<ul>
<li><code>SmartScraperGraph</code>: 单页爬虫，只需用户提示和输入源；</li>
<li><code>SearchGraph</code>: 多页爬虫，从搜索引擎的前 n 个搜索结果中提取信息；</li>
<li><code>SpeechGraph</code>: 单页爬虫，从网站提取信息并生成音频文件；</li>
<li><code>SmartScraperMultiGraph</code>: 多页爬虫，给定一个提示，可从多个页面提取信息。</li>
</ul>
<p>这些管道可以通过 API 使用不同的 LLM，如 <strong>OpenAI</strong>、<strong>Groq</strong>、<strong>Azure</strong> 和 <strong>Gemini</strong>，或者使用 <strong>Ollama</strong> 的本地模型。</p>
<p><br><br></p>
<h2 id="二准备工作">二、准备工作</h2>
<h3 id="121-安装ollama">2.1 安装ollama</h3>
<p>点击前往网站 <a href="https://ollama.com/">https://ollama.com/</a> ，下载ollama软件，支持win、Mac、linux</p>
<p><img loading="lazy" src="img/02-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="22-下载llm">2.2 下载LLM</h3>
<p>ollama软件目前支持多种大模型，如阿里的 qwen、qwen2，Meta 的 llama3 等。</p>
<p><img loading="lazy" src="img/03-ollama-model.png" alt=""  />
</p>
<br>
<p>以llama3为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就先试着安装，不合适或不能用再删除即可。</p>
<p><img loading="lazy" src="img/04-ollama-llama3.png" alt=""  />
</p>
<br>
<p>打开电脑命令行cmd(mac是terminal)，在联网状态下执行模型下载(安装)命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3
ollama pull qwen2
ollama pull nomic-embed-text
</code></pre></div><p>等待 <strong>llama3、qwen2、nomic-embed-text</strong> 下载完成。</p>
<br>
<h3 id="23-安装python包">2.3 安装python包</h3>
<p>在python中调用ollama服务，需要ollama包。</p>
<p>打开电脑命令行cmd(mac是terminal)，在联网状态下执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
</code></pre></div><br>
<h3 id="24-启动ollama服务">2.4 启动ollama服务</h3>
<p>在Python中调用本地ollama服务，需要先启动本地ollama服务， 打开电脑命令行cmd(mac是terminal), 执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2024/06/14 14:52:24 routes.go:1011: INFO server config env=&#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&#34;total blobs: 18&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&#34;total unused blobs removed: 0&#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&#34;Listening on 127.0.0.1:11434 (version 0.1.44)&#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&#34;extracting embedded files&#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&#34;Dynamic LLM libraries [metal]&#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&#34;inference compute&#34; id=0 library=metal compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。</p>
<br>
<h3 id="25-安装scrapegraphai及playwright">2.5 安装scrapegraphai及playwright</h3>
<p>打开电脑命令行cmd(mac是terminal)，在联网状态下执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install scrapegraphai
</code></pre></div><p>之后继续命令行cmd(mac是terminal)执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">playwright install
</code></pre></div><p>等待安装完成后，即可进行实验。</p>
<p><br><br></p>
<h2 id="三实验">三、实验</h2>
<h3 id="31-案例1">3.1 案例1</h3>
<p>以我的博客 <code>https://textdata.cn/blog/</code> 为例，假设我想获取<code>标题、日期、文章链接</code>。</p>
<p><img loading="lazy" src="img/06-blog-list.png" alt=""  />
</p>
<p>代码如下:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">scrapegraphai.graphs</span> <span class="kn">import</span> <span class="n">SmartScraperGraph</span>


<span class="n">graph_config</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">&#34;llm&#34;</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;ollama/llama3&#34;</span><span class="p">,</span>
        <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s2">&#34;format&#34;</span><span class="p">:</span> <span class="s2">&#34;json&#34;</span><span class="p">,</span>  <span class="c1"># Ollama 需要显式指定格式</span>
        <span class="s2">&#34;base_url&#34;</span><span class="p">:</span> <span class="s2">&#34;http://localhost:11434&#34;</span><span class="p">,</span>  <span class="c1"># 设置 Ollama URL</span>
    <span class="p">},</span>
    <span class="s2">&#34;embeddings&#34;</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;ollama/nomic-embed-text&#34;</span><span class="p">,</span>
        <span class="s2">&#34;base_url&#34;</span><span class="p">:</span> <span class="s2">&#34;http://localhost:11434&#34;</span><span class="p">,</span>  <span class="c1"># 设置 Ollama URL</span>
    <span class="p">},</span>
    <span class="s2">&#34;verbose&#34;</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
<span class="p">}</span>

<span class="n">smart_scraper_graph</span> <span class="o">=</span> <span class="n">SmartScraperGraph</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="s2">&#34;返回该网站所有文章的标题、日期、文章链接&#34;</span><span class="p">,</span>
    <span class="c1"># 也接受已下载的 HTML 代码的字符串</span>
    <span class="c1">#source=requests.get(&#34;https://textdata.cn/blog/&#34;).text,</span>
    <span class="n">source</span><span class="o">=</span><span class="s2">&#34;https://textdata.cn/blog/&#34;</span><span class="p">,</span>
    <span class="n">config</span><span class="o">=</span><span class="n">graph_config</span>
<span class="p">)</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">smart_scraper_graph</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████| 1/1 [00:00&lt;00:00, 825.81it/s]

{&#39;articles&#39;: 
		[{&#39;title&#39;: &#39;LIST | 社科(经管)数据挖掘文献资料汇总&#39;, 
			&#39;date&#39;: &#39;2024-04-15&#39;, 
			&#39;link&#39;: &#39;https://textdata.cn/blog/management_python_course/&#39;}, 
			
			{&#39;title&#39;: &#39;LIST| 文本分析代码资料汇总&#39;, 
			&#39;date&#39;: &#39;2024-04-15&#39;,
			&#39;link&#39;:&#39;https://textdata.cn/blog/text_analysis_code_list_about_ms/&#39;}, 
			
			{&#39;title&#39;: &#39;实验 | 使用本地大模型从文本中提取结构化信息&#39;, 
			&#39;date&#39;: &#39;2024-06-14&#39;, 
			&#39;link&#39;: &#39;https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/&#39;}, 
			
			{&#39;title&#39;: &#39;2023 | 文本分析在经管研究中的应用&#39;, 
			&#39;date&#39;: &#39;2023-11-05&#39;, 
			&#39;link&#39;: &#39;https://textdata.cn/blog/2023-11-05-xjtu-text-mining-in-ms/&#39;}, 
			
			{&#39;title&#39;: &#39;经管类 | 含 经济日报/经济观察报/中国工业报/中国贸易报/中国消费者报 等 10+ 家媒体(2024.05)&#39;, 
			&#39;date&#39;: &#39;2024-06-12&#39;, 
			&#39;link&#39;: &#39;https://textdata.cn/blog/2024-06-12-national-level-economic-daily-news-dataset/&#39;}]}

</code></pre></div><br>
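<p>拿到 result 后，可以进一步把其中的 articles 整理成表格保存。下面是一个最小示例（使用 pandas；示例数据是上面输出的缩减版，articles.csv 为假设的文件名）：</p>

```python
import pandas as pd

# 示例数据：结构与上面 result 的输出一致，这里只取两条演示
result = {
    'articles': [
        {'title': 'LIST | 社科(经管)数据挖掘文献资料汇总',
         'date': '2024-04-15',
         'link': 'https://textdata.cn/blog/management_python_course/'},
        {'title': '实验 | 使用本地大模型从文本中提取结构化信息',
         'date': '2024-06-14',
         'link': 'https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/'},
    ]
}

# 转为 DataFrame 并保存为 csv
df = pd.DataFrame(result['articles'])
df.to_csv('articles.csv', index=False, encoding='utf-8-sig')
print(df.shape)  # 打印 (2, 3)
```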
<h3 id="32-案例2">3.2 案例2</h3>
<p>采集豆瓣读书 <code>https://book.douban.com/top250</code> 中的 <code>名字、作者名、评分、书籍链接</code> 等信息。</p>
<p><img loading="lazy" src="img/07-books.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">scrapegraphai.graphs</span> <span class="kn">import</span> <span class="n">SmartScraperGraph</span>


<span class="n">graph_config</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">&#34;llm&#34;</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;ollama/llama3&#34;</span><span class="p">,</span>
        <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s2">&#34;format&#34;</span><span class="p">:</span> <span class="s2">&#34;json&#34;</span><span class="p">,</span>  <span class="c1"># Ollama 需要显式指定格式</span>
        <span class="s2">&#34;base_url&#34;</span><span class="p">:</span> <span class="s2">&#34;http://localhost:11434&#34;</span><span class="p">,</span>  <span class="c1"># 设置 Ollama URL</span>
    <span class="p">},</span>
    <span class="s2">&#34;embeddings&#34;</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;ollama/nomic-embed-text&#34;</span><span class="p">,</span>
        <span class="s2">&#34;base_url&#34;</span><span class="p">:</span> <span class="s2">&#34;http://localhost:11434&#34;</span><span class="p">,</span>  <span class="c1"># 设置 Ollama URL</span>
    <span class="p">},</span>
    <span class="s2">&#34;verbose&#34;</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
<span class="p">}</span>


<span class="n">smart_scraper_graph2</span> <span class="o">=</span> <span class="n">SmartScraperGraph</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="s2">&#34;返回该页面所有书的名字、作者名、评分、书籍链接&#34;</span><span class="p">,</span>
    <span class="n">source</span><span class="o">=</span><span class="s2">&#34;https://book.douban.com/top250&#34;</span><span class="p">,</span>
    <span class="n">config</span><span class="o">=</span><span class="n">graph_config</span>
<span class="p">)</span>

<span class="n">result2</span> <span class="o">=</span> <span class="n">smart_scraper_graph2</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result2</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|████████████████████████| 1/1 [00:00&lt;00:00, 1474.79it/s]
{}
</code></pre></div><p>采集失败，返回空。</p>
<br>
<p>将大模型 llama3 改为 qwen2，再试一次：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">from scrapegraphai.graphs import SmartScraperGraph


graph_config2 = {
    &#34;llm&#34;: {
        &#34;model&#34;: &#34;ollama/qwen2&#34;,
        &#34;temperature&#34;: 0,
        &#34;format&#34;: &#34;json&#34;,  # Ollama 需要显式指定格式
        &#34;base_url&#34;: &#34;http://localhost:11434&#34;,  # 设置 Ollama URL
    },
    &#34;embeddings&#34;: {
        &#34;model&#34;: &#34;ollama/nomic-embed-text&#34;,
        &#34;base_url&#34;: &#34;http://localhost:11434&#34;,  # 设置 Ollama URL
    },
    &#34;verbose&#34;: True,
}


smart_scraper_graph3 = SmartScraperGraph(
    prompt=&#34;返回该页面所有书的名字、作者名、评分、书籍链接&#34;,
    source=&#34;https://book.douban.com/top250&#34;,
    config=graph_config2
)

result3 = smart_scraper_graph3.run()
print(result3)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|████████████████████████| 1/1 [00:00&lt;00:00, 1102.60it/s]
{&#39;urls&#39;: [&#39;https://book.douban.com/subject/10554308/&#39;, &#39;https://book.douban.com/subject/1084336/&#39;, &#39;https://book.douban.com/subject/1084336/&#39;, &#39;https://book.douban.com/subject/1046209/&#39;, &#39;https://book.douban.com/subject/1046209/&#39;, &#39;https://book.douban.com/subject/1255625/&#39;, &#39;https://book.douban.com/subject/1255625/&#39;, &#39;https://book.douban.com/subject/1060068/&#39;, &#39;https://book.douban.com/subject/1060068/&#39;, &#39;https://book.douban.com/subject/1449351/&#39;, &#39;https://book.douban.com/subject/1449351/&#39;, &#39;https://book.douban.com/subject/20424526/&#39;, &#39;https://book.douban.com/subject/20424526/&#39;, &#39;https://book.douban.com/subject/29799269/&#39;, &#39;https://book.douban.com/subject/1034062/&#39;, &#39;https://book.douban.com/subject/1229240/&#39;, &#39;https://book.douban.com/subject/1237549/&#39;, &#39;https://book.douban.com/subject/1078958/&#39;, &#39;https://book.douban.com/subject/1076932/&#39;, &#39;https://book.douban.com/subject/1075440/&#39;, &#39;https://book.douban.com/subject/1076932/&#39;, &#39;https://book.douban.com/subject/1078958/&#39;, &#39;https://book.douban.com/subject/1076932/&#39;, &#39;https://book.douban.com/subject/1078958/&#39;, &#39;https://book.douban.com/subject/1076932/&#39;, &#39;https://book.douban.com/subject/1078958/&#39;, &#39;https://book.douban.com/subject/1076932/&#39;], &#39;images&#39;: [&#39;https://img1.doubanio.com/view/subject/s/public/s1078958.jpg&#39;, &#39;https://img1.doubanio.com/view/subject/s/public/s1076932.jpg&#39;, &#39;https://img1.doubanio.com/view/subject/s/public/s1447349.jpg&#39;]}
</code></pre></div><p>采集到一些信息，但没有书名、作者等信息。</p>
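<p>返回的 urls 列表里有大量重复链接。若想基于这些链接继续采集，可以先去重并保持原顺序。一个最小示例（urls 为上面输出的缩减版）：</p>

```python
# 用 dict.fromkeys 去重，同时保持链接原有顺序
urls = [
    'https://book.douban.com/subject/10554308/',
    'https://book.douban.com/subject/1084336/',
    'https://book.douban.com/subject/1084336/',
    'https://book.douban.com/subject/1046209/',
    'https://book.douban.com/subject/1046209/',
]
unique_urls = list(dict.fromkeys(urls))
print(len(unique_urls))  # 打印 3
```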
<br>
<h3 id="注意">注意：</h3>
<p>代码需要在 <code>.py</code> 中运行，在 <code>.ipynb</code> 中运行会报错。</p>
<p><br><br></p>
<h2 id="四讨论">四、讨论</h2>
<p><em><strong>ScrapeGraphAI</strong></em> 是目前大邓已知的唯一的大模型爬虫库，现在采集数据的成功率还比较低；而且因为底层使用 playwright，访问速度较慢。</p>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a>。</p>
<br>
<br>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>实验 | 使用本地大模型从文本中提取结构化信息</title>
      <link>https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/</link>
      <pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/</guid>
      <description>&lt;p&gt;非结构文本、图片、视频等数据是待挖掘的数据矿藏， 在经管、社科等研究领域中谁拥有了&lt;em&gt;&lt;strong&gt;从非结构提取结构化信息的能力&lt;/strong&gt;&lt;/em&gt;，谁就拥有科研上的数据优势。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一需求&#34;&gt;一、需求&lt;/h2&gt;
&lt;p&gt;现在有很多电子发票 PDF 文件，希望用自动化工具帮我们批量地从发票 PDF 中提取出结构化信息。例如，从下面这张发票&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-raw-pdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;提取出DICT_DATA&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;DICT_DATA = {
    &amp;#34;开票日期&amp;#34;: &amp;#34;2023年01月06日&amp;#34;,
    &amp;#34;应税货物(或服务)名称&amp;#34;: &amp;#34;*信息技术服务*技术服务费&amp;#34;,
    &amp;#34;价税合计(大写)&amp;#34;: &amp;#34;&amp;#34;,
    &amp;#34;税率&amp;#34;: &amp;#34;6%&amp;#34;,
    &amp;#34;备注&amp;#34;: &amp;#34;230106163474406331&amp;#34;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
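&lt;p&gt;多张发票各自提取出 DICT_DATA 之后，最终可以汇总保存为 result.csv。下面是一个最小示例（使用 pandas；records 里只放了一条示例数据）：&lt;/p&gt;

```python
import pandas as pd

# 每张发票提取出一个字典，汇总成列表（这里只放一条示例数据）
records = [
    {'开票日期': '2023年01月06日',
     '应税货物(或服务)名称': '*信息技术服务*技术服务费',
     '价税合计(大写)': '贰佰陆拾叁元整',
     '税率': '6%',
     '备注': '230106163474406331'},
]

# 转成表格并保存；utf-8-sig 编码便于 Excel 直接打开
df = pd.DataFrame(records)
df.to_csv('result.csv', index=False, encoding='utf-8-sig')
print(df.shape)  # 打印 (1, 5)
```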
&lt;h2 id=&#34;二ollama介绍&#34;&gt;二、Ollama介绍&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://ollama.ai/&#34;&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/a&gt;是一款开源应用程序，可让您使用 MacOS、Linux 和 Windows 上的命令行界面在本地运行、创建和共享大型语言模型。&lt;/p&gt;
&lt;p&gt;Ollama 可以直接从其模型库中获取各种 LLM，只需一条命令即可下载，下载后再执行一条命令即可开始使用。这对主要在终端中工作的用户非常有帮助：遇到问题时，无需切换到浏览器窗口就能得到答案。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;21-特点和优点&#34;&gt;2.1 特点和优点&lt;/h3&gt;
&lt;p&gt;这就是为什么 Ollama 是您的工具包中必备的工具：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;简单&lt;/strong&gt; ：Ollama 提供简单的设置过程。您无需拥有机器学习博士学位即可启动和运行它。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;成本效益&lt;/strong&gt; ：在本地运行模型意味着您无需支付云服务成本。您的钱包会感谢您。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;隐私&lt;/strong&gt; ：使用 Ollama，所有数据处理都在您的本地机器上进行。这对于用户隐私来说是一个巨大的胜利。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;多功能性&lt;/strong&gt; ：Ollama 不只是为 Python 爱好者准备的。它的灵活性使其可以用于各种应用程序，包括 Web 开发。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-使用-ollama-进行-llm-选择&#34;&gt;2.2 使用 Ollama 进行 LLM 选择&lt;/h3&gt;
&lt;p&gt;默认情况下，CrewAI 等工具使用 OpenAI 的模型作为 LLM。在有经费、有网络、不担心数据泄露等条件下，若力求最佳性能，可考虑使用 GPT-4 或 OpenAI 稍便宜的 GPT-3.5。&lt;/p&gt;
&lt;p&gt;但本文是要 &lt;strong&gt;本地部署&lt;/strong&gt;， 因此我们将使用 Meta Llama 3，这是迄今为止功能最强大的公开 LLM。Meta Llama 3 是 Meta Inc. 开发的模型系列，是最新推出的模型，具有 8B 和 70B 两种参数大小（预训练或指令调整）。Llama 3 指令调整模型针对对话/聊天用例进行了微调和优化，并且在常见基准测试中胜过许多可用的开源聊天模型。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-llama3-performance.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-llama3-performance.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-安装ollama&#34;&gt;2.3 安装ollama&lt;/h3&gt;
&lt;p&gt;点击前往网站 &lt;a href=&#34;https://ollama.com/&#34;&gt;https://ollama.com/&lt;/a&gt; ，下载ollama软件，支持win、Mac、linux&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-ollama-gui.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;24-下载llm&#34;&gt;2.4 下载LLM&lt;/h3&gt;
&lt;p&gt;ollama软件目前支持多种大模型，如阿里的 qwen、qwen2，Meta 的 llama3、llama3.1 等。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-ollama-model.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;以llama3为例，根据自己电脑显存性能， 选择适宜的版本。如果不知道选什么，那就先试着安装，不合适或不能用再删除即可。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-ollama-llama3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;打开电脑命令行cmd(mac是terminal)，在联网状态下执行模型下载(安装)命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama pull llama3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;等待 &lt;strong&gt;llama3:8b&lt;/strong&gt; 下载完成。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;25-安装python包&#34;&gt;2.5 安装python包&lt;/h3&gt;
&lt;p&gt;在python中调用ollama服务，需要ollama包。&lt;/p&gt;
&lt;p&gt;打开电脑命令行cmd(mac是terminal)，在联网状态下执行安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install ollama
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;26-启动ollama服务&#34;&gt;2.6 启动ollama服务&lt;/h3&gt;
&lt;p&gt;在Python中调用本地ollama服务，需要先启动本地ollama服务， 打开电脑命令行cmd(mac是terminal), 执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2024/06/14 14:52:24 routes.go:1011: INFO server config env=&amp;#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&amp;#34;total blobs: 18&amp;#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&amp;#34;total unused blobs removed: 0&amp;#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&amp;#34;Listening on 127.0.0.1:11434 (version 0.1.44)&amp;#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&amp;#34;extracting embedded files&amp;#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&amp;#34;Dynamic LLM libraries [metal]&amp;#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&amp;#34;inference compute&amp;#34; id=0 library=metal compute=&amp;#34;&amp;#34; driver=0.0 name=&amp;#34;&amp;#34; total=&amp;#34;72.0 GiB&amp;#34; available=&amp;#34;72.0 GiB&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;cmd(mac是terminal)看到如上的信息，说明本地ollama服务已开启。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验&#34;&gt;三、实验&lt;/h2&gt;
&lt;h3 id=&#34;31-代码结构&#34;&gt;3.1 代码结构&lt;/h3&gt;
&lt;p&gt;点击下载 &lt;a href=&#34;project.zip&#34;&gt;本文代码&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;project
   |
  - 代码.ipynb   #代码
  - prompt.txt  #提示模板
  - data
      |--- 1.pdf #实验的发票
  - result.csv   #结果
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-读取pdf&#34;&gt;3.2 读取pdf&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-raw-pdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/1.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;__version__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2.1.3

&amp;#39; 北京增值税电子普通发票发票代码： \n发票号码： 69453658\n开票日期： 2023年01月06日\n校 验 码： \n购\n买\n方名        称： 哈尔滨所以然信息技术有限公司\n密\n码\n区030898/5&amp;lt;32&amp;gt;*/0*440/63+79*08\n纳税人识别号： 91230109MABT7KBC4M /&amp;lt;54&amp;lt;1*6+49&amp;lt;-*+*&amp;gt;7&amp;lt;-8*04&amp;lt;+01\n地 址、电 话： 68+160026-45904*2&amp;lt;+3+15503&amp;gt;2\n开户行及账号： 98*2/*-*480145+-19*0917-1*61\n货物或应税劳务、服务名称 规格型号 单 位 数 量 单 价 金 额 税率 税 额\n*信息技术服务*技术服务费 1248.113208 248.11 6% 14.89\n合      计 ￥248.11 ￥14.89\n价税合计（大写）\n  贰佰陆拾叁元整             （小写）￥263.00\n销\n售\n方名        称： \n备\n注230106163474406331\n纳税人识别号： 91110108MA01WFY0X6\n地 址、电 话： \n开户行及账号： \n  收款人： 复核： 开票人： 销售方：（章）&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
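&lt;p&gt;read_pdf 提取出的发票文本夹杂大量换行和空格，交给大模型前可以先做简单清洗（是否需要这一步取决于模型效果，本文实验中直接用原文也能成功）。一个最小示例（raw 为上面输出的片段）：&lt;/p&gt;

```python
# 把连续的空白字符（含换行）压缩成单个空格
raw = ' 北京增值税电子普通发票发票代码： \n发票号码： 69453658\n开票日期： 2023年01月06日\n'
clean = ' '.join(raw.split())
print(clean)
```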
&lt;h3 id=&#34;34-提取信息&#34;&gt;3.3 提取信息&lt;/h3&gt;
&lt;p&gt;调用 ollama 服务中的大模型 &lt;em&gt;&lt;strong&gt;llama3:8b&lt;/strong&gt;&lt;/em&gt; 时，需要向它提供提示信息(prompt)和数据。下面是我在实验里设计的提示信息prompt&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;prompt = open(&amp;#39;prompt.txt&amp;#39;, encoding=&amp;#39;utf-8&amp;#39;).read()
print(prompt)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;发票文本内容
--- 
{TEXT} 
---

以 JSON 格式回答。 JSON 应包含如下信息， 依次为&amp;#34;开票日期&amp;#34;, &amp;#34;应税货物(或服务)名称&amp;#34;, &amp;#34;价税合计(大写)&amp;#34;, &amp;#34;税率&amp;#34;, &amp;#34;备注&amp;#34;; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
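&lt;p&gt;模板中的 {TEXT} 是占位符。本文代码是把发票文本作为 user 消息单独传入，也可以直接用 str.format 把文本填进模板。一个最小示例（模板字符串为缩减版，仅作演示）：&lt;/p&gt;

```python
# 用 str.format 把发票文本填入模板中的 {TEXT} 占位符
prompt_template = '发票文本内容\n---\n{TEXT}\n---\n以 JSON 格式回答。'
invoice_text = '开票日期： 2023年01月06日 税率 6%'
prompt = prompt_template.format(TEXT=invoice_text)
print(prompt)
```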
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取发票pdf&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/1.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#读取prompt&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prompt.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;llama3:8b&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
      &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;system&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
      &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;```&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;```&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 20.5 ms, sys: 2.34 ms, total: 22.9 ms
Wall time: 3.58 s

{&amp;#39;开票日期&amp;#39;: &amp;#39;2023年01月06日&amp;#39;,
 &amp;#39;应税货物(或服务)名称&amp;#39;: &amp;#39;*信息技术服务*技术服务费&amp;#39;,
 &amp;#39;价税合计(大写)&amp;#39;: &amp;#39;贰佰陆拾叁元整&amp;#39;,
 &amp;#39;税率&amp;#39;: &amp;#39;6%&amp;#39;,
 &amp;#39;备注&amp;#39;: &amp;#39;230106163474406331&amp;#39;}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-封装成函数extract_info&#34;&gt;3.3 封装成函数extract_info&lt;/h3&gt;
&lt;p&gt;实验成功，我们将其封装为函数&lt;em&gt;&lt;strong&gt;extract_info&lt;/strong&gt;&lt;/em&gt;， 为增强代码的鲁棒性， 函数内设置了异常处理机制，最多可重试3次。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;extract_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;llama3:8b&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
                    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;system&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
                    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
                &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
            &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

            &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;```&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;```&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;
        
        &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt; &lt;span class=&#34;ne&#34;&gt;Exception&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
                &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;An error occurred: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;. Retrying (&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;)...&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
                &lt;span class=&#34;k&#34;&gt;raise&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;
  
&lt;span class=&#34;c1&#34;&gt;#读取prompt&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prompt.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;extract_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;result 与之前一致，为节省版面，此处不再展示。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;34-批量提取&#34;&gt;3.4 批量提取&lt;/h3&gt;
&lt;p&gt;假设data文件夹内有成百上千张发票(实验中实际只有一张)， 对data文件夹批量提取信息，并将结果存储为csv。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;extract_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;llama3:8b&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
                    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;system&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
                    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
                &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
            &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

            &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;```&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;```&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;
        
        &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt; &lt;span class=&#34;ne&#34;&gt;Exception&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
                &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;An error occurred: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;. Retrying (&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;attempt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_retries&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;)...&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
            &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
                &lt;span class=&#34;k&#34;&gt;raise&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;
                
  
&lt;span class=&#34;c1&#34;&gt;#当前代码所在的代码文件与data文件夹处于同一个文件夹内&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#获取data内所有pdf的路径&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf_files&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;file&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;listdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;.pdf&amp;#39;&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#读取prompt&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prompt.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;dict_datas&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf_file&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf_files&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;dict_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;extract_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;dict_datas&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dict_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dict_datas&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 32 ms, sys: 2.17 ms, total: 34.2 ms
Wall time: 3.8 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四讨论&#34;&gt;四、讨论&lt;/h2&gt;
&lt;p&gt;本文只用一张发票做了实验，实际应用中的准确率没有这么高，识别错误集中在销售方纳税人识别号字段(案例未展示该字段的识别)。原因主要是ct.read_pdf读入pdf后文本比较杂乱，对大模型的语义理解构成一定挑战。目前大模型已逐步支持文本、图片、音频、视频、网址等多种输入形式，相信不用等太久，该问题就能得到缓解。&lt;/p&gt;
&lt;p&gt;大模型会对每个输入给出其认为正确概率最大的回答，因此用大模型提取数据时存在一定的误识别风险。为降低该风险，应尽量选择格式特殊、醒目的字段，例如发票的&lt;strong&gt;价税合计(大写)&lt;/strong&gt;：该信息由特殊的中文大写数字构成，在全部文本中最醒目、最易区分，大模型处理这类信息时会给予尽可能高的权重，从而提高回答的准确率。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-06-16-scrapegraph-ai/&#34;&gt;网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<p>非结构化文本、图片、视频等数据是待挖掘的数据矿藏。在经管、社科等研究领域，谁拥有了<em><strong>从非结构化数据中提取结构化信息的能力</strong></em>，谁就拥有科研上的数据优势。</p>
<p><br><br></p>
<h2 id="一需求">一、需求</h2>
<p>现在有很多电子发票PDF文件， 希望借助自动化工具批量地从发票PDF中提取出结构化信息。例如，从下面这张发票</p>
<p><img loading="lazy" src="img/01-raw-pdf.png" alt=""  />
</p>
<p>提取出DICT_DATA</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">DICT_DATA = {
    &#34;开票日期&#34;: &#34;2023年01月06日&#34;,
    &#34;应税货物(或服务)名称&#34;: &#34;*信息技术服务*技术服务费&#34;,
    &#34;价税合计(大写)&#34;: &#34;贰佰陆拾叁元整&#34;,
    &#34;税率&#34;: &#34;6%&#34;,
    &#34;备注&#34;: &#34;230106163474406331&#34;
}
</code></pre></div><p><br><br></p>
<h2 id="二ollama介绍">二、Ollama介绍</h2>
<p><a href="https://ollama.ai/"><strong>Ollama</strong></a>是一款开源应用程序，让您通过命令行界面在 macOS、Linux 和 Windows 上本地运行、创建和共享大型语言模型。</p>
<p>Ollama 可以直接从其模型库中获取各种 LLM，一条命令即可下载，下载完成后再执行一条命令即可开始使用。这对主要在终端窗口中工作的用户非常方便：遇到问题时，无需切换到浏览器窗口即可获得答案。</p>
<br>
<h3 id="21-特点和优点">2.1 特点和优点</h3>
<p>以下是 Ollama 值得放入工具箱的几个原因：</p>
<ul>
<li><strong>简单</strong> ：Ollama 的安装配置过程非常简单，无需机器学习博士学位即可上手。</li>
<li><strong>成本效益</strong> ：模型在本地运行，无需支付云端费用，您的钱包会感谢您。</li>
<li><strong>隐私</strong> ：所有数据处理都在本地机器上完成，这对用户隐私是一大优势。</li>
<li><strong>多功能性</strong> ：Ollama 不只面向 Python 用户，其灵活性使其适用于包括 Web 开发在内的多种应用场景。</li>
</ul>
<br>
<h3 id="22-使用-ollama-进行-llm-选择">2.2 使用 Ollama 进行 LLM 选择</h3>
<p>在有经费、网络通畅、且不担心数据泄露等条件下，若力求达到最佳性能，可考虑使用 OpenAI 的 GPT-4，或稍便宜的 GPT-3.5。</p>
<p>但本文要做的是 <strong>本地部署</strong>，因此我们将使用 Meta Llama 3，它是目前功能最强大的公开 LLM 之一。Llama 3 是 Meta 开发的最新模型系列，提供 8B 和 70B 两种参数规模（均有预训练版与指令微调版）。其指令微调版针对对话/聊天用例做了微调和优化，在常见基准测试中胜过许多可用的开源聊天模型。</p>
<p><img loading="lazy" src="img/02-llama3-performance.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-llama3-performance.png" alt=""  />
</p>
<br>
<h3 id="23-安装ollama">2.3 安装ollama</h3>
<p>点击前往网站 <a href="https://ollama.com/">https://ollama.com/</a> 下载ollama软件，支持 Windows、macOS、Linux。</p>
<p><img loading="lazy" src="img/04-ollama-gui.png" alt=""  />
</p>
<br>
<h3 id="24-下载llm">2.4 下载LLM</h3>
<p>ollama软件目前支持多种大模型，如阿里的 qwen、qwen2，Meta 的 llama3、llama3.1 等。</p>
<p><img loading="lazy" src="img/05-ollama-model.png" alt=""  />
</p>
<br>
<p>以llama3为例，请根据自己电脑的显存大小选择适宜的版本。如果拿不准选哪个，可以先安装试用，不合适再删除即可。</p>
<p><img loading="lazy" src="img/06-ollama-llama3.png" alt=""  />
</p>
<br>
<p>打开电脑命令行cmd(mac是terminal)，确保网络连通后，执行模型下载(安装)命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama pull llama3
</code></pre></div><p>等待 <strong>llama3:8b</strong> 下载完成。</p>
<br>
<h3 id="25-安装python包">2.5 安装python包</h3>
<p>在python中调用ollama服务，需要ollama包。</p>
<p>打开电脑命令行cmd(mac是terminal)，确保网络连通后，执行安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install ollama
</code></pre></div><br>
<h3 id="26-启动ollama服务">2.6 启动ollama服务</h3>
<p>在Python中调用本地ollama服务前，需要先启动该服务：打开电脑命令行cmd(mac是terminal)，执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ollama serve
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2024/06/14 14:52:24 routes.go:1011: INFO server config env=&#34;map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/deng/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:725 msg=&#34;total blobs: 18&#34;
time=2024-06-14T14:52:24.742+08:00 level=INFO source=images.go:732 msg=&#34;total unused blobs removed: 0&#34;
time=2024-06-14T14:52:24.743+08:00 level=INFO source=routes.go:1057 msg=&#34;Listening on 127.0.0.1:11434 (version 0.1.44)&#34;
time=2024-06-14T14:52:24.744+08:00 level=INFO source=payload.go:30 msg=&#34;extracting embedded files&#34; dir=/var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/ollama4239159529/runners
time=2024-06-14T14:52:24.772+08:00 level=INFO source=payload.go:44 msg=&#34;Dynamic LLM libraries [metal]&#34;
time=2024-06-14T14:52:24.796+08:00 level=INFO source=types.go:71 msg=&#34;inference compute&#34; id=0 library=metal compute=&#34;&#34; driver=0.0 name=&#34;&#34; total=&#34;72.0 GiB&#34; available=&#34;72.0 GiB&#34;
</code></pre></div><p>cmd(mac是terminal)中看到如上信息，说明本地ollama服务已开启。</p>
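<p>执行 ollama.chat 之前，也可以在 Python 里先确认本地服务确实已在默认端口 11434 上监听。下面是笔者补充的一个示意性探测小函数（端口探测方式为假设的辅助步骤，非原文必需）：</p>

```python
import socket

def ollama_alive(host='127.0.0.1', port=11434, timeout=1.0):
    """探测本地 Ollama 服务端口是否可连通，返回 True/False"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 若返回 False，说明服务未启动，应先在命令行执行 ollama serve
print(ollama_alive())
```

<p>这样在批量任务开始前就能快速发现服务未启动的问题，而不是等到第一次 chat 调用报错。</p>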
<p><br><br></p>
<h2 id="三实验">三、实验</h2>
<h3 id="31-代码结构">3.1 代码结构</h3>
<p>点击下载 <a href="project.zip">本文代码</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">project
  ├── 代码.ipynb   # 代码
  ├── prompt.txt   # 提示模板
  ├── data
  │    └── 1.pdf   # 实验的发票
  └── result.csv   # 结果
</code></pre></div><br>
<h3 id="32-读取pdf">3.2 读取pdf</h3>
<p><img loading="lazy" src="img/01-raw-pdf.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;data/1.pdf&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.3

&#39; 北京增值税电子普通发票发票代码： \n发票号码： 69453658\n开票日期： 2023年01月06日\n校 验 码： \n购\n买\n方名        称： 哈尔滨所以然信息技术有限公司\n密\n码\n区030898/5&lt;32&gt;*/0*440/63+79*08\n纳税人识别号： 91230109MABT7KBC4M /&lt;54&lt;1*6+49&lt;-*+*&gt;7&lt;-8*04&lt;+01\n地 址、电 话： 68+160026-45904*2&lt;+3+15503&gt;2\n开户行及账号： 98*2/*-*480145+-19*0917-1*61\n货物或应税劳务、服务名称 规格型号 单 位 数 量 单 价 金 额 税率 税 额\n*信息技术服务*技术服务费 1248.113208 248.11 6% 14.89\n合      计 ￥248.11 ￥14.89\n价税合计（大写）\n  贰佰陆拾叁元整             （小写）￥263.00\n销\n售\n方名        称： \n备\n注230106163474406331\n纳税人识别号： 91110108MA01WFY0X6\n地 址、电 话： \n开户行及账号： \n  收款人： 复核： 开票人： 销售方：（章）&#39;
</code></pre></div><br>
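<p>可以看到，ct.read_pdf 返回的文本夹杂大量换行与空格。若想在送入大模型前先做简单清洗，可参考下面基于 re 的小片段（笔者补充的示意步骤，非原文必需）：</p>

```python
import re

def clean_invoice_text(text):
    """把连续的空白字符(换行、空格等)压缩为单个空格，便于大模型阅读"""
    return re.sub(r'\s+', ' ', text).strip()

sample = ' 北京增值税电子普通发票发票代码： \n发票号码： 69453658\n开票日期： 2023年01月06日'
print(clean_invoice_text(sample))
```

<p>清洗与否对提取准确率的影响因模型而异，建议两种输入各跑几条样本对比后再定。</p>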
<h3 id="34-提取信息">3.3 提取信息</h3>
<p>调用ollama服务中的大模型 <em><strong>llama3:8b</strong></em> 时，需要向其提供提示词(prompt)与数据。下面是我在实验中设计的提示词prompt</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">prompt = open(&#39;prompt.txt&#39;, encoding=&#39;utf-8&#39;).read()
print(prompt)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">发票文本内容
--- 
{TEXT} 
---

以 JSON 格式回答。 JSON 应包含如下信息， 依次为&#34;开票日期&#34;, &#34;应税货物(或服务)名称&#34;, &#34;价税合计(大写)&#34;, &#34;税率&#34;, &#34;备注&#34;; 
</code></pre></div><br>
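<p>顺带一提，模板中的 {TEXT} 是占位符，而后文代码是把 prompt 作为 system 消息、发票文本作为 user 消息分开传入，并未实际替换该占位符。若希望把发票文本直接嵌入模板，可以用 str.replace（以下为示意片段，变量内容是假设的样例）：</p>

```python
# 假设的模板与发票文本样例，仅演示占位符替换
prompt_template = '发票文本内容\n--- \n{TEXT} \n---\n\n以 JSON 格式回答。'
text = '开票日期： 2023年01月06日 ...'

# 用 replace 而非 format，可避免模板中出现的 JSON 大括号触发 KeyError
filled_prompt = prompt_template.replace('{TEXT}', text)
print(filled_prompt)
```

<p>两种做法均可，分开传入时模型把模板当成系统指令、发票当成用户输入，语义上同样成立。</p>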
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#读取发票pdf</span>
<span class="n">content</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;data/1.pdf&#39;</span><span class="p">)</span>
<span class="c1">#读取prompt</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3:8b&#39;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
      <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span><span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
      <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span><span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">content</span><span class="p">},</span>
    <span class="p">])</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;```</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">```&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 20.5 ms, sys: 2.34 ms, total: 22.9 ms
Wall time: 3.58 s

{&#39;开票日期&#39;: &#39;2023年01月06日&#39;,
 &#39;应税货物(或服务)名称&#39;: &#39;*信息技术服务*技术服务费&#39;,
 &#39;价税合计(大写)&#39;: &#39;贰佰陆拾叁元整&#39;,
 &#39;税率&#39;: &#39;6%&#39;,
 &#39;备注&#39;: &#39;230106163474406331&#39;}
</code></pre></div><br>
<h3 id="33-封装成函数extract_info">3.3 Wrapping it into the function extract_info</h3>
<p>The experiment succeeded, so we wrap the logic into the function <em><strong>extract_info</strong></em>. To make the code more robust, the function includes exception handling and retries up to 3 times.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">re</span>


<span class="k">def</span> <span class="nf">extract_info</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
                <span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3:8b&#39;</span><span class="p">,</span>
                <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">text</span><span class="p">}</span>
                <span class="p">]</span>
            <span class="p">)</span>

            <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
            <span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;```</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">```&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
            <span class="k">return</span> <span class="n">result</span>
        
        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">attempt</span> <span class="o">&lt;</span> <span class="n">max_retries</span><span class="p">:</span>
                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;An error occurred: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">. Retrying (</span><span class="si">{</span><span class="n">attempt</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">)...&#34;</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">raise</span> <span class="n">e</span>
  
<span class="c1"># read the prompt</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">extract_info</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">prompt</span><span class="p">)</span>
<span class="n">result</span>
</code></pre></div><p>The result is identical to the one shown earlier; to save space it is not repeated here.</p>
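One caveat about the parsing step: <code>eval</code> executes whatever Python expression the model returns, which is unsafe if the reply is malformed or adversarial. A safer drop-in sketch (the helper name <code>parse_model_reply</code> is mine, not from the original code) extracts the fenced block with a regex and parses it with <code>ast.literal_eval</code>, which accepts only literals:

```python
import ast
import re


def parse_model_reply(reply):
    # grab the body of the first ``` fenced block, if the reply has one
    match = re.search(r'```(?:\w+)?\n(.*?)\n```', reply, re.DOTALL)
    payload = match.group(1) if match else reply
    # literal_eval evaluates only literals (dicts, lists, strings, numbers),
    # so arbitrary expressions raise an error instead of executing
    return ast.literal_eval(payload.strip())


reply = "```\n{'税率': '6%', '备注': '230106163474406331'}\n```"
print(parse_model_reply(reply))  # {'税率': '6%', '备注': '230106163474406331'}
```

Inside <code>extract_info</code>, <code>result = parse_model_reply(response['message']['content'])</code> would replace the two lines that slice and <code>eval</code> the reply.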
<br>
<h3 id="34-批量提取">3.4 Batch extraction</h3>
<p>Suppose the data folder contains hundreds or thousands of invoices (in reality there is only one). We batch-extract information from every PDF in the folder and store the results as a CSV file.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">ollama</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="k">def</span> <span class="nf">extract_info</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
                <span class="n">model</span><span class="o">=</span><span class="s1">&#39;llama3:8b&#39;</span><span class="p">,</span>
                <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;system&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
                    <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="n">text</span><span class="p">}</span>
                <span class="p">]</span>
            <span class="p">)</span>

            <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s1">&#39;message&#39;</span><span class="p">][</span><span class="s1">&#39;content&#39;</span><span class="p">]</span>
            <span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;```</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">```&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
            <span class="k">return</span> <span class="n">result</span>
        
        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">attempt</span> <span class="o">&lt;</span> <span class="n">max_retries</span><span class="p">:</span>
                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;An error occurred: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">. Retrying (</span><span class="si">{</span><span class="n">attempt</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">)...&#34;</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">raise</span> <span class="n">e</span>
                
  
<span class="c1"># this script sits in the same directory as the data folder</span>
<span class="c1"># collect the paths of all PDFs inside data</span>
<span class="n">pdf_files</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s1">&#39;data/</span><span class="si">{</span><span class="n">file</span><span class="si">}</span><span class="s1">&#39;</span> <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;data&#39;</span><span class="p">)</span> <span class="k">if</span> <span class="s1">&#39;.pdf&#39;</span> <span class="ow">in</span> <span class="n">file</span><span class="p">]</span>
<span class="c1"># read the prompt</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;prompt.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">dict_datas</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">pdf_file</span> <span class="ow">in</span> <span class="n">pdf_files</span><span class="p">:</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="n">pdf_file</span><span class="p">)</span>
    <span class="n">dict_data</span> <span class="o">=</span> <span class="n">extract_info</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">prompt</span><span class="p">)</span>
    <span class="n">dict_datas</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">dict_data</span><span class="p">)</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dict_datas</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;result.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 32 ms, sys: 2.17 ms, total: 15.2 ms
Wall time: 3.8 s
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
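In the loop above, one PDF whose extraction still fails after all retries aborts the whole batch. A variant worth considering (a sketch: <code>extract_info</code> is stubbed with a trivial stand-in so the example runs without ollama) records failures and keeps going:

```python
import pandas as pd


def extract_info(text, prompt, max_retries=3):
    # trivial stand-in for the article's model call: "parse" any text
    # that mentions 税率, fail on anything else
    if '税率' in text:
        return {'税率': '6%'}
    raise ValueError('could not parse model reply')


def batch_extract(texts, prompt):
    rows, failed = [], []
    for name, text in texts.items():
        try:
            rows.append({'file': name, **extract_info(text, prompt)})
        except Exception:
            failed.append(name)  # remember the bad file, keep going
    return pd.DataFrame(rows), failed


texts = {'a.pdf': '...税率 6%...', 'b.pdf': 'garbled scan'}
df, failed = batch_extract(texts, prompt='')
print(failed)  # ['b.pdf']
```

In the real pipeline the dict values would come from <code>ct.read_pdf</code>, and <code>failed</code> tells you exactly which invoices need manual review.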
<p><br><br></p>
<h2 id="四讨论">4. Discussion</h2>
<p>This article experimented with only a single invoice; in practice the accuracy is not this high, and recognition errors concentrate in the seller's taxpayer identification number (a field this case does not demonstrate). The main reason is that the text ct.read_pdf extracts from a PDF is rather messy, which challenges the model's semantic understanding. Large models now accept text, images, audio, video, and URLs as input, so this limitation should not take long to overcome.</p>
<p>For each input, a large model returns the answer it judges most probable, so extraction always carries some risk of misreading. To reduce that risk, prefer fields that are especially distinctive and conspicuous, such as the 价税合计(大写) (total amount in Chinese capital numerals) on an invoice: because it is written in special capital Chinese numerals, it is the most distinctive text on the page, so the model effectively gives it more weight and answers it more accurately.</p>
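That intuition can also be turned into a cheap automatic check (a sketch; the field names and regular expressions below are my assumptions and should be adjusted to your own invoices): validate each distinctive field against its expected shape and flag records that fail.

```python
import re

# expected shapes for a few distinctive invoice fields (assumed patterns)
RULES = {
    '税率': re.compile(r'^\d+(\.\d+)?%$'),                   # e.g. 6%
    '开票日期': re.compile(r'^\d{4}年\d{1,2}月\d{1,2}日$'),   # e.g. 2023年01月06日
    '价税合计(大写)': re.compile(r'^[零壹贰叁肆伍陆柒捌玖拾佰仟万亿元角分整]+$'),
}


def validate(record):
    """Return the names of fields whose values fail their pattern."""
    bad = []
    for field, pattern in RULES.items():
        value = record.get(field, '')
        if not pattern.match(value):
            bad.append(field)
    return bad


record = {'开票日期': '2023年01月06日', '税率': '6%',
          '价税合计(大写)': '贰佰陆拾叁元整'}
print(validate(record))  # [] -> all distinctive fields look plausible
```

Records with a non-empty result can be routed to a retry or to manual review instead of silently landing in the CSV.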
<br>
<br>
<h2 id="精选内容">Featured content</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | Datasets available for social science (economics &amp; management) research</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | Text-mining literature for social science (economics &amp; management)</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">Web scraping | Automatically collecting web data with scrapegraph-ai (an LLM approach)</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">Recommended | User manual for the text analysis library cntext</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">Paid video course | Building empirical indicators and text analysis with Python</a>
<br>
<br></li>
</ul>
<h2 id="cntext使用声明">cntext usage statement</h2>
<p>If you use cntext in research or a project, please mention it in the text and include a citation. For a suggested format, see the <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">recommended cntext citation format</a>.</p>
]]></content:encoded>
    </item>
    
    <item>
<title>Visualization | How to display Chinese in matplotlib</title>
      <link>https://textdata.cn/blog/2024-06-05-how-to-show-chinese-in-matplotlib-plotnine/</link>
      <pubDate>Wed, 05 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-05-how-to-show-chinese-in-matplotlib-plotnine/</guid>
<description>Displaying Chinese text correctly in matplotlib</description>
<content:encoded><![CDATA[<h2 id="一任务">1. The task</h2>
<p>We want to draw the figure below, with the Chinese text displayed correctly.</p>
<p><img loading="lazy" src="img/05-plot.png" alt=""  />
</p>
<br>
<h2 id="二实验数据">2. Experimental data</h2>
<p><img loading="lazy" src="img/01-website.png" alt=""  />
</p>
<p>The experimental data are compiled from <a href="https://textdata.cn/blog/2024-06-05-wenzheng-hunan-dataset/"><strong>Dataset | 300k 「问政湖南」 message &amp; reply records (2010-2024)</strong></a></p>
<p><img loading="lazy" src="img/02-wz-data.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">years</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">,</span> <span class="mi">2013</span><span class="p">,</span> <span class="mi">2014</span><span class="p">,</span> <span class="mi">2015</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">2017</span>
         <span class="p">,</span> 
         <span class="mi">2018</span><span class="p">,</span> <span class="mi">2019</span><span class="p">,</span> <span class="mi">2020</span><span class="p">,</span> <span class="mi">2021</span><span class="p">,</span> <span class="mi">2022</span><span class="p">,</span> <span class="mi">2023</span><span class="p">,</span> <span class="mi">2024</span><span class="p">]</span>
<span class="n">volumes</span> <span class="o">=</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">675</span><span class="p">,</span> <span class="mi">2173</span><span class="p">,</span> <span class="mi">2878</span><span class="p">,</span> <span class="mi">4159</span><span class="p">,</span> <span class="mi">5329</span><span class="p">,</span> <span class="mi">7570</span><span class="p">,</span> <span class="mi">12691</span><span class="p">,</span> 
           <span class="mi">23123</span><span class="p">,</span> <span class="mi">29724</span><span class="p">,</span> <span class="mi">31766</span><span class="p">,</span> <span class="mi">47054</span><span class="p">,</span> <span class="mi">51565</span><span class="p">,</span> <span class="mi">58666</span><span class="p">,</span> <span class="mi">24814</span><span class="p">]</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> 
                    <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volumes</span><span class="p">})</span>

<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/01-data.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三失败的可视化">3. A failed visualization</h2>
<p>Plotting with matplotlib</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(data.year, data.volume)
plt.plot(data.year, data.volume)
plt.xlabel(&#39;年份&#39;)
plt.ylabel(&#39;回复量&#39;)
plt.title(&#39;问政湖南回复量(2010-2024)&#39;)
plt.show()
</code></pre></div><p><img loading="lazy" src="img/02-failure.png" alt=""  />
</p>
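The boxes appear because matplotlib's default font (DejaVu Sans) contains no CJK glyphs. Before picking a replacement, it helps to check which font families matplotlib can actually see on the current machine; a small sketch (the candidate list is my assumption and far from exhaustive):

```python
from matplotlib import font_manager

# every font family name matplotlib has indexed on this machine
names = sorted({f.name for f in font_manager.fontManager.ttflist})

# a few commonly available CJK-capable families to look for
candidates = ['SimHei', 'Microsoft YaHei', 'PingFang SC',
              'Noto Sans CJK SC', 'WenQuanYi Micro Hei', 'Arial Unicode MS']
print([c for c in candidates if c in names])
```

Whatever names this prints are safe values for the font settings in the sections below.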
<br>
<p>Plotting with plotnine</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;问政湖南留言回复量(2010-2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-failure.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四使用系统内置字体">4. Using built-in system fonts</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>

<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># detect the operating system type</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># set the global font</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">volume</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">volume</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;问政湖南回复量(2010-2024)&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/04-matplotlib.png" alt=""  />
</p>
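Two standard settings complement the snippet above (general matplotlib configuration, not from the original post): <code>font.sans-serif</code> accepts a list of fallbacks, so one script can cover several systems, and <code>axes.unicode_minus</code> should be disabled because many Chinese fonts lack the Unicode minus glyph, turning negative tick labels into boxes:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt

# matplotlib walks this list and uses the first family it finds installed
plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei',
                                   'PingFang SC', 'Noto Sans CJK SC',
                                   'Arial Unicode MS', 'sans-serif']
# render '-' with the ASCII hyphen; many CJK fonts lack U+2212 (minus sign)
plt.rcParams['axes.unicode_minus'] = False

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([2010, 2024], [3, 24814], marker='o')
ax.set_title('问政湖南回复量(2010-2024)')
fig.savefig('demo.png')
```

Because the fallback list is consulted in order, the same two lines work on Windows, macOS, and Linux without the <code>platform</code> branching shown above.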
<p><br><br></p>
<h2 id="五使用chineseize-matplotlib">5. Using chineseize-matplotlib</h2>
<p>Install</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install chineseize-matplotlib
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">chineseize_matplotlib</span>

<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">volume</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">year</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">volume</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;问政湖南回复量(2010-2024)&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/05-chinese-mat.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="六使用外源ttf字体文件">6. Using an external TTF font file</h2>
<p>Download link for the font used in this article: <a href="%E6%96%87%E6%B3%89%E9%A9%BF%E5%BE%AE%E7%B1%B3%E9%BB%91.ttf">文泉驿微米黑.ttf</a> (WenQuanYi Micro Hei)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1"># 文泉驿微米黑.ttf is in the same folder as this script</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;问政湖南留言回复量(2010-2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-plot.png" alt=""  />
</p>
<br>
<p>A more polished version</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>
<span class="c1">## requires the mizani and plotnine_prism packages to be installed first</span>
<span class="kn">from</span> <span class="nn">plotnine_prism</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">mizani.breaks</span> <span class="kn">import</span> <span class="n">date_breaks</span>
<span class="kn">from</span> <span class="nn">mizani.formatters</span> <span class="kn">import</span> <span class="n">date_format</span>



<span class="c1"># 文泉驿微米黑.ttf is in the same folder as this script</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">])</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
    <span class="o">+</span><span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="n">date_breaks</span><span class="p">(</span><span class="s2">&#34;2 years&#34;</span><span class="p">),</span> <span class="n">labels</span><span class="o">=</span><span class="n">date_format</span><span class="p">(</span><span class="s2">&#34;%Y&#34;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;问政湖南留言回复量(2010-2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme_prism</span><span class="p">(</span><span class="n">base_family</span><span class="o">=</span><span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">())</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
          <span class="p">)</span>

<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/06-plot.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 30w条「问政湖南」留言&amp;回复数据(2010-2024)</title>
      <link>https://textdata.cn/blog/2024-06-05-wenzheng-hunan-dataset/</link>
      <pubDate>Wed, 05 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-05-wenzheng-hunan-dataset/</guid>
      <description>[问政湖南](https://wz.rednet.cn/#/leaveMsgList?reply=1)，类似于 [人民网地方领导留言板](https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/)， 数据信息量也很大， 网民留言日期2010~2024， 记录数约30w(截止2024-06-05)。 适合社会学、新闻学、公共管理、管理学等领域学者使用。</description>
      <content:encoded><![CDATA[<p><img loading="lazy" src="img/01-website.png" alt=""  />
</p>
<h2 id="一数据集">一、数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名称: 问政湖南
网站网址: https://wz.rednet.cn/#/leaveMsgList?reply=1
信息类型:  网民留言、地方机构(领导)回复
所含字段: 用户昵称、留言类型、标题、详细内容、投诉领域(子领域)、地方(省市)、地方领导、是否回复、回复机构、回复内容、回复时间等。
覆盖日期: 2010-10-28 ~ 2024-06-05
采集日期: 2024-06-05
记录条数: 302190
文件格式: csv/xlsx
文件大小: 990M
声明: 科研用途； 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><a href="https://wz.rednet.cn/#/leaveMsgList?reply=1">问政湖南</a>，类似于 <a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/">人民网地方领导留言板</a>， 数据信息量也很大， 网民留言日期 2010~2024， 记录数约 30w(截止 2024-06-05)。 适合社会学、新闻学、公共管理、管理学等领域学者使用。</p>
<p><img loading="lazy" src="img/02-wz-data.png" alt=""  />
</p>
<br>
<h3 id="12-声明">1.2 声明</h3>
<p>科研用途；如有问题， 请加微信 372335839，备注「姓名-学校-专业」</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;问政湖南.csv.gz&#39;</span><span class="p">,</span>
                 <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span>
                 <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="22-字段">2.2 字段</h3>
<p>人民网地方领导留言板只能查看最近 2 年的数据，除非爬虫早于 2 年前就开始运行，否则无法获取更早的数据；而且人民网经历过改版，前后字段无法对齐。</p>
<p>问政湖南网则不同于人民网的 html 页面，采用 json 格式存储数据，字段干净整洁。所以我们采集到的 2010~2024 年数据无需做字段对齐操作，拿来即可直接入库。</p>
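<p>下面用一个虚构的最小样例演示 pandas 的字段对齐行为：pd.concat 会按列名自动对齐，某一批数据缺失的字段会自动补成 NaN(示例数据为编造，仅作演示)。</p>

```python
import pandas as pd

# 虚构的两批不同年份样例数据，字段并不完全一致(仅作演示)
df_old = pd.DataFrame({'title': ['标题A'], 'content': ['内容A']})
df_new = pd.DataFrame({'title': ['标题B'], 'content': ['内容B'], 'reply_content': ['回复B']})

# pd.concat 按列名自动对齐，旧数据缺失的字段补 NaN
merged = pd.concat([df_old, df_new], ignore_index=True)
print(merged.columns.tolist())  # ['title', 'content', 'reply_content']
```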
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - nickname   留言者昵称
 - type_name  类型,如投诉、举报等
 - title      留言标题
 - content    留言内容
 - desc       留言内容(与content内容略微不同)
 - cate_child_name  问政主题子领域， 如违规补课
 - cate_name    问政主题， 如教育、交通等
 - created_at   留言时间
 - mobile       留言设备
 - star         留言获得的点赞数
 - company      地方机构，如市政府、市委等
 - job          领导岗位， 如市委书记、市长等
 - is_reply     是否回复,  1已办理， 2办理中
 - reply_name   回复机构名，如市委办公室等
 - is_self      是否为job自己回复
 - reply_content 回复内容
 - reply_is_edit 回复内容是否编辑
 - reply_time    回复时间
 - reply_published_at 回复内容发布时间
 - done_time      完成时间
 - reply_star     回复得到的点赞数
 - reply_video    回复视频链接
 - updated        留言更新时间
 - crawl_date     数据采集日期
</code></pre></div><br>
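<p>以上表字段含义为准(1=已办理, 2=办理中)，下面用几条虚构记录演示 is_reply 字段的简单统计思路(示例数据为编造)。</p>

```python
import pandas as pd

# 虚构几条记录，演示办理状态统计(1=已办理, 2=办理中)
demo = pd.DataFrame({'is_reply': [1, 1, 2, 1, 2]})

# 将状态码映射为可读标签后计数
status = demo['is_reply'].map({1: '已办理', 2: '办理中'})
print(status.value_counts().to_dict())  # {'已办理': 3, '办理中': 2}

# 布尔序列求均值即为"已办理"占比
done_rate = (demo['is_reply'] == 1).mean()
print(done_rate)  # 0.6
```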
<h3 id="23-起止日期">2.3 起止日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;留言日期(起): &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;留言日期(止): &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">留言日期(起):  2010-10-28
留言日期(止):  2024-06-05
</code></pre></div><br>
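<p>实际数据中日期字段偶有缺失或格式异常。一个稳妥的做法是给 pd.to_datetime 传入 errors='coerce'，让无法解析的值变为 NaT 再做统计(示例数据为编造)。</p>

```python
import pandas as pd

# 虚构样例：含正常日期、脏数据和缺失值
s = pd.Series(['2010-10-28', '2024-06-05', '异常值', None])

# errors='coerce' 把解析失败的值变为 NaT，而不是直接报错
dt = pd.to_datetime(s, errors='coerce')

print(dt.min().strftime('%Y-%m-%d'))  # 2010-10-28
print(dt.isna().sum())                # 2
```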
<h3 id="24-年度分布">2.4 年度分布</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">created_at</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">())</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>
<span class="n">data</span>
</code></pre></div><p><img loading="lazy" src="img/04-data.png" alt=""  />
</p>
<br>
<p><a href="https://textdata.cn/blog/2024-06-05-how-to-show-chinese-in-matplotlib-plotnine/">可视化 | 如何在 matplotlib 中显示中文</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">created_at</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">())</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;count&#39;</span><span class="p">]</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_col</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span>
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;问政湖南留言回复量(2010-2024.6)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span>
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;回复量&#39;</span><span class="p">)</span>
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/05-plot.png" alt=""  />
</p>
<br>
<h3 id="25-问政主题">2.5 问政主题</h3>
<p>查看 2010-2024.6 期间，不同留言 <strong><em>主题类别</em></strong> 的记录数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;cate_name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">cate_name
住建         59316
交通运输       16812
公安         15855
教育         15781
农业农村       12883
生态环境       10996
人社          9775
城管          9596
市场监管        7464
干部          6075
其他          4685
司法          4646
自然资源        4273
卫生健康        3647
民政          3435
医疗保障        2589
水利          2352
金融          1610
通信          1594
电力          1579
财政税收        1267
物价          1099
商务           489
应急管理         247
特种设备、作业       31
烟花爆竹经营        19
安全生产和管理       10
消防救援           3
电动车违规行为        1
Name: count, dtype: int64
</code></pre></div><br>
<h3 id="26-查看某类词">2.6 查看某类词</h3>
<p>查看字段 <strong><em>content(留言内容)</em></strong> 中是否出现 <strong><em>扰民|噪音</em></strong> 等词语</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;扰民|噪音&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0         False
1         False
2         False
3         False
4         False
          ...
302185    False
302186    False
302187    False
302188    False
302189     True
Name: content, Length: 302190, dtype: bool
</code></pre></div><br>
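<p>除了先 fillna('') 再 contains，也可以直接给 str.contains 传入 na=False，让缺失值直接记为 False，两种写法结果一致(示例数据为编造)。</p>

```python
import pandas as pd

# 虚构样例：含缺失值的留言内容
s = pd.Series(['小区广场舞扰民', '道路维修咨询', None])

mask1 = s.fillna('').str.contains('扰民|噪音')
mask2 = s.str.contains('扰民|噪音', na=False)  # 缺失值直接记为 False

print(mask1.tolist() == mask2.tolist())  # True
print(mask2.tolist())                    # [True, False, False]
```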
<p>统计含 <strong><em>扰民|噪音</em></strong> 的留言(回复)记录总数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[&#39;content&#39;].fillna(&#39;&#39;).str.contains(&#39;扰民|噪音&#39;).sum()
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">13292
</code></pre></div><p>噪音的留言回复记录占总记录数的比例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;扰民|噪音&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0.04398557199113141
</code></pre></div><p>跟人民网地方领导留言板的结果十分相似，也是 4%。</p>
<br>
<br>
<br>
<h2 id="三相关内容">三、相关内容</h2>
<h3 id="31-相关研究">3.1 相关研究</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
王磊,易扬.公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J].公共管理学报,2022,19(04):65-78+169.

[2]Lu, Liangdong, Jia Xu, and Jiuchang Wei. &#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&#34; Telematics and Informatics 83 (2023): 102028.
...
</code></pre></div><br>
<h3 id="32-相关推文">3.2 相关推文</h3>
<p><a href="https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/">数据集 | 人民网地方领导留言板原始文本(2011-2023.12)</a></p>
<p><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练 Word2Vec 模型</a></p>
<br>
<h3 id="33-相关链接">3.3 相关链接</h3>
<p>与 <a href="https://wz.rednet.cn/#/leaveMsgList?reply=1">问政湖南网</a> 最相关的网站还有</p>
<ul>
<li><a href="https://www.rednet.cn/">红网</a></li>
<li><a href="https://people.rednet.cn/front/messages/list?type_id=8">百姓呼声</a></li>
<li><a href="https://315.rednet.cn/#/leaveMsgList">消费维权</a></li>
<li><a href="https://law.rednet.cn/consult.html">问法湖南</a></li>
</ul>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext2.x 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python 实证指标构建与文本分析</a><br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>播客数据集 | 30w播客(Podcast)的560w条评论数据(2005-2023)</title>
      <link>https://textdata.cn/blog/2024-06-03-podcasts-dataset/</link>
      <pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-03-podcasts-dataset/</guid>
      <description>&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;媒体名称: Podcast
数据来源: https://podcasts.apple.com/
覆盖年度: 2005-12-10 ~ 2023-03-07
博客id数量: 303911
评论条数: 5607021
所含字段: podcast_id、title、content、rating、author_id、created_at、category等
获取数据: 200元，加微信 372335839， 备注「姓名-学校-专业-播客」。
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;规模庞大，字段内容丰富，适合社会学、新闻与传播学、语言学、经济学、管理学等领域学者开展研究。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二读取数据&#34;&gt;二、读取数据&lt;/h2&gt;
&lt;p&gt;使用 &lt;code&gt;pandas.read_json()&lt;/code&gt; 读取&lt;/p&gt;
&lt;h3 id=&#34;21-podcastsjson&#34;&gt;2.1 podcasts.json&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;podcasts.json&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#查看podcasts.json字段&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Index([&amp;#39;podcast_id&amp;#39;, &amp;#39;itunes_id&amp;#39;, &amp;#39;slug&amp;#39;, &amp;#39;itunes_url&amp;#39;, &amp;#39;title&amp;#39;, &amp;#39;author&amp;#39;,
       &amp;#39;description&amp;#39;, &amp;#39;average_rating&amp;#39;, &amp;#39;ratings_count&amp;#39;, &amp;#39;scraped_at&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-pdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-categoriesjson&#34;&gt;2.2 categories.json&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;categories.json&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#categories.json字段&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Index([&amp;#39;podcast_id&amp;#39;, &amp;#39;itunes_id&amp;#39;, &amp;#39;category&amp;#39;], dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-cdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-reviewsjson&#34;&gt;2.3 reviews.json&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;rdf = pd.read_json(&amp;#39;reviews.json&amp;#39;, lines=True)

#reviews.json字段
print(rdf.columns)
rdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Index([&amp;#39;podcast_id&amp;#39;, &amp;#39;title&amp;#39;, &amp;#39;content&amp;#39;, &amp;#39;rating&amp;#39;, &amp;#39;author_id&amp;#39;, &amp;#39;created_at&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-rdf.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
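三个文件可以通过 podcast_id 字段互相关联。下面用一个虚构的最小样例演示 reviews 与 podcasts 的合并思路(字段名与上文一致，数据为编造)。

```python
import pandas as pd

# 虚构的最小样例，字段名与 podcasts.json / reviews.json 一致
pdf = pd.DataFrame({'podcast_id': ['p1', 'p2'],
                    'title': ['China Money Podcast', 'Tech Talk']})
rdf = pd.DataFrame({'podcast_id': ['p1', 'p1', 'p2'],
                    'content': ['great', 'nice', 'ok'],
                    'rating': [5, 4, 3]})

# 以评论为左表，按 podcast_id 带出播客标题
merged = rdf.merge(pdf, on='podcast_id', how='left')
print(merged.shape)             # (3, 4)
print(merged['rating'].mean())  # 4.0
```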
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验&#34;&gt;三、实验&lt;/h2&gt;
&lt;h3 id=&#34;31-筛选出含某关键词的播客名&#34;&gt;3.1 筛选出含某关键词的播客名&lt;/h3&gt;
&lt;p&gt;从 &lt;em&gt;&lt;strong&gt;podcasts.json&lt;/strong&gt;&lt;/em&gt; 中筛选出标题含 &lt;em&gt;&lt;strong&gt;China&lt;/strong&gt;&lt;/em&gt; 的播客记录&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;china_podcast_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;title&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;China&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_podcast_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-pdf-title.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#查看这86个播客名&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;china_podcast_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;China Arts Podcast&amp;#39;
 &amp;#39;Made in China Podcast: International Business | Crowdfunding | Entrepreneurship&amp;#39;
 &amp;#39;Chinasource Recently Added Resources&amp;#39; &amp;#39;TIC China Network&amp;#39; &amp;#39;UNDP China&amp;#39;
 &amp;#39;Wellness in China&amp;#39; &amp;#39;Party In China&amp;#39; &amp;#39;Tails From China&amp;#39; &amp;#39;Focus on China&amp;#39;
 &amp;#39;CEIBS China Knowledge&amp;#39; &amp;#39;Bottled in China&amp;#39; &amp;#39;Environment China&amp;#39;
 &amp;#39;China Money Podcast - Audio Episodes&amp;#39;
 &amp;#39;China Money Podcast - Video Episodes&amp;#39;
 &amp;#39;China Jedi Podcast: Expat Life | Chinese Culture | Business | Travel | China&amp;#39;
 &amp;#39;China Digital Marketing Podcast&amp;#39; &amp;#39;Goodbye China Podcast&amp;#39;
 &amp;#39;History and Story of China&amp;#39; &amp;#39;Made in China&amp;#39;
 &amp;#39;China Voices: The AmCham Shanghai Podcast&amp;#39;
......
 &amp;#34;China Now&amp;#39;s Podcast&amp;#34; &amp;#39;China: As History Is My Witness&amp;#39;
 &amp;#39;Safeguarding Dunhuang for China and the World&amp;#39; &amp;#39;Biz China&amp;#39;
 &amp;#39;Chinaman Talks Sports&amp;#39; &amp;#39;China in the World&amp;#39; &amp;#39;The History of China&amp;#39;
 &amp;#34;Forbidden City: Inside the Court of China&amp;#39;s Emperors&amp;#34;
 &amp;#39;NAFTA at Twenty: Trade, Transformation and the China Factor&amp;#39;
 &amp;#39;NAFTA at Twenty: Trade, Transformation and the China Factor (Audio Only)&amp;#39;
 &amp;#39;China and the Chinese by Herbert Allen Giles&amp;#39; &amp;#39;China Doing Sweden&amp;#39;
 &amp;#39;China MSG&amp;#39; &amp;#39;Yellow Star: China News&amp;#39; &amp;#39;Made in China&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-筛选出含某关键词的内容名&#34;&gt;3.2 筛选出含某关键词的内容名&lt;/h3&gt;
&lt;p&gt;筛选出标题含 &lt;em&gt;&lt;strong&gt;China&lt;/strong&gt;&lt;/em&gt; 的评论记录。注意 podcast 的 title 固定不变，而 reviews.json 中每条评论的 title 是评论者自拟的，各条不同。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;#从 reviews.json 中筛选出含 China 或 中国 的评论记录
china_title_df = rdf[rdf[&amp;#39;title&amp;#39;].fillna(&amp;#39;&amp;#39;).str.contains(&amp;#39;China|中国&amp;#39;)]
china_title_df
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-rdf-title.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;china_title_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#34;What&amp;#39;s a China?&amp;#34; &amp;#39;Thanks Justin - from China&amp;#39;
 &amp;#39;American Working in China Coffee Industry&amp;#39; &amp;#39;Babybee in China&amp;#39;
 &amp;#39;Listening From China!!&amp;#39; &amp;#39;Right on China.&amp;#39; &amp;#39;Excellent China Series!&amp;#39;
 &amp;#39;China Trade War episode was fantastic&amp;#39;
 &amp;#39;Really enjoyed the China / Tariff discussion&amp;#39; &amp;#39;China Review&amp;#39;
 &amp;#39;Beautiful videos of China!&amp;#39; &amp;#39;Learn about The Real China business&amp;#39;
 &amp;#39;Doing business in China? Listen to this!&amp;#39; &amp;#39;China&amp;#39;
 &amp;#34;Insightful look into China&amp;#39;s growing influence&amp;#34;
 &amp;#39;Great smart brevity on China&amp;#39; &amp;#39;Great insights about China&amp;#39;
 &amp;#39;Best tech podcast for China&amp;#39;
 &amp;#39;Great introduction to China’s history&amp;#39;
......
 &amp;#39;Jump into the rabbit hole of China Tech 🕳&amp;#39; &amp;#39;你好 from China!&amp;#39;
 &amp;#39;Blong in China&amp;#39;
 &amp;#39;Informational but the misconception of Gaokao in China is awkward (gatteca&amp;#39;
 &amp;#39;Listening from China&amp;#39; &amp;#39;Not available in China&amp;#39; &amp;#39;With Love from China&amp;#39;
 &amp;#39;Great talent from China.&amp;#39; &amp;#39;First time to listen to dj music from China&amp;#39;
 &amp;#39;Emergency China podcast was unreal&amp;#39; &amp;#39;China Episode&amp;#39; &amp;#39;China&amp;#39;
 &amp;#39;矮大紧老师的确是现代中国文化圈里面的高山晓辉里的奇松&amp;#39; &amp;#39;Love the China rant&amp;#39; &amp;#39;中国好&amp;#39;
 &amp;#39;Powerful rant on China much needed&amp;#39; &amp;#39;NBA and China&amp;#39;
 &amp;#39;Life in China is Awesome!&amp;#39; &amp;#39;Worthy China Podcast&amp;#39;
 &amp;#39;Learn More About China Now&amp;#39; &amp;#39;Michael from China&amp;#39;
 &amp;#39;Best Survey of China Lecture in iTunes U&amp;#39; &amp;#39;China&amp;#39; &amp;#39;Band in China&amp;#39;
 &amp;#39;Band in China&amp;#39; &amp;#39;关于中国生活有趣的观点&amp;#39; &amp;#39;Deep and personal angle to look at China&amp;#39;
 &amp;#39;A must-listen podcast for understanding the current and future China&amp;#39;
 &amp;#39;Stop crying about China&amp;#39; &amp;#39;New podcast from a great China program&amp;#39;
 &amp;#39;Saying hi from China&amp;#39; &amp;#39;终于有一档中国记者做的播客&amp;#39; &amp;#39;China’s’  Detention Camps&amp;#39;
......
 &amp;#39;Required listening to keep up with contemporary China&amp;#39;
 &amp;#39;Most antiChina guests and content&amp;#39; &amp;#39;Fantastic China-centric podcast&amp;#39;
 &amp;#39;Great, well rounded look at China&amp;#39; &amp;#39;Great info and insights on China&amp;#39;
 &amp;#39;The best Podcast on China-related topics&amp;#39; &amp;#39;Big trouble in little China&amp;#39;
 &amp;#39;中国最好的游戏广播。&amp;#39; &amp;#39;中国第一家做游戏广播的！！&amp;#39; &amp;#39;The best game radio in China!&amp;#39;
 &amp;#39;Best Podcast on China’s History&amp;#39;
 &amp;#39;Great China Insights and interview topics&amp;#39;
 &amp;#39;Howard Whiteson’s China based interviews are Short Concise well- easy&amp;#39;
 &amp;#39;Excellent source for politics in China&amp;#39; &amp;#39;Good honest reporting on China&amp;#39;
 &amp;#34;GOD&amp;#39;S Warning About China&amp;#34; &amp;#39;Hilarious English Pod in China!&amp;#39;
 &amp;#39;Bursting with China Healthcare Insights&amp;#39; &amp;#39;China oh China&amp;#39;
 &amp;#39;The Real China Story&amp;#39;
 &amp;#39;China’s ambitions and their impact: Insightfully and compellingkt, weaves the micro and the macro&amp;#39;
 &amp;#39;Sets the bar for China and international reporting&amp;#39;
 &amp;#34;Amazingly balanced and detailed account of China&amp;#39;s growing influence around the world&amp;#34;
 &amp;#39;On China’s New Silk Road&amp;#39; &amp;#39;China’s plan for the future&amp;#39;
 &amp;#39;Great new Content on China and Sede Vacante&amp;#39; &amp;#39;没有中国特色&amp;#39;
 &amp;#39;“China Joe need we say more”&amp;#39;
 &amp;#39;Interesting and informative podcast on China&amp;#39;
 &amp;#39;SCTV from the South China Sea&amp;#39; &amp;#39;China and Omicron&amp;#39; &amp;#39;Strangers in China&amp;#39;
 &amp;#39;China seems very scary&amp;#39; &amp;#39;China Lockdown&amp;#39;
 &amp;#39;I travel to China regularly just to listen&amp;#39;
 &amp;#39;Best American News I Can Find in China!!!!&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-筛选出含某关键词的评论&#34;&gt;3.3 筛选出含某关键词的评论&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#从 reviews.json 中筛选出含 China 或 中国 的评论记录&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_reviews_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;China|中国&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;china_reviews_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-rdf-content.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;四获取方式&#34;&gt;四、获取方式&lt;/h2&gt;
&lt;p&gt;200元，加微信 372335839， 备注「姓名-学校-专业-播客」。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库cntext使用手册&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;代码 | 如何处理远超电脑内存的csv文件&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-14-daily-news-dataset/&#34;&gt;50G新闻数据集 | 含 人民日报/光明日报/参考消息/经济日报 等 60+ 家媒体(更新至2024.05)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/&#34;&gt;&lt;strong&gt;代码 | 使用「新闻数据」构造概念词提及量「面板数据」&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/&#34;&gt;代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">媒体名称: Podcast
数据来源: https://podcasts.apple.com/
覆盖年度: 2005-12-10 ~ 2023-03-07
博客id数量: 303911
评论条数: 5607021
所含字段: podcast_id、title、content、rating、author_id、created_at、category等
获取数据: 200元，加微信 372335839， 备注「姓名-学校-专业-播客」。
</code></pre></div><p><img loading="lazy" src="img/01-screen.png" alt=""  />
</p>
<p>规模庞大，字段内容丰富，适合社会学、新闻与传播学、语言学、经济学、管理学等领域学者开展研究。</p>
<br>
<br>
<h2 id="二读取数据">二、读取数据</h2>
<p>使用 <code>pandas.read_json()</code> 读取</p>
<h3 id="21-podcastsjson">2.1 podcasts.json</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">pdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;podcasts.json&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1">#查看podcasts.json字段</span>
<span class="nb">print</span><span class="p">(</span><span class="n">pdf</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">pdf</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;podcast_id&#39;, &#39;itunes_id&#39;, &#39;slug&#39;, &#39;itunes_url&#39;, &#39;title&#39;, &#39;author&#39;,
       &#39;description&#39;, &#39;average_rating&#39;, &#39;ratings_count&#39;, &#39;scraped_at&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><p><img loading="lazy" src="img/02-pdf.png" alt=""  />
</p>
<br>
<h3 id="22-categoriesjson">2.2 categories.json</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;categories.json&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1">#categories.json字段</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cdf</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">cdf</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;podcast_id&#39;, &#39;itunes_id&#39;, &#39;category&#39;], dtype=&#39;object&#39;)
</code></pre></div><p><img loading="lazy" src="img/03-cdf.png" alt=""  />
</p>
<br>
<h3 id="23-reviewsjson">2.3 reviews.json</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">rdf = pd.read_json(&#39;reviews.json&#39;, lines=True)

#reviews.json字段
print(rdf.columns)
rdf
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;podcast_id&#39;, &#39;title&#39;, &#39;content&#39;, &#39;rating&#39;, &#39;author_id&#39;, &#39;created_at&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><p><img loading="lazy" src="img/04-rdf.png" alt=""  />
</p>
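<p>三个文件可通过 podcast_id 字段关联。下面用虚构的小样本演示如何以 <code>pandas.merge</code> 把播客名合并到每条评论上（字段名按上文 podcasts.json / reviews.json 的字段假设，数据仅为示意）。</p>

```python
import pandas as pd

# 虚构小样本，字段结构与 podcasts.json / reviews.json 一致
pdf = pd.DataFrame({'podcast_id': ['p1', 'p2'],
                    'title': ['China Money Podcast', 'Tech Talk']})
rdf = pd.DataFrame({'podcast_id': ['p1', 'p1', 'p2'],
                    'title': ['Great show', 'Love it', 'Meh'],
                    'rating': [5, 5, 3]})

# 以 podcast_id 为键左连接；两表都有 title 字段，用 suffixes 区分
merged = rdf.merge(pdf, on='podcast_id', how='left',
                   suffixes=('_review', '_podcast'))
print(merged[['podcast_id', 'title_podcast', 'title_review', 'rating']])
```

<p>真实数据量较大，合并前可先筛掉不需要的列以节省内存。</p>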
<p><br><br></p>
<h2 id="三实验">三、实验</h2>
<h3 id="31-筛选出含某关键词的播客名">3.1 筛选出含某关键词的播客名</h3>
<p>从 <em><strong>podcasts.json</strong></em> 中筛选出标题含 <em><strong>China</strong></em> 的播客记录（下方代码只匹配英文 China；如需同时匹配「中国」，可把匹配模式改为 <code>China|中国</code>）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">china_podcast_df</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;title&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;China&#39;</span><span class="p">)]</span>
<span class="n">china_podcast_df</span>
</code></pre></div><p><img loading="lazy" src="img/05-pdf-title.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#查看这86个播客名</span>
<span class="nb">print</span><span class="p">(</span><span class="n">china_podcast_df</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;China Arts Podcast&#39;
 &#39;Made in China Podcast: International Business | Crowdfunding | Entrepreneurship&#39;
 &#39;Chinasource Recently Added Resources&#39; &#39;TIC China Network&#39; &#39;UNDP China&#39;
 &#39;Wellness in China&#39; &#39;Party In China&#39; &#39;Tails From China&#39; &#39;Focus on China&#39;
 &#39;CEIBS China Knowledge&#39; &#39;Bottled in China&#39; &#39;Environment China&#39;
 &#39;China Money Podcast - Audio Episodes&#39;
 &#39;China Money Podcast - Video Episodes&#39;
 &#39;China Jedi Podcast: Expat Life | Chinese Culture | Business | Travel | China&#39;
 &#39;China Digital Marketing Podcast&#39; &#39;Goodbye China Podcast&#39;
 &#39;History and Story of China&#39; &#39;Made in China&#39;
 &#39;China Voices: The AmCham Shanghai Podcast&#39;
......
 &#34;China Now&#39;s Podcast&#34; &#39;China: As History Is My Witness&#39;
 &#39;Safeguarding Dunhuang for China and the World&#39; &#39;Biz China&#39;
 &#39;Chinaman Talks Sports&#39; &#39;China in the World&#39; &#39;The History of China&#39;
 &#34;Forbidden City: Inside the Court of China&#39;s Emperors&#34;
 &#39;NAFTA at Twenty: Trade, Transformation and the China Factor&#39;
 &#39;NAFTA at Twenty: Trade, Transformation and the China Factor (Audio Only)&#39;
 &#39;China and the Chinese by Herbert Allen Giles&#39; &#39;China Doing Sweden&#39;
 &#39;China MSG&#39; &#39;Yellow Star: China News&#39; &#39;Made in China&#39;]
</code></pre></div><br>
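<p>筛选出的子集还可以直接做描述统计。下面用虚构样本演示统计 China 相关播客的数量与平均评分（average_rating、ratings_count 为上文 podcasts.json 的字段，数据仅为示意）。</p>

```python
import pandas as pd

# 虚构样本，字段与 podcasts.json 一致
pdf = pd.DataFrame({
    'title': ['Environment China', 'Biz China', 'Tech Talk'],
    'average_rating': [4.5, 4.0, 3.5],
    'ratings_count': [120, 80, 300],
})

# 标题含 China 的播客子集及其平均评分
china_podcast_df = pdf[pdf['title'].fillna('').str.contains('China')]
print(len(china_podcast_df))                      # 2
print(china_podcast_df['average_rating'].mean())  # 4.25
```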
<h3 id="32-筛选出含某关键词的内容名">3.2 筛选出含某关键词的评论标题</h3>
<p>从 <em><strong>reviews.json</strong></em> 中筛选出标题含 <em><strong>China</strong></em> 或 <em><strong>中国</strong></em> 的评论记录。注意：podcasts.json 中的 title 是固定的播客名，而 reviews.json 中的 title 是每条评论各自的标题。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">#从 reviews.json 中筛选出标题含 China 或 中国 的评论记录
china_title_df = rdf[rdf[&#39;title&#39;].fillna(&#39;&#39;).str.contains(&#39;China|中国&#39;)]
china_title_df
</code></pre></div><p><img loading="lazy" src="img/06-rdf-title.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">china_title_df</span><span class="o">.</span><span class="n">content</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#34;What&#39;s a China?&#34; &#39;Thanks Justin - from China&#39;
 &#39;American Working in China Coffee Industry&#39; &#39;Babybee in China&#39;
 &#39;Listening From China!!&#39; &#39;Right on China.&#39; &#39;Excellent China Series!&#39;
 &#39;China Trade War episode was fantastic&#39;
 &#39;Really enjoyed the China / Tariff discussion&#39; &#39;China Review&#39;
 &#39;Beautiful videos of China!&#39; &#39;Learn about The Real China business&#39;
 &#39;Doing business in China? Listen to this!&#39; &#39;China&#39;
 &#34;Insightful look into China&#39;s growing influence&#34;
 &#39;Great smart brevity on China&#39; &#39;Great insights about China&#39;
 &#39;Best tech podcast for China&#39;
 &#39;Great introduction to China’s history&#39;
......
 &#39;Jump into the rabbit hole of China Tech 🕳&#39; &#39;你好 from China!&#39;
 &#39;Blong in China&#39;
 &#39;Informational but the misconception of Gaokao in China is awkward (gatteca&#39;
 &#39;Listening from China&#39; &#39;Not available in China&#39; &#39;With Love from China&#39;
 &#39;Great talent from China.&#39; &#39;First time to listen to dj music from China&#39;
 &#39;Emergency China podcast was unreal&#39; &#39;China Episode&#39; &#39;China&#39;
 &#39;矮大紧老师的确是现代中国文化圈里面的高山晓辉里的奇松&#39; &#39;Love the China rant&#39; &#39;中国好&#39;
 &#39;Powerful rant on China much needed&#39; &#39;NBA and China&#39;
 &#39;Life in China is Awesome!&#39; &#39;Worthy China Podcast&#39;
 &#39;Learn More About China Now&#39; &#39;Michael from China&#39;
 &#39;Best Survey of China Lecture in iTunes U&#39; &#39;China&#39; &#39;Band in China&#39;
 &#39;Band in China&#39; &#39;关于中国生活有趣的观点&#39; &#39;Deep and personal angle to look at China&#39;
 &#39;A must-listen podcast for understanding the current and future China&#39;
 &#39;Stop crying about China&#39; &#39;New podcast from a great China program&#39;
 &#39;Saying hi from China&#39; &#39;终于有一档中国记者做的播客&#39; &#39;China’s’  Detention Camps&#39;
......
 &#39;Required listening to keep up with contemporary China&#39;
 &#39;Most antiChina guests and content&#39; &#39;Fantastic China-centric podcast&#39;
 &#39;Great, well rounded look at China&#39; &#39;Great info and insights on China&#39;
 &#39;The best Podcast on China-related topics&#39; &#39;Big trouble in little China&#39;
 &#39;中国最好的游戏广播。&#39; &#39;中国第一家做游戏广播的！！&#39; &#39;The best game radio in China!&#39;
 &#39;Best Podcast on China’s History&#39;
 &#39;Great China Insights and interview topics&#39;
 &#39;Howard Whiteson’s China based interviews are Short Concise well- easy&#39;
 &#39;Excellent source for politics in China&#39; &#39;Good honest reporting on China&#39;
 &#34;GOD&#39;S Warning About China&#34; &#39;Hilarious English Pod in China!&#39;
 &#39;Bursting with China Healthcare Insights&#39; &#39;China oh China&#39;
 &#39;The Real China Story&#39;
 &#39;China’s ambitions and their impact: Insightfully and compellingkt, weaves the micro and the macro&#39;
 &#39;Sets the bar for China and international reporting&#39;
 &#34;Amazingly balanced and detailed account of China&#39;s growing influence around the world&#34;
 &#39;On China’s New Silk Road&#39; &#39;China’s plan for the future&#39;
 &#39;Great new Content on China and Sede Vacante&#39; &#39;没有中国特色&#39;
 &#39;“China Joe need we say more”&#39;
 &#39;Interesting and informative podcast on China&#39;
 &#39;SCTV from the South China Sea&#39; &#39;China and Omicron&#39; &#39;Strangers in China&#39;
 &#39;China seems very scary&#39; &#39;China Lockdown&#39;
 &#39;I travel to China regularly just to listen&#39;
 &#39;Best American News I Can Find in China!!!!&#39;]
</code></pre></div><br>
<h3 id="33-筛选出含某关键词的评论">3.3 筛选出含某关键词的评论</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#从 reviews.json 中筛选出含 China 或 中国 的评论记录</span>
<span class="n">china_reviews_df</span> <span class="o">=</span> <span class="n">rdf</span><span class="p">[</span><span class="n">rdf</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;China|中国&#39;</span><span class="p">)]</span>
<span class="n">china_reviews_df</span>
</code></pre></div><p><img loading="lazy" src="img/07-rdf-content.png" alt=""  />
</p>
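<p>筛出 China 相关评论后，可结合 created_at 字段观察其逐年走势。下面是一个最小示意（created_at 的具体时间格式以实际数据为准，此处用虚构样本）。</p>

```python
import pandas as pd

# 虚构样本：content 为评论文本，created_at 为时间戳
rdf = pd.DataFrame({
    'content': ['Listening from China', 'Great podcast', '中国好', 'Nice'],
    'created_at': ['2019-05-01', '2019-06-02', '2020-01-03', '2020-02-04'],
})

# 先筛选出含 China 或 中国 的评论，再按年份计数
china = rdf[rdf['content'].fillna('').str.contains('China|中国')].copy()
china['year'] = pd.to_datetime(china['created_at']).dt.year
yearly = china.groupby('year').size()
print(yearly)
```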
<br>
<h2 id="四获取方式">四、获取方式</h2>
<p>200元，加微信 372335839， 备注「姓名-学校-专业-播客」。</p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">50G新闻数据集 | 含 人民日报/光明日报/参考消息/经济日报 等 60+ 家媒体(更新至2024.05)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>新闻数据集 | 1102w条纽约时报(1920-2020)</title>
      <link>https://textdata.cn/blog/2024-06-01-new-york-times-article-from-1920-2020/</link>
      <pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-06-01-new-york-times-article-from-1920-2020/</guid>
      <description>新闻数据集研究价值大， 您可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU 、 媒体关注度指数、文本相似度、情感分析。而且可训练词向量，构建新的词典，开发新的指标指数。计算机自然语言处理、经济学、管理学、新闻传播学、公共管理等领域均可使用。</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">媒体名称: New York Times
覆盖年度: 1920 ~ 2020
记录条数: 11027535
所含字段: year, title, excerpt
数据集地址: https://www.kaggle.com/datasets/tumanovalexander/nyt-articles-data/data
</code></pre></div><p><img loading="lazy" src="img/screen.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;nyt_data.parquet&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-文本长度">2.2 文本长度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">title_mean_len</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">excerpt_mean_len</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">excerpt</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;标题平均长度: </span><span class="si">{</span><span class="n">title_mean_len</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;摘录平均长度: </span><span class="si">{</span><span class="n">excerpt_mean_len</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">标题平均长度: 173.30
摘录平均长度: 68.43
</code></pre></div><br>
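<p>除了文本长度，也常需要按年份统计报道条数，这是构造媒体关注度等面板指标的第一步。下面用虚构小样本演示（字段与上文 nyt_data.parquet 的 year/title/excerpt 一致）。</p>

```python
import pandas as pd

# 虚构样本，字段与 nyt_data.parquet 一致
df = pd.DataFrame({
    'year': [1920, 1920, 1921],
    'title': ['A', 'B', 'C'],
    'excerpt': ['x', None, 'y'],
})

# 按年份统计报道条数
yearly_counts = df.groupby('year').size()
print(yearly_counts)
```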
<h3 id="23-缺失率">2.3 缺失率</h3>
<p>这里我们约定：文本长度为 0 时，即视该字段为缺失。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">title_na_ratio</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span> <span class="o">/</span> <span class="n">df</span><span class="o">.</span><span class="n">size</span>
<span class="n">excerpt_na_ratio</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">excerpt</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span> <span class="o">/</span> <span class="n">df</span><span class="o">.</span><span class="n">size</span>

<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;标题缺失率: </span><span class="si">{</span><span class="n">title_na_ratio</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1">%&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;摘录缺失率: </span><span class="si">{</span><span class="n">excerpt_na_ratio</span><span class="si">:</span><span class="s1">.2f</span><span class="si">}</span><span class="s1">%&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">标题缺失率: 0.00%
摘录缺失率: 52.25%
</code></pre></div><p><br><br></p>
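<p>补充一点：<code>str.len()</code> 对 NaN 返回 NaN，与 0 比较时结果为 False，因此真正的 NaN 缺失不会被上面的写法计入。若想把 NaN 也算作缺失，可先 <code>fillna('')</code> 再判断（下面是虚构小样本的示意）。</p>

```python
import pandas as pd

# 虚构样本：'' 与 None 都应视为缺失
df = pd.DataFrame({'title': ['A', 'B', ''],
                   'excerpt': ['x', None, '']})

# fillna('') 后长度为 0 即视为缺失，NaN 同样被计入
title_na_ratio = 100 * df['title'].fillna('').str.len().eq(0).mean()
excerpt_na_ratio = 100 * df['excerpt'].fillna('').str.len().eq(0).mean()
print(f'标题缺失率: {title_na_ratio:.2f}%')
print(f'摘录缺失率: {excerpt_na_ratio:.2f}%')
```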
<h2 id="类似的数据集">类似的数据集</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">媒体名称: Times of India
覆盖年度: 2001 ~ 2023.q2
记录条数: 3876557
所含字段: publish_date, headline_category, headline_text
数据集地址: https://www.kaggle.com/datasets/therohk/india-headlines-news-dataset
</code></pre></div><p><br><br></p>
<h2 id="三相关内容">三、相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">50G新闻数据集 | 含 人民日报/光明日报/参考消息/经济日报 等 60+ 家媒体(更新至2024.05)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">代码 | 如何处理远超电脑内存的csv文件</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「经济政策不确定性EPU」指标</a></p>
</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>图文 | PyCharm专业版下载&amp;安装&amp;激活</title>
      <link>https://textdata.cn/blog/2024-05-27-pychram-professional-installation-and-usage/</link>
      <pubDate>Mon, 27 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-05-27-pychram-professional-installation-and-usage/</guid>
      <description>&lt;h2 id=&#34;一pycharm&#34;&gt;一、PyCharm&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://www.jetbrains.com/zh-cn/pycharm/&#34;&gt;PyCharm&lt;/a&gt; 是一款功能强大的 Python 集成开发环境（IDE），提供代码分析、智能代码补全、快速修复建议等高级功能，大大提升了 Python 开发的效率和质量。而且它现在支持 jupyter notebook，界面更美观易用。&lt;/p&gt;
&lt;p&gt;大邓一直建议做数据分析的用户不要用其他编辑器，尽量使用jupyter notebook。 现在大家多了一个选项，即PyCharm中的jupyter notebook。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-jupyter_file_screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;二下载激活&#34;&gt;二、下载&amp;amp;激活&lt;/h2&gt;
&lt;h3 id=&#34;21-下载&#34;&gt;2.1 下载&lt;/h3&gt;
&lt;p&gt;打开 &lt;a href=&#34;https://www.jetbrains.com/pycharm/download/&#34;&gt;PyCharm官网 https://www.jetbrains.com/pycharm/download/&lt;/a&gt; ，点击Download下载。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-download.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;如果无法打开该网页， 可以直接网盘下载&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;链接: https://pan.baidu.com/s/11wSef6kjPge3YVK66C1yuA?pwd=ur2f 提取码: ur2f 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;22-安装&#34;&gt;2.2 安装&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/4.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/5.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/6.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/7.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/8.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/9-1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/9-2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;localhost

*.github.com,plugins.jetbrains.com

SFXUSA86FM-eyJsaWNlbnNlSWQiOiJTRlhVU0E4NkZNIiwibGljZW5zZWVOYW1lIjoi5pyd6Zm956eR5oqA5aSn5a24IiwibGljZW5zZWVUeXBlIjoiQ0xBU1NST09NIiwiYXNzaWduZWVOYW1lIjoiVGFvYmFv77ya5p6B5a6i5LiT5LqrICAtLS0g6LCo6Ziy55uX5Y2W77yBIiwiYXNzaWduZWVFbWFpbCI6IktyaXN0YW5fQmxvd2VAb3V0bG9vay5jb20iLCJsaWNlbnNlUmVzdHJpY3Rpb24iOiJGb3IgZWR1Y2F0aW9uYWwgdXNlIG9ubHkiLCJjaGVja0NvbmN1cnJlbnRVc2UiOmZhbHNlLCJwcm9kdWN0cyI6W3siY29kZSI6IkdPIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSUzAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRNIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJDTCIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUlNVIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSU0MiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUEMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRTIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSRCIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUkMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IlJTRiIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjp0cnVlfSx7ImNvZGUiOiJSTSIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiSUkiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRQTiIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiREIiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRDIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJQUyIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUlNWIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOnRydWV9LHsiY29kZSI6IldTIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJQU0kiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUENXTVAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUlMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1Z
X0seyJjb2RlIjoiRFAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUERCIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOnRydWV9XSwibWV0YWRhdGEiOiIwMTIwMjQwMjI2TFBBQTAwMzAwOCIsImhhc2giOiI1NDY4ODAyOS8yNTk5OTU2NTotMTQ5MzMwODg5NSIsImdyYWNlUGVyaW9kRGF5cyI6NywiYXV0b1Byb2xvbmdhdGVkIjpmYWxzZSwiaXNBdXRvUHJvbG9uZ2F0ZWQiOmZhbHNlLCJ0cmlhbCI6ZmFsc2UsImFpQWxsb3dlZCI6dHJ1ZX0=-JDVXZeZnNxn5sMQEXZ2TOZlrMOVI37CPE25JugHcDUdJPc75u4D+IEwoFl1GRB8GKrIhSwJa6OhgHpyXyMqLXtroe/p+qWo6kLi86iTuXpK+E4UQPQP9X9cZTxgupD4py7/Pps4qeuwiWIsbESoDDxRsuivhh1xka8lfJHoPDMwdV7DNjRFUUFpJrDr7KYp5zGRFU9hIUfh8YzZ0lQTAzboQyUwMoTRRiUOM5hs/2/RG6VA1gPaeqRaE6v0nphHTZ6By3Zvs5tj9qh6iW07jtXTxXk0MDzNrQpMh2MUvPB0dikKjDMxgUKFGEiDKvFilZJ+y0ErfdFekBn+mfInr0Q==-MIIETDCCAjSgAwIBAgIBDzANBgkqhkiG9w0BAQsFADAYMRYwFAYDVQQDDA1KZXRQcm9maWxlIENBMB4XDTIyMTAxMDE2MDU0NFoXDTI0MTAxMTE2MDU0NFowHzEdMBsGA1UEAwwUcHJvZDJ5LWZyb20tMjAyMjEwMTAwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC/W3uCpU5M2y48rUR/3fFR6y4xj1nOm3rIuGp2brELVGzdgK2BezjnDXpAxVDw5657hBkAUMoyByiDs2MgmVi9IcqdAwpk988/Daaajq9xuU1of59jH9eQ9c3BmsEtdA4boN3VpenYKATwmpKYkJKVc07ZKoXL6kSyZuF7Jq7HoQZcclChbF75QJPGbri3cw9vDk/e46kuzfwpGftvl6+vKibpInO6Dv0ocwImDbOutyZC7E+BwpEm1TJZW4XovMBegHhWC04cJvpH1u98xoR94ichw0jKhdppywARe43rGU96163RckIuFmFDQKZV9SMUrwpQFu4Z2D5yTNqnlLRfAgMBAAGjgZkwgZYwCQYDVR0TBAIwADAdBgNVHQ4EFgQU5FZqQ4gnVc+inIeZF+o3ID+VhcEwSAYDVR0jBEEwP4AUo562SGdCEjZBvW3gubSgUouX8bOhHKQaMBgxFjAUBgNVBAMMDUpldFByb2ZpbGUgQ0GCCQDSbLGDsoN54TATBgNVHSUEDDAKBggrBgEFBQcDATALBgNVHQ8EBAMCBaAwDQYJKoZIhvcNAQELBQADggIBANLG1anEKid4W87vQkqWaQTkRtFKJ2GFtBeMhvLhIyM6Cg3FdQnMZr0qr9mlV0w289pf/+M14J7S7SgsfwxMJvFbw9gZlwHvhBl24N349GuthshGO9P9eKmNPgyTJzTtw6FedXrrHV99nC7spaY84e+DqfHGYOzMJDrg8xHDYLLHk5Q2z5TlrztXMbtLhjPKrc2+ZajFFshgE5eowfkutSYxeX8uA5czFNT1ZxmDwX1KIelbqhh6XkMQFJui8v8Eo396/sN3RAQSfvBd7Syhch2vlaMP4FAB11AlMKO2x/1hoKiHBU3oU3OKRTfoUTfy1uH3T+t03k1Qkr0dqgHLxiv6QU5WrarR9tx/dapqbsSmrYapmJ7S5+ghc4FTWxXJB1cjJRh3X+gwJIHjOVW+5ZVqXTG2s2Jwi2daDt6XYeigxgL2SlQpeL5kvXNCcuSJurJVcRZFYUkzVv85XfDauqGxYqaehPcK2Tz
mcXOUWPfxQxLJd2TrqSiO+mseqqkNTb3ZDiYS/ZqdQoGYIUwJqXo+EDgqlmuWUhkWwCkyo4rtTZeAj+nP00v3n8JmXtO30Fip+lxpfsVR3tO1hk4Vi2kmVjXyRkW2G7D7WAVt+91ahFoSeRWlKyb4KcvGvwUaa43fWLem2hyI4di2pZdr3fcYJ3xvL5ejL3m14bKsfoOv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/10.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/11.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三使用&#34;&gt;三、使用&lt;/h2&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/12.png&#34; alt=&#34;&#34;  /&gt;
&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/13.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/14.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/15.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一pycharm">一、PyCharm</h2>
<p><a href="https://www.jetbrains.com/zh-cn/pycharm/">PyCharm</a> 是一款功能强大的 Python 集成开发环境（IDE），它提供代码分析、智能代码补全、快速修复建议等一系列高级功能，可显著提升Python开发的效率和质量。新版本还内置支持 Jupyter Notebook，界面更美观易用。</p>
<p>大邓一直建议做数据分析的用户尽量使用 Jupyter Notebook，而不是其他编辑器。现在大家多了一个选择：直接在 PyCharm 中使用 Jupyter Notebook。</p>
<p><img loading="lazy" src="img/01-jupyter_file_screen.png" alt=""  />
</p>
<br>
<h2 id="二下载激活">二、下载&amp;激活</h2>
<h3 id="21-下载">2.1 下载</h3>
<p>打开 <a href="https://www.jetbrains.com/pycharm/download/">PyCharm官网 https://www.jetbrains.com/pycharm/download/</a> ，点击Download下载。</p>
<p><img loading="lazy" src="img/02-download.png" alt=""  />
</p>
<br>
<p>如果无法打开该网页， 可以直接网盘下载</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">链接: https://pan.baidu.com/s/11wSef6kjPge3YVK66C1yuA?pwd=ur2f 提取码: ur2f 
</code></pre></div><p><br><br></p>
<h3 id="22-安装">2.2 安装</h3>
<p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/2.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/3.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/4.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/5.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/6.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/7.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/8.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/9-1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/9-2.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">localhost

*.github.com,plugins.jetbrains.com

SFXUSA86FM-eyJsaWNlbnNlSWQiOiJTRlhVU0E4NkZNIiwibGljZW5zZWVOYW1lIjoi5pyd6Zm956eR5oqA5aSn5a24IiwibGljZW5zZWVUeXBlIjoiQ0xBU1NST09NIiwiYXNzaWduZWVOYW1lIjoiVGFvYmFv77ya5p6B5a6i5LiT5LqrICAtLS0g6LCo6Ziy55uX5Y2W77yBIiwiYXNzaWduZWVFbWFpbCI6IktyaXN0YW5fQmxvd2VAb3V0bG9vay5jb20iLCJsaWNlbnNlUmVzdHJpY3Rpb24iOiJGb3IgZWR1Y2F0aW9uYWwgdXNlIG9ubHkiLCJjaGVja0NvbmN1cnJlbnRVc2UiOmZhbHNlLCJwcm9kdWN0cyI6W3siY29kZSI6IkdPIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSUzAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRNIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJDTCIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUlNVIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSU0MiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUEMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRTIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJSRCIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUkMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IlJTRiIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjp0cnVlfSx7ImNvZGUiOiJSTSIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiSUkiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRQTiIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiREIiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6ZmFsc2V9LHsiY29kZSI6IkRDIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJQUyIsInBhaWRVcFRvIjoiMjAyNS0wMi0xOSIsImV4dGVuZGVkIjpmYWxzZX0seyJjb2RlIjoiUlNWIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOnRydWV9LHsiY29kZSI6IldTIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOmZhbHNlfSx7ImNvZGUiOiJQU0kiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUENXTVAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUlMiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1Z
X0seyJjb2RlIjoiRFAiLCJwYWlkVXBUbyI6IjIwMjUtMDItMTkiLCJleHRlbmRlZCI6dHJ1ZX0seyJjb2RlIjoiUERCIiwicGFpZFVwVG8iOiIyMDI1LTAyLTE5IiwiZXh0ZW5kZWQiOnRydWV9XSwibWV0YWRhdGEiOiIwMTIwMjQwMjI2TFBBQTAwMzAwOCIsImhhc2giOiI1NDY4ODAyOS8yNTk5OTU2NTotMTQ5MzMwODg5NSIsImdyYWNlUGVyaW9kRGF5cyI6NywiYXV0b1Byb2xvbmdhdGVkIjpmYWxzZSwiaXNBdXRvUHJvbG9uZ2F0ZWQiOmZhbHNlLCJ0cmlhbCI6ZmFsc2UsImFpQWxsb3dlZCI6dHJ1ZX0=-JDVXZeZnNxn5sMQEXZ2TOZlrMOVI37CPE25JugHcDUdJPc75u4D+IEwoFl1GRB8GKrIhSwJa6OhgHpyXyMqLXtroe/p+qWo6kLi86iTuXpK+E4UQPQP9X9cZTxgupD4py7/Pps4qeuwiWIsbESoDDxRsuivhh1xka8lfJHoPDMwdV7DNjRFUUFpJrDr7KYp5zGRFU9hIUfh8YzZ0lQTAzboQyUwMoTRRiUOM5hs/2/RG6VA1gPaeqRaE6v0nphHTZ6By3Zvs5tj9qh6iW07jtXTxXk0MDzNrQpMh2MUvPB0dikKjDMxgUKFGEiDKvFilZJ+y0ErfdFekBn+mfInr0Q==-MIIETDCCAjSgAwIBAgIBDzANBgkqhkiG9w0BAQsFADAYMRYwFAYDVQQDDA1KZXRQcm9maWxlIENBMB4XDTIyMTAxMDE2MDU0NFoXDTI0MTAxMTE2MDU0NFowHzEdMBsGA1UEAwwUcHJvZDJ5LWZyb20tMjAyMjEwMTAwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC/W3uCpU5M2y48rUR/3fFR6y4xj1nOm3rIuGp2brELVGzdgK2BezjnDXpAxVDw5657hBkAUMoyByiDs2MgmVi9IcqdAwpk988/Daaajq9xuU1of59jH9eQ9c3BmsEtdA4boN3VpenYKATwmpKYkJKVc07ZKoXL6kSyZuF7Jq7HoQZcclChbF75QJPGbri3cw9vDk/e46kuzfwpGftvl6+vKibpInO6Dv0ocwImDbOutyZC7E+BwpEm1TJZW4XovMBegHhWC04cJvpH1u98xoR94ichw0jKhdppywARe43rGU96163RckIuFmFDQKZV9SMUrwpQFu4Z2D5yTNqnlLRfAgMBAAGjgZkwgZYwCQYDVR0TBAIwADAdBgNVHQ4EFgQU5FZqQ4gnVc+inIeZF+o3ID+VhcEwSAYDVR0jBEEwP4AUo562SGdCEjZBvW3gubSgUouX8bOhHKQaMBgxFjAUBgNVBAMMDUpldFByb2ZpbGUgQ0GCCQDSbLGDsoN54TATBgNVHSUEDDAKBggrBgEFBQcDATALBgNVHQ8EBAMCBaAwDQYJKoZIhvcNAQELBQADggIBANLG1anEKid4W87vQkqWaQTkRtFKJ2GFtBeMhvLhIyM6Cg3FdQnMZr0qr9mlV0w289pf/+M14J7S7SgsfwxMJvFbw9gZlwHvhBl24N349GuthshGO9P9eKmNPgyTJzTtw6FedXrrHV99nC7spaY84e+DqfHGYOzMJDrg8xHDYLLHk5Q2z5TlrztXMbtLhjPKrc2+ZajFFshgE5eowfkutSYxeX8uA5czFNT1ZxmDwX1KIelbqhh6XkMQFJui8v8Eo396/sN3RAQSfvBd7Syhch2vlaMP4FAB11AlMKO2x/1hoKiHBU3oU3OKRTfoUTfy1uH3T+t03k1Qkr0dqgHLxiv6QU5WrarR9tx/dapqbsSmrYapmJ7S5+ghc4FTWxXJB1cjJRh3X+gwJIHjOVW+5ZVqXTG2s2Jwi2daDt6XYeigxgL2SlQpeL5kvXNCcuSJurJVcRZFYUkzVv85XfDauqGxYqaehPcK2Tz
mcXOUWPfxQxLJd2TrqSiO+mseqqkNTb3ZDiYS/ZqdQoGYIUwJqXo+EDgqlmuWUhkWwCkyo4rtTZeAj+nP00v3n8JmXtO30Fip+lxpfsVR3tO1hk4Vi2kmVjXyRkW2G7D7WAVt+91ahFoSeRWlKyb4KcvGvwUaa43fWLem2hyI4di2pZdr3fcYJ3xvL5ejL3m14bKsfoOv
</code></pre></div><br>
<p><img loading="lazy" src="img/10.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/11.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三使用">三、使用</h2>
<p><img loading="lazy" src="img/12.png" alt=""  />
<br></p>
<p><img loading="lazy" src="img/13.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/14.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/15.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>博客新增留言交流功能</title>
      <link>https://textdata.cn/blog/2024-05-17-add-comment-with-github-discussion-and-giscus/</link>
      <pubDate>Fri, 17 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-05-17-add-comment-with-github-discussion-and-giscus/</guid>
      <description>&lt;p&gt;之前博客 &lt;a href=&#34;https://textdata.cn/&#34;&gt;https://textdata.cn/&lt;/a&gt; 只能留言， 但不能互评， 也不能追评。 现在评论系统改为 giscus ， 支持互评、追评， 在这里说不定还能开帖子 discussion 进行交友^_^。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-homepage.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;单个推文内的留言区&#34;&gt;单个推文内的留言区&lt;/h2&gt;
&lt;p&gt;任意一篇推文底部有评论区，可以对推文进行留言， 留言者也可与其他人进行追评。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-postdiscus.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;整个博客的留言区&#34;&gt;整个博客的留言区&lt;/h2&gt;
&lt;p&gt;还可以查看整个博客内所有的留言，这有点小社区论坛的意思^_^&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-discus-ban.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>之前博客 <a href="https://textdata.cn/">https://textdata.cn/</a> 只能留言， 但不能互评， 也不能追评。 现在评论系统改为 giscus ， 支持互评、追评， 在这里说不定还能开帖子 discussion 进行交友^_^。</p>
<p><img loading="lazy" src="img/01-homepage.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="单个推文内的留言区">单个推文内的留言区</h2>
<p>任意一篇推文底部有评论区，可以对推文进行留言， 留言者也可与其他人进行追评。</p>
<p><img loading="lazy" src="img/02-postdiscus.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="整个博客的留言区">整个博客的留言区</h2>
<p>还可以查看整个博客内所有的留言，这有点小社区论坛的意思^_^</p>
<p><img loading="lazy" src="img/03-discus-ban.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>cntext2.x | 新增读取pdf/docx| 提取MD&amp;A | 文本可视化等功能</title>
      <link>https://textdata.cn/blog/2024-05-14-add-readpdf-readdocx-lexical-dispersion-plot/</link>
      <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-05-14-add-readpdf-readdocx-lexical-dispersion-plot/</guid>
      <description>&lt;h2 id=&#34;一cntext&#34;&gt;一、cntext&lt;/h2&gt;
&lt;h3 id=&#34;11-新增函数&#34;&gt;1.1 新增函数&lt;/h3&gt;
&lt;p&gt;cntext2.1.2新增函数有&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;get_cntext_path()&lt;/strong&gt;&lt;/em&gt;  查看cntext2.x的安装路径&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;read_pdf()/read_docx()&lt;/strong&gt;&lt;/em&gt;  读取 pdf、docx文件&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;extract_mda()&lt;/strong&gt;&lt;/em&gt; 提取中文年报文本中的管理层讨论与分析&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;matplotlib_chinese()&lt;/strong&gt;&lt;/em&gt; 支持matplotlib显示中文&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;lexical_dispersion_plot1()&lt;/strong&gt;&lt;/em&gt; 词汇分散图&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;lexical_dispersion_plot2()&lt;/strong&gt;&lt;/em&gt; 词汇分散图&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;已购买cntext2.x的用户，可私信找到大邓获取最新版本安装包！&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;12-安装&#34;&gt;1.2 安装&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install distinctiveness
pip3 install cntext --upgrade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;二实验&#34;&gt;二、实验&lt;/h2&gt;
&lt;h3 id=&#34;21-get_cntext_path&#34;&gt;2.1 get_cntext_path()&lt;/h3&gt;
&lt;p&gt;如果你熟悉 Python，想对 cntext 的源码进行修改， 可以使用该函数找到 cntext 的安装路径。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_cntext_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/cntext
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;blockquote&gt;
&lt;p&gt;不同电脑返回的位置是不同的，以上路径是大邓Mac中cntext2.x的安装路径&lt;/p&gt;
&lt;/blockquote&gt;
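&lt;p&gt;get_cntext_path() 的效果与下面这种通用做法类似（假设性示意代码，并非 cntext 的实际实现）：任何已安装的包都可以通过其 __file__ 属性定位安装目录。这里以标准库 json 包为例演示。&lt;/p&gt;

```python
import os
import json  # 以标准库 json 包为例，演示通用做法（示意，非 cntext 实际实现）

# 通过包的 __file__ 属性（指向包内 __init__.py）获取其所在目录
def package_path_sketch(package):
    return os.path.dirname(package.__file__)

print(package_path_sketch(json))
```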
&lt;br&gt;
&lt;h3 id=&#34;22-read_docx&#34;&gt;2.2 read_docx()&lt;/h3&gt;
&lt;p&gt;读取 docx 文件。自己动手创建一个 test.docx，在文件内写一个句子，测一测。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_docx&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;test.docx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;这是来自docx文件里的内容
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-read_pdf&#34;&gt;2.3 read_pdf()&lt;/h3&gt;
&lt;p&gt;读取 pdf文件&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取格力电器2023会计年度的年报文件&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;格力电器2023.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 5.5 s, sys: 48.9 ms, total: 5.55 s
Wall time: 5.55 s

\n珠海格力电器股份有限公司 2023年年度报告全文  \n珠海格力电器股份有限公司  \n2023年年度报告  \n \n \n二〇二四年四月 \n珠海格力电器股份有限公司 2023年年度报告全文  \n 第 2 页 共 249 页 第一节 重要提示、目录和释义  \n公司董事会、监事会及董事、监事、高级管理人员保证年度报告内容\n的真实、准确、完整，不存在虚假记载、误导性陈述或重大遗漏，并承担\n个别和连带的法律责任。  \n公司负责人董明珠、主管会计工作负责人廖建雄及会计机构负责人\n（会计主管人员）刘炎姿声明：保证本年度报告中财务报告的真实、准确、\n完整。 \n所有董事均已出席了审议本报告的董事会会议。  \n本报告中所涉及的未来计划、发展战略等前瞻性陈述，不构成公司对\n投资者的实质承诺，投资者及相关人士均应当对此保持足够的风险认识，\n并且应当理解计划、预测与承诺之间的差异，敬请注意投资风险，理性投\n资。 \n公司经本次董事会审议通过的利润分配预案为：拟以本利润分配预案\n披露时享有利润分配权的股本总额  5,521,943,646 股（总股本\n5,631,405,741 股扣除公司回购账户持有的股份 109,462,095 股）为基数，\n向全体股东每 10股派发现金红利 23.80元（含税），送红股 0股（含\n税），不以公积金转增股本。  \n   \n珠海格力电器股份有限公司 2023年年度报告全文  \n 第 3 页 共 249 页 目录 \n第一节 重要提示、目录和释义  ................................ ..........................  2 \n第二节 公司简介和主要财务指标  ................................ ........................  6 \n第三节 管理层讨论与分析  ................................ ...............................  10 \n第四节 公司治理  ................................ ................................ ........  42 \n第五节 环境和社会责任  ................................ ..

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-extract_mda&#34;&gt;2.4 extract_mda()&lt;/h3&gt;
&lt;p&gt;提取A股年报中的MD&amp;amp;A文本内容。如果返回空字符串&#39;&#39;，则说明提取失败。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ct.extract_mda(text, kws_pattern=&amp;#39;&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;text 中国A股年报原始文本&lt;/li&gt;
&lt;li&gt;kws_pattern 用于识别管理层讨论与分析章节标题的关键词正则模板。cntext内置的kws_pattern内容如下&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;kws_pattern = &amp;#39;董事会报告|董事会报告与管理讨论|企业运营与管理评述|经营总结与分析|管理层评估与未来展望|董事局报告|管理层讨论与分析|经营情况讨论与分析|经营业绩分析|业务回顾与展望|公司经营分析|管理层评论与分析|执行摘要与业务回顾|业务运营分析&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;基本上，对2015年之后的年报，识别命中率在90%以上。&lt;/p&gt;
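&lt;p&gt;下面是一个假设性的示意代码（并非 cntext 的实际实现），演示 kws_pattern 这类关键词正则如何定位章节：先找出标题关键词的所有命中位置，由于目录中也会出现标题，取最后一次命中作为章节起点，再向后匹配下一章节标题作为终点。&lt;/p&gt;

```python
import re

# 示意（非 cntext 实际实现）：用关键词正则定位"管理层讨论与分析"章节
kws_pattern = '董事会报告|管理层讨论与分析|经营情况讨论与分析'
end_pattern = '公司治理'  # 下一章节的标题关键词，此处为假设的示例

def extract_mda_sketch(text):
    starts = [m.start() for m in re.finditer(kws_pattern, text)]
    if not starts:
        return ''          # 未命中任何标题关键词，提取失败
    start = starts[-1]     # 目录中也会命中标题，取最后一次命中作为章节起点
    m = re.search(end_pattern, text[start + 1:])
    end = start + 1 + m.start() if m else len(text)
    return text[start:end]

demo = '目录 管理层讨论与分析 10 管理层讨论与分析 一、报告期内行业情况 公司治理 略'
print(extract_mda_sketch(demo))
```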
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取格力电器2023会计年度的年报文件&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;格力电器2023.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#提取mda&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;extract_mda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;mda_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;管理层讨论与分析  \n一、报告期内公司所处行业情况  \n（一）行业发展现状  \n1.消费领域 ——家电行业稳定增长，空调市场恢复明显  \n2023年，中国经济保持了整体恢复向好的态势，激发消费是稳增长的重中之重。国家鼓励和推动消费品以旧换\n新，促进消费经济大循环，加速更新需求释放，推动高能效产品设备销售和出口增长，进一步激发绿色消费潜力。  \n1）家电行业稳定增长  \n2023年，国内经济恢复明显，家电行业稳定增长。根据全国家用电器工业信息中心发布的《 2023年中国家电\n行业年度报告》，家电行业外销明显增长，出口规模为 6,174亿元，同比增长 9.9%；国内市场实现稳步增长，销售\n规模为7,736亿元，同比增长 1.7%。 \n2）空调市场规模实现较好恢复  \n2023年，空调市场恢复明显。根据奥维云网（ AVC）零售推总数据， 2023年空调市场实现零售额 2,117亿元，\n同比增长 7.5%，零售量 6,085万台，同比增长 6.5%。根据产业在线数据， 2023年，家用空调生产 16,869.2 万台，\n同比增长 11.1%，销售17,044.0 万台，同比增长 11.2%，其中内销出货 9,959.7万台，同比增长 13.8%，出口出货\n7,084.3万台，同比增长 7.8%，内外销实现双增长。  \n2.工业领域 ——工业经济稳中向上态势  \n根据工信部数据， 2023年，我国规模以上工业增加值同比增长 4.6%，同比提升 1个百分点，其中制造业规模\n以上工业增加值同比增长 5.0%。 \n智能制造产业规模日益增长。从《中国制造 2025》再到《“十四五”智能制造发展规划》，均以发展先进智能\n制造业为核心目标，布局规划制造强国的推进路径。我国已 初步形成以自动化生产线、智能检测与装配装备、智能\n控制系统、工业机器人等为代表的智能制造产业体系，产业规模日益增长。中商产业研究院预计， 2023年我国智能\n制造装备市场规模将超过 2.97万亿元。前瞻产业研究院预测，到 2027年，我国智能制造行业市场规模将达到 6.6\n万亿元，其中智能制造装备市场规模约 5.4万亿元，智能制造系统解决方案市场规模约 1.2万亿元。 2023年，国内\n加快推动传统产业技术改造升级，加大智能制造推广力度，组建成  62家“灯塔工厂”，占全球“灯塔工厂”总数\n的40%，培育了 421家国家级智能制造示范 工厂，万余家省级数字化车间和智能工厂。  \n空调核心零部件产业规模增长明显。根据产业在线数据， 2023年，空调转子压缩机市场高速发展，全年产量达\n到2.61亿台，同比增长 12.2%；全年销售量达到 2.62亿台，成为行业新巅峰。内销市场，转子压缩机表现出色，\n全年保持正向增长，预计内销为 2.27亿台，同比增长 14.3%；外销市场，全年预计出口 3,564.7万台，同比增长\n2.1%。受益于 2023年下游空调市场销售规模的增长，空调电机行业产销规模同步提升，达到 4.22亿台，同比增长\n6.8%；内销市场出货约为 3.5亿台，同 比增长8.4%；出口市场出货约为 0.7亿台，同比持平。压缩机和电机产业规\n模的增长，为整个空调行业的发展提供了有力支持。  \n
.......
.......
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;25-matplotlib_chinese&#34;&gt;2.5 matplotlib_chinese()&lt;/h3&gt;
&lt;p&gt;matplotlib 默认不支持显示中文， cntext 新增该函数，可以解决图表中文显示问题。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;matplotlib_chinese&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;9&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;中文图表&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/27-chinese-matplotlib.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;26-lexical_dispersion_plot1&#34;&gt;2.6 lexical_dispersion_plot1()&lt;/h3&gt;
&lt;p&gt;词汇分散图可视化：对于某一文本 text， 可视化不同目标类别词 targets_dict 在文本中的出现位置。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;targets_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;特定词汇在不同文本来源的相对离散图&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt;: 文本数据&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;targets_dict&lt;/strong&gt;&lt;/em&gt;:  目标类别词字典； targets_dict={&amp;lsquo;pos&amp;rsquo;: [&amp;lsquo;开心&amp;rsquo;, &amp;lsquo;快乐&amp;rsquo;], &amp;lsquo;neg&amp;rsquo;: [&amp;lsquo;悲伤&amp;rsquo;, &amp;lsquo;难过&amp;rsquo;]}&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;lang&lt;/strong&gt;&lt;/em&gt;: 文本数据text的语言类型，默认&amp;lsquo;chinese&amp;rsquo;。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;figsize&lt;/strong&gt;&lt;/em&gt;: 图的长宽尺寸. 默认 (8, 5).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;title&lt;/strong&gt;&lt;/em&gt; : 图的标题；&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;prop&lt;/strong&gt;&lt;/em&gt;: 横坐标字符位置是否为相对位置. 默认True，横坐标索引值取值范围0 ~ 100&lt;/li&gt;
&lt;/ul&gt;
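&lt;p&gt;prop=True 时横坐标的相对位置可以这样理解（假设性示意代码，并非 cntext 的实际实现）：把每个目标词在文本中的字符索引除以文本总长度，换算为 0~100 的相对坐标。&lt;/p&gt;

```python
# 示意（非 cntext 实际实现）：prop=True 时，目标词出现位置
# 按「字符索引 / 文本总长 * 100」换算为 0~100 的相对坐标
def relative_positions_sketch(text, targets_dict):
    n = len(text)
    result = {}
    for label, words in targets_dict.items():
        positions = []
        for word in words:
            i = text.find(word)
            while i != -1:
                positions.append(round(i / n * 100, 2))
                i = text.find(word, i + 1)
        result[label] = sorted(positions)
    return result

demo = '开心的一天，后来有些难过，最后又很开心。'
print(relative_positions_sketch(demo, {'pos': ['开心'], 'neg': ['难过']}))
```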
&lt;br&gt;
&lt;p&gt;点击下载 &lt;a href=&#34;https://textdata.cn/data/%E4%B8%89%E4%BD%93.txt&#34;&gt;&lt;strong&gt;三体.txt&lt;/strong&gt;&lt;/a&gt;、&lt;a href=&#34;https://textdata.cn/data/%E5%9F%BA%E5%9C%B0.txt&#34;&gt;&lt;strong&gt;基地.txt&lt;/strong&gt;&lt;/a&gt;两本小说文件。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;roles_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;汪淼&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;汪淼&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;叶文洁&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;叶文洁&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;罗辑&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;罗辑&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;santi_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;santi_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#文本数据&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;targets_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;roles_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#角色&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#尺寸大小&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#中文数据&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;《三体》小说角色出现位置&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#标题&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;#相对位置(横坐标轴取值范围0-100)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/23-lexical_dispersion_plot1-relative.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;santi_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#文本数据&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;targets_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;roles_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#角色&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#尺寸大小&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#中文数据&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;《三体》小说角色出现位置&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#标题&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;#绝对位置(横坐标轴取值范围与小说文本长度有关)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/24-lexical_dispersion_plot1-absolute.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#diy了一个小词典&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;senti_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;pos&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;开心&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;幸福&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;快乐&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;安宁&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;希望&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;neg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;紧张&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;恐惧&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;害怕&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;绝望&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;santi_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;santi_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;targets_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;senti_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;《三体》情绪词出现位置&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/25-santi_sentiment.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;27--lexical_dispersion_plot2&#34;&gt;2.7  lexical_dispersion_plot2()&lt;/h3&gt;
&lt;p&gt;词汇分散图可视化：对多个文本 texts_dict，可视化某些目标词 targets 在各文本中出现的相对位置(0~100)。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;texts_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;targets&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;特定词汇在不同文本来源的相对离散图&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;texts_dict&lt;/strong&gt;&lt;/em&gt;: 多个文本的字典数据。形如{&amp;lsquo;source1&amp;rsquo;: &amp;lsquo;source1的文本内容&amp;rsquo;, &amp;lsquo;source2&amp;rsquo;: &amp;lsquo;source2的文本内容&amp;rsquo;}&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;targets&lt;/strong&gt;&lt;/em&gt;: 目标词列表&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;lang&lt;/strong&gt;&lt;/em&gt;: 文本数据 texts_dict 的语言类型，默认 &amp;lsquo;chinese&amp;rsquo;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;figsize&lt;/strong&gt;&lt;/em&gt;: 图的长宽尺寸，默认 (12, 6)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;title&lt;/strong&gt;&lt;/em&gt; : 图的标题；&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;targets&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;太空&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;宇宙&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;texts_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三体.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt;
              &lt;span class=&#34;s1&#34;&gt;&amp;#39;基地&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;基地.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lexical_dispersion_plot2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;texts_dict&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;texts_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;targets&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;targets&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
                            &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#34;太空/宇宙&amp;#34;词语出现位置&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                            &lt;span class=&#34;n&#34;&gt;lang&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;chinese&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/26-santi_base.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一cntext">一、cntext</h2>
<h3 id="11-新增函数">1.1 新增函数</h3>
<p>cntext 2.1.2 新增的函数有</p>
<ul>
<li><em><strong>get_cntext_path()</strong></em>  查看cntext2.x的安装路径</li>
<li><em><strong>read_pdf()/read_docx()</strong></em>  读取 pdf、docx文件</li>
<li><em><strong>extract_mda()</strong></em> 提取中文年报文本中的管理层讨论与分析</li>
<li><em><strong>matplotlib_chinese()</strong></em> 支持matplotlib显示中文</li>
<li><em><strong>lexical_dispersion_plot1()</strong></em> 词汇分散图</li>
<li><em><strong>lexical_dispersion_plot2()</strong></em> 词汇分散图</li>
</ul>
<p>已购买cntext2.x的用户，可私信找到大邓获取最新版本安装包！</p>
<p><br><br></p>
<h3 id="12-安装">1.2 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install distinctiveness
pip3 install cntext --upgrade
</code></pre></div><br>
<h2 id="二实验">二、实验</h2>
<h3 id="21-get_cntext_path">2.1 get_cntext_path()</h3>
<p>如果你熟悉 Python，想对 cntext 内部代码进行修改，可以使用该函数找到 cntext 的安装路径。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">ct</span><span class="o">.</span><span class="n">get_cntext_path</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/cntext
</code></pre></div><blockquote>
<p>不同电脑返回的位置是不同的，以上路径是大邓Mac中cntext2.x的安装路径</p>
</blockquote>
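<p>顺带一提，借助标准库 importlib 也可以查看任意已安装包的文件位置，效果与 get_cntext_path() 类似(以下以标准库 json 为例，仅作演示)：</p>

```python
import importlib.util

# 查找任意已安装包/模块的文件位置(以标准库 json 为例)
spec = importlib.util.find_spec('json')
print(spec.origin)  # 形如 .../lib/python3.x/json/__init__.py
```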
<br>
<h3 id="22-read_docx">2.2 read_docx()</h3>
<p>读取 docx文件。 自己diy一个 test.docx , 在文件内写一个句子，测一测</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_docx</span><span class="p">(</span><span class="s1">&#39;test.docx&#39;</span><span class="p">)</span>

<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">这是来自docx文件里的内容
</code></pre></div><br>
<h3 id="23-read_pdf">2.3 read_pdf()</h3>
<p>读取 pdf文件</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#读取格力电器2023会计年度的年报文件</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;格力电器2023.pdf&#39;</span><span class="p">)</span>

<span class="n">text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 5.5 s, sys: 48.9 ms, total: 5.55 s
Wall time: 5.55 s

\n珠海格力电器股份有限公司 2023年年度报告全文  \n珠海格力电器股份有限公司  \n2023年年度报告  \n \n \n二〇二四年四月 \n珠海格力电器股份有限公司 2023年年度报告全文  \n 第 2 页 共 249 页 第一节 重要提示、目录和释义  \n公司董事会、监事会及董事、监事、高级管理人员保证年度报告内容\n的真实、准确、完整，不存在虚假记载、误导性陈述或重大遗漏，并承担\n个别和连带的法律责任。  \n公司负责人董明珠、主管会计工作负责人廖建雄及会计机构负责人\n（会计主管人员）刘炎姿声明：保证本年度报告中财务报告的真实、准确、\n完整。 \n所有董事均已出席了审议本报告的董事会会议。  \n本报告中所涉及的未来计划、发展战略等前瞻性陈述，不构成公司对\n投资者的实质承诺，投资者及相关人士均应当对此保持足够的风险认识，\n并且应当理解计划、预测与承诺之间的差异，敬请注意投资风险，理性投\n资。 \n公司经本次董事会审议通过的利润分配预案为：拟以本利润分配预案\n披露时享有利润分配权的股本总额  5,521,943,646 股（总股本\n5,631,405,741 股扣除公司回购账户持有的股份 109,462,095 股）为基数，\n向全体股东每 10股派发现金红利 23.80元（含税），送红股 0股（含\n税），不以公积金转增股本。  \n   \n珠海格力电器股份有限公司 2023年年度报告全文  \n 第 3 页 共 249 页 目录 \n第一节 重要提示、目录和释义  ................................ ..........................  2 \n第二节 公司简介和主要财务指标  ................................ ........................  6 \n第三节 管理层讨论与分析  ................................ ...............................  10 \n第四节 公司治理  ................................ ................................ ........  42 \n第五节 环境和社会责任  ................................ ..

</code></pre></div><br>
<h3 id="24-extract_mda">2.4 extract_mda()</h3>
<p>提取A股年报中的MD&amp;A文本内容。如果返回空字符串 &#39;&#39;，则说明提取失败。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.extract_mda(text, kws_pattern=&#39;&#39;)
</code></pre></div><ul>
<li>text 中国A股年报原始文本</li>
<li>kws_pattern 管理层讨论与分析章节识别关键词的模板。cntext内置的kws_pattern内容如下</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">kws_pattern = &#39;董事会报告|董事会报告与管理讨论|企业运营与管理评述|经营总结与分析|管理层评估与未来展望|董事局报告|管理层讨论与分析|经营情况讨论与分析|经营业绩分析|业务回顾与展望|公司经营分析|管理层评论与分析|执行摘要与业务回顾|业务运营分析&#39;
</code></pre></div><br>
<p>基本上，2015 年之后的年报，识别命中率在 90% 以上。</p>
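<p>上面的 kws_pattern 本质上是一个正则表达式的"或"模式。下面用标准库 re 演示其匹配原理(简化示意，cntext 实际提取时还需定位章节的起止位置)：</p>

```python
import re

# cntext 内置模板的简化版：用 "|" 连接多个可能的章节标题写法
kws_pattern = '董事会报告|董事局报告|管理层讨论与分析|经营情况讨论与分析'

text = '第二节 公司简介…… 第三节 管理层讨论与分析 一、报告期内公司所处行业情况……'

m = re.search(kws_pattern, text)
print(m.group())  # 管理层讨论与分析
```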
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#读取格力电器2023会计年度的年报文件</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_pdf</span><span class="p">(</span><span class="s1">&#39;格力电器2023.pdf&#39;</span><span class="p">)</span>

<span class="c1">#提取mda</span>
<span class="n">mda_text</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">extract_mda</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

<span class="n">mda_text</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">管理层讨论与分析  \n一、报告期内公司所处行业情况  \n（一）行业发展现状  \n1.消费领域 ——家电行业稳定增长，空调市场恢复明显  \n2023年，中国经济保持了整体恢复向好的态势，激发消费是稳增长的重中之重。国家鼓励和推动消费品以旧换\n新，促进消费经济大循环，加速更新需求释放，推动高能效产品设备销售和出口增长，进一步激发绿色消费潜力。  \n1）家电行业稳定增长  \n2023年，国内经济恢复明显，家电行业稳定增长。根据全国家用电器工业信息中心发布的《 2023年中国家电\n行业年度报告》，家电行业外销明显增长，出口规模为 6,174亿元，同比增长 9.9%；国内市场实现稳步增长，销售\n规模为7,736亿元，同比增长 1.7%。 \n2）空调市场规模实现较好恢复  \n2023年，空调市场恢复明显。根据奥维云网（ AVC）零售推总数据， 2023年空调市场实现零售额 2,117亿元，\n同比增长 7.5%，零售量 6,085万台，同比增长 6.5%。根据产业在线数据， 2023年，家用空调生产 16,869.2 万台，\n同比增长 11.1%，销售17,044.0 万台，同比增长 11.2%，其中内销出货 9,959.7万台，同比增长 13.8%，出口出货\n7,084.3万台，同比增长 7.8%，内外销实现双增长。  \n2.工业领域 ——工业经济稳中向上态势  \n根据工信部数据， 2023年，我国规模以上工业增加值同比增长 4.6%，同比提升 1个百分点，其中制造业规模\n以上工业增加值同比增长 5.0%。 \n智能制造产业规模日益增长。从《中国制造 2025》再到《“十四五”智能制造发展规划》，均以发展先进智能\n制造业为核心目标，布局规划制造强国的推进路径。我国已 初步形成以自动化生产线、智能检测与装配装备、智能\n控制系统、工业机器人等为代表的智能制造产业体系，产业规模日益增长。中商产业研究院预计， 2023年我国智能\n制造装备市场规模将超过 2.97万亿元。前瞻产业研究院预测，到 2027年，我国智能制造行业市场规模将达到 6.6\n万亿元，其中智能制造装备市场规模约 5.4万亿元，智能制造系统解决方案市场规模约 1.2万亿元。 2023年，国内\n加快推动传统产业技术改造升级，加大智能制造推广力度，组建成  62家“灯塔工厂”，占全球“灯塔工厂”总数\n的40%，培育了 421家国家级智能制造示范 工厂，万余家省级数字化车间和智能工厂。  \n空调核心零部件产业规模增长明显。根据产业在线数据， 2023年，空调转子压缩机市场高速发展，全年产量达\n到2.61亿台，同比增长 12.2%；全年销售量达到 2.62亿台，成为行业新巅峰。内销市场，转子压缩机表现出色，\n全年保持正向增长，预计内销为 2.27亿台，同比增长 14.3%；外销市场，全年预计出口 3,564.7万台，同比增长\n2.1%。受益于 2023年下游空调市场销售规模的增长，空调电机行业产销规模同步提升，达到 4.22亿台，同比增长\n6.8%；内销市场出货约为 3.5亿台，同 比增长8.4%；出口市场出货约为 0.7亿台，同比持平。压缩机和电机产业规\n模的增长，为整个空调行业的发展提供了有力支持。  \n
.......
.......
</code></pre></div><br>
<h3 id="25-matplotlib_chinese">2.5 matplotlib_chinese()</h3>
<p>matplotlib 默认不支持中文显示。cntext 新增该函数，用于解决图表中文乱码问题。</p>
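<p>该函数的效果大致相当于手动为 matplotlib 指定中文字体(以下写法仅为原理示意，字体名因操作系统而异，cntext 内部实现可能不同)：</p>

```python
import matplotlib
matplotlib.use('Agg')  # 非交互环境下也能运行
import matplotlib.pyplot as plt

# 手动指定若干支持中文的候选字体(字体名因系统而异，仅为示例)
plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS', 'Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False  # 避免坐标轴负号显示为方块
```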
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">plt</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">matplotlib_chinese</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;中文图表&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/27-chinese-matplotlib.png" alt=""  />
</p>
<br>
<h3 id="26-lexical_dispersion_plot1">2.6 lexical_dispersion_plot1()</h3>
<p>词汇分散图可视化：对某一个文本 text，可视化不同目标类别词 targets_dict 在文本中的出现位置。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">targets_dict</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s1">&#39;特定词汇在不同文本来源的相对离散图&#39;</span><span class="p">,</span> <span class="n">prop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div><ul>
<li><em><strong>text</strong></em>: 文本数据</li>
<li><em><strong>targets_dict</strong></em>: 目标类别词字典，形如 targets_dict={&lsquo;pos&rsquo;: [&lsquo;开心&rsquo;, &lsquo;快乐&rsquo;], &lsquo;neg&rsquo;: [&lsquo;悲伤&rsquo;, &lsquo;难过&rsquo;]}</li>
<li><em><strong>lang</strong></em>: 文本数据 text 的语言类型，默认 &lsquo;chinese&rsquo;</li>
<li><em><strong>figsize</strong></em>: 图的长宽尺寸，默认 (12, 6)</li>
<li><em><strong>title</strong></em>: 图的标题</li>
<li><em><strong>prop</strong></em>: 横坐标是否使用相对位置，默认 True，此时横坐标取值范围为 0~100</li>
</ul>
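<p>prop=True 时相对位置的计算思路，可以用几行代码示意(仅为原理演示，并非 cntext 源码)：</p>

```python
# 相对位置 = 词在词序列中的索引 / 总词数 * 100
def relative_positions(words, targets):
    n = len(words)
    return {t: [round(i / n * 100, 1) for i, w in enumerate(words) if w == t]
            for t in targets}

words = ['罗辑', '说', '叶文洁', '在', '罗辑', '看来']
print(relative_positions(words, ['罗辑', '叶文洁']))
# {'罗辑': [0.0, 66.7], '叶文洁': [33.3]}
```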
<br>
<p>点击下载 <a href="https://textdata.cn/data/%E4%B8%89%E4%BD%93.txt"><strong>三体.txt</strong></a>、<a href="https://textdata.cn/data/%E5%9F%BA%E5%9C%B0.txt"><strong>基地.txt</strong></a>两本小说文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">roles_dict</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">&#34;汪淼&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;汪淼&#39;</span><span class="p">],</span>
    <span class="s2">&#34;叶文洁&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;叶文洁&#39;</span><span class="p">],</span>
    <span class="s2">&#34;罗辑&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;罗辑&#39;</span><span class="p">]</span>
<span class="p">}</span>

<span class="n">santi_text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span>  <span class="c1">#文本数据</span>
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">roles_dict</span><span class="p">,</span> <span class="c1">#角色</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>  <span class="c1">#尺寸大小</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>  <span class="c1">#中文数据</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》小说角色出现位置&#39;</span><span class="p">,</span> <span class="c1">#标题</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>    <span class="c1">#相对位置(横坐标轴取值范围0-100)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/23-lexical_dispersion_plot1-relative.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span>  <span class="c1">#文本数据</span>
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">roles_dict</span><span class="p">,</span> <span class="c1">#角色</span>
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>  <span class="c1">#尺寸大小</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span>  <span class="c1">#中文数据</span>
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》小说角色出现位置&#39;</span><span class="p">,</span> <span class="c1">#标题</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">False</span><span class="p">)</span>    <span class="c1">#绝对位置(横坐标轴取值范围与小说文本长度有关)</span>
</code></pre></div><p><img loading="lazy" src="img/24-lexical_dispersion_plot1-absolute.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#diy了一个小词典</span>
<span class="n">senti_dict</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;pos&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;开心&#39;</span><span class="p">,</span> <span class="s1">&#39;幸福&#39;</span><span class="p">,</span> <span class="s1">&#39;快乐&#39;</span><span class="p">,</span> <span class="s1">&#39;安宁&#39;</span><span class="p">,</span> <span class="s1">&#39;希望&#39;</span><span class="p">],</span>
    <span class="s1">&#39;neg&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;紧张&#39;</span><span class="p">,</span> <span class="s1">&#39;恐惧&#39;</span><span class="p">,</span> <span class="s1">&#39;害怕&#39;</span><span class="p">,</span> <span class="s1">&#39;绝望&#39;</span><span class="p">]</span>
<span class="p">}</span>

<span class="n">santi_text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot1</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">santi_text</span><span class="p">,</span> 
                            <span class="n">targets_dict</span> <span class="o">=</span> <span class="n">senti_dict</span><span class="p">,</span> 
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> 
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">,</span> 
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;《三体》情绪词出现位置&#39;</span><span class="p">,</span>
                            <span class="n">prop</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/25-santi_sentiment.png" alt=""  />
</p>
<br>
<h3 id="27--lexical_dispersion_plot2">2.7  lexical_dispersion_plot2()</h3>
<p>词汇分散图可视化：对多个文本 texts_dict，可视化某些目标词 targets 在各文本中出现的相对位置(0~100)。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot2</span><span class="p">(</span><span class="n">texts_dict</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s1">&#39;特定词汇在不同文本来源的相对离散图&#39;</span><span class="p">)</span>
</code></pre></div><ul>
<li><em><strong>texts_dict</strong></em>: 多个文本的字典数据。形如{&lsquo;source1&rsquo;: &lsquo;source1的文本内容&rsquo;, &lsquo;source2&rsquo;: &lsquo;source2的文本内容&rsquo;}</li>
<li><em><strong>targets</strong></em>: 目标词列表</li>
<li><em><strong>lang</strong></em>: 文本数据 texts_dict 的语言类型，默认 &lsquo;chinese&rsquo;</li>
<li><em><strong>figsize</strong></em>: 图的长宽尺寸，默认 (12, 6)</li>
<li><em><strong>title</strong></em>: 图的标题</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;太空&#39;</span><span class="p">,</span> <span class="s1">&#39;宇宙&#39;</span><span class="p">]</span>

<span class="n">texts_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;三体&#39;</span><span class="p">:</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;三体.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">(),</span>
              <span class="s1">&#39;基地&#39;</span><span class="p">:</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;基地.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()}</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">lexical_dispersion_plot2</span><span class="p">(</span><span class="n">texts_dict</span> <span class="o">=</span> <span class="n">texts_dict</span><span class="p">,</span>
                            <span class="n">targets</span> <span class="o">=</span> <span class="n">targets</span><span class="p">,</span> 
                            <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> 
                            <span class="n">title</span> <span class="o">=</span> <span class="s1">&#39;&#34;太空/宇宙&#34;词语出现位置&#39;</span><span class="p">,</span>
                            <span class="n">lang</span> <span class="o">=</span> <span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
<span class="n">ax</span>
</code></pre></div><p><img loading="lazy" src="img/26-santi_base.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用地方gov工作报告生成某类概念词词频「面板数据」</title>
      <link>https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/</link>
      <pubDate>Sat, 11 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/</guid>
      <description>使用31省市的2002-2024年的省级政府工作报告，绘制出不同类别关键词的趋势图。直接上最终效果图</description>
      <content:encoded><![CDATA[<p>使用31省市的2002-2024年的省级政府工作报告，绘制出不同类别关键词的趋势图。直接上最终效果图：</p>
<p><img loading="lazy" src="img/12-tri-agri.png" alt=""  />
</p>
<p><img loading="lazy" src="img/13-inovation-plot.png" alt=""  />
</p>
<p><img loading="lazy" src="img/14-enviroment-plot.png" alt=""  />
</p>
<br>
<p>其实绘制三种图的数据是面板型数据，今天主要分享如何利用省级政府工作报告构建某类概念词频(创新、环保、三农)的面板数据，并绘制8省市概念词频折线图。 大家可以根据自己的研究需要更改代码， 生成自己概念的词频面板数据。</p>
<br>
<br>
<h2 id="获取数据">获取数据</h2>
<p><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">数据集(付费) | 国、省、市三级政府工作报告文本</a></p>
<p>数据集100元，  <strong>加微信 372335839， 备注「姓名-学校-专业」</strong>。</p>
<p><br><br></p>
<h2 id="一直接上代码">一、直接上代码</h2>
<h3 id="11-代码文件结构">1.1 代码文件结构</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">项目文件夹
   |---代码.ipynb
   |---GovReportData        #数据集 | 国、省、市三级政府工作报告文本
           |---city.csv     #市政府工作报告（2002-2024）
           |---province.csv #省政府工作报告（2002-2024）
           |---nation.csv   #国务院政府工作报告（2002-2024）
</code></pre></div><br>
<h3 id="12-读取数据">1.2 读取数据</h3>
<p>读取省报告数据文件 <strong>GovReportData/province.csv</strong> ，<a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">点击链接，获取政府工作报告数据集</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">pdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">)</span>
<span class="n">pdf</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="12-生成面板数据函数">1.3 生成面板数据函数</h3>
<p>假设你使用的政府(省、市)工作报告数据是大邓提供的，可以直接使用下面封装的函数，快速生成指定概念词、指定省份、指定年度区间的词频面板数据。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">concept_words</span><span class="p">,</span> <span class="n">selected_provs</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">selected_years</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    csvf: csv的文件路径
</span><span class="s2">    concept_words: 概念词词语列表
</span><span class="s2">    selected_provs: 筛选指定省份的数据进行计算，列表
</span><span class="s2">    selected_years: 筛选指定年度的数据进行计算，数字列表
</span><span class="s2">    
</span><span class="s2">    结果返回dataframe， 每一行代表一个省，每一列代表一年。
</span><span class="s2">    &#34;&#34;&#34;</span>
    
    <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
    <span class="kn">import</span> <span class="nn">jieba</span>
    
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">csvf</span><span class="p">)</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


    <span class="n">table_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> 
                       <span class="n">columns</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span>  <span class="c1">#列-年份</span>
                       <span class="n">index</span><span class="o">=</span><span class="s1">&#39;province&#39;</span><span class="p">,</span>    <span class="c1">#行-省份</span>
                       <span class="n">values</span><span class="o">=</span><span class="s1">&#39;doc&#39;</span><span class="p">,</span>   <span class="c1">#单元格-文本</span>
                       <span class="n">aggfunc</span><span class="o">=</span><span class="k">lambda</span> <span class="n">cs</span><span class="p">:</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cs</span><span class="p">))</span> <span class="c1">#让单元格填充文本</span>

    <span class="k">if</span> <span class="n">selected_provs</span><span class="p">:</span>
        <span class="n">table_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="p">[</span><span class="n">table_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_provs</span><span class="p">)]</span>
    
    <span class="k">if</span> <span class="n">selected_years</span><span class="p">:</span>
        <span class="n">selected_years</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">selected_years</span><span class="p">]</span>
        <span class="n">table_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="p">[</span><span class="n">selected_years</span><span class="p">]</span>


    <span class="n">word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">)))))</span>
    <span class="n">concept_word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">concept_words</span><span class="p">)))</span> <span class="c1">#逐词按字面计数后求和，避免 pm2.5 等词中的正则特殊字符误匹配</span>
    <span class="n">concept_word_ratio_df</span> <span class="o">=</span> <span class="n">concept_word_count_df</span><span class="o">/</span><span class="n">word_count_df</span>
    <span class="k">return</span> <span class="n">concept_word_ratio_df</span>


<span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;农村&#39;</span><span class="p">,</span> <span class="s1">&#39;农业&#39;</span><span class="p">,</span> <span class="s1">&#39;农民&#39;</span><span class="p">]</span>
<span class="c1">#所有省份，所有年度(2002-2024)</span>
<span class="n">panel_data_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span> 
                                         <span class="n">concept_words</span> <span class="o">=</span> <span class="n">concept_words</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="n">panel_data_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1">#如果需要保存</span>
<span class="c1">#panel_data_df.to_csv(&#39;省-三农-面板2001-2024.csv&#39;)</span>
<span class="c1">#panel_data_df.to_excel(&#39;省-三农-面板2001-2024.xlsx&#39;)</span>
<span class="n">panel_data_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(31, 24)
</code></pre></div><p><img loading="lazy" src="img/02-panel-data.png" alt=""  />
</p>
<br>
<p>生成山东省、河北省2010-2024年间政府工作报告提及三农词词频占比的面板数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;农村&#39;</span><span class="p">,</span> <span class="s1">&#39;农业&#39;</span><span class="p">,</span> <span class="s1">&#39;农民&#39;</span><span class="p">]</span>
<span class="n">selected_provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;河北省&#39;</span><span class="p">]</span>
<span class="n">selected_years</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">2025</span><span class="p">))</span>
<span class="n">panel_data_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span> 
                                         <span class="n">concept_words</span> <span class="o">=</span> <span class="n">concept_words</span><span class="p">,</span> 
                                         <span class="n">selected_provs</span> <span class="o">=</span> <span class="n">selected_provs</span><span class="p">,</span>
                                         <span class="n">selected_years</span> <span class="o">=</span> <span class="n">selected_years</span><span class="p">)</span>


<span class="c1">#如果需要保存</span>
<span class="c1">#panel_data_df.to_csv(&#39;山东河北-三农-面板2010-2024.csv&#39;)</span>
<span class="c1">#panel_data_df.to_excel(&#39;山东河北-三农-面板2010-2024.xlsx&#39;)</span>

<span class="n">panel_data_df</span>
</code></pre></div><p><img loading="lazy" src="img/03-hebei-shandong-panel-data.png" alt=""  />
</p>
<br>
<h3 id="13-绘制折线图">1.4 绘制折线图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
    <span class="kn">import</span> <span class="nn">matplotlib</span>
    <span class="kn">import</span> <span class="nn">scienceplots</span>
    <span class="kn">import</span> <span class="nn">platform</span>
    <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
    <span class="kn">import</span> <span class="nn">matplotlib_inline</span>
    <span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
    <span class="kn">import</span> <span class="nn">jieba</span>
    <span class="kn">import</span> <span class="nn">warnings</span>
    <span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>

    <span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
    <span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

    <span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
    <span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
    <span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
    
    
    <span class="n">panel_df_T</span> <span class="o">=</span> <span class="n">panel_df</span><span class="o">.</span><span class="n">T</span>

    <span class="n">ax</span> <span class="o">=</span> <span class="n">panel_df_T</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
    <span class="c1"># 添加图例，并指定位置和偏移</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper right&#39;</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.15</span><span class="p">,</span> <span class="mf">1.05</span><span class="p">))</span>


    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;词频&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>

    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><br>
<p>现在我们试试</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;农村&#39;</span><span class="p">,</span> <span class="s1">&#39;农业&#39;</span><span class="p">,</span> <span class="s1">&#39;农民&#39;</span><span class="p">]</span>
<span class="n">selected_provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;河北省&#39;</span><span class="p">]</span>
<span class="n">selected_years</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">2025</span><span class="p">))</span> <span class="c1">#2010年-2024年</span>

<span class="c1">#生成面板数据</span>
<span class="n">panel_data_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span>
                                         <span class="n">concept_words</span> <span class="o">=</span> <span class="n">concept_words</span><span class="p">,</span> 
                                         <span class="n">selected_provs</span> <span class="o">=</span> <span class="n">selected_provs</span><span class="p">,</span>
                                         <span class="n">selected_years</span> <span class="o">=</span> <span class="n">selected_years</span><span class="p">)</span>

<span class="c1">#绘图</span>
<span class="n">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">panel_data_df</span><span class="p">,</span> 
          <span class="n">title</span><span class="o">=</span><span class="s1">&#39;山东、河北三农词折线图(2010-2024)&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-hebei-shandong-plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二函数代码拆解">二、函数代码拆解</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<p>36.6M的数据，含 province、year、doc 等字段， <a href="2023-12-17-gov-anual-report-dataset/"><strong>点击获取政府工作报告数据集</strong></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-Python" data-lang="Python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">pdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">)</span>
<span class="n">pdf</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-构建透视表">2.2 构建透视表</h3>
<p>构建透视表，行索引名为省 prov，列名为时间year， 单元格内填充工作报告文本。</p>
<p>代码不用太深究，只要知道代码操作前后数据形态的变化即可。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">table_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">pdf</span><span class="p">,</span> 
                       <span class="n">columns</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span>  <span class="c1">#列-年份</span>
                       <span class="n">index</span><span class="o">=</span><span class="s1">&#39;province&#39;</span><span class="p">,</span>    <span class="c1">#行-省份</span>
                       <span class="n">values</span><span class="o">=</span><span class="s1">&#39;doc&#39;</span><span class="p">,</span>   <span class="c1">#单元格-文本</span>
                       <span class="n">aggfunc</span><span class="o">=</span><span class="k">lambda</span> <span class="n">cs</span><span class="p">:</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cs</span><span class="p">))</span> <span class="c1">#让单元格填充文本</span>

<span class="nb">print</span><span class="p">(</span><span class="n">table_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">table_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(31, 24)
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<p>table_df是一个31行， 24列的矩阵。 每行代表一个省，每一列代表一个年份。</p>
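<p>pivot_table 长表变宽表的过程，可以用一个极简的假设数据快速体会(省份、年份、文本均为虚构)：</p>

```python
import pandas as pd

# 构造长表：每行是"某省某年的一份报告文本"(假设数据)
df = pd.DataFrame({
    'province': ['山东省', '山东省', '河北省', '河北省'],
    'year':     ['2023', '2024', '2023', '2024'],
    'doc':      ['发展农业', '推进创新', '保护环境', '乡村振兴'],
})

# 透视成宽表：行-省份，列-年份，单元格-拼接后的文本
table_df = pd.pivot_table(df,
                          columns='year',
                          index='province',
                          values='doc',
                          aggfunc=lambda cs: ''.join(str(c) for c in cs))

print(table_df.shape)                  # (2, 2)
print(table_df.loc['山东省', '2024'])  # 推进创新
```

<p>同一省同一年若有多条记录，aggfunc 会把它们的文本拼接进同一个单元格，这正是正文代码选用 join 作为聚合函数的原因。</p>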
<br>
<h3 id="23-统计总词数">2.3 统计总词数</h3>
<p>统计所有报告的词语数。代码高度抽象， 咱们只看结果。 从 table_df 变为 word_count_df</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">jieba</span>
<span class="n">word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">)))))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">word_count_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">word_count_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(31, 24)
</code></pre></div><p><img loading="lazy" src="img/07-word_count_df.png" alt=""  />
</p>
<br>
<h3 id="24-统计概念词频占比">2.4 统计概念词频(占比)</h3>
<p>统计所有报告中，某概念词词频，以三农为例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;农村&#39;</span><span class="p">,</span> <span class="s1">&#39;农业&#39;</span><span class="p">,</span> <span class="s1">&#39;农民&#39;</span><span class="p">]</span>

<span class="n">concept_word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">concept_words</span><span class="p">)))</span> <span class="c1">#逐词按字面计数后求和</span>
<span class="nb">print</span><span class="p">(</span><span class="n">concept_word_count_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1">#为方便，只展示前5行</span>
<span class="n">concept_word_count_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(31, 24)
</code></pre></div><p><img loading="lazy" src="img/08-tri-df.png" alt=""  />
</p>
<br>
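<p>顺带一提：概念词计数的一种常见写法是把词表用 <code>'|'.join(concept_words)</code> 拼成模式交给 str.count。该模式会按正则表达式解释，概念词若含 <code>.</code> 等正则特殊字符(如后文环保词典中的 pm2.5)就会误匹配，需要先用 re.escape 转义。一个最小示例(文本为虚构)：</p>

```python
import re

words = ['环保', 'pm2.5']
text = '加强环保，治理pm2x5'   # 注意：文本里是 pm2x5，并非 pm2.5

raw  = '|'.join(words)                        # '.' 被当作正则通配符
safe = '|'.join(re.escape(w) for w in words)  # 转义后按字面匹配

print(len(re.findall(raw,  text)))   # 2：pm2.5 误匹配了 pm2x5
print(len(re.findall(safe, text)))   # 1：只匹配到"环保"
```
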
<p>将数据转化为词频占比，即 <strong>报告「三农词」出现次数/报告总词数</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">concept_word_ratio_df</span> <span class="o">=</span> <span class="n">concept_word_count_df</span><span class="o">/</span><span class="n">word_count_df</span>
<span class="nb">print</span><span class="p">(</span><span class="n">concept_word_ratio_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">concept_word_ratio_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(31, 24)
</code></pre></div><p><img loading="lazy" src="img/09-concept_word_ratio_df.png" alt=""  />
</p>
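<p>两个 DataFrame 相除时，pandas 会先按行、列标签对齐，再按元素相除。因此 concept_word_count_df 与 word_count_df 需要有相同的省份行、年份列(本例由同一个透视表派生，天然满足)。用假设数据演示：</p>

```python
import pandas as pd

provs = ['山东省', '河北省']
# 假设数据：概念词次数与总词数
concept_count = pd.DataFrame({'2023': [30, 12], '2024': [24, 18]}, index=provs)
total_count   = pd.DataFrame({'2023': [6000, 4000], '2024': [4800, 6000]}, index=provs)

ratio = concept_count / total_count   # 标签对齐后按元素相除
print(ratio.loc['山东省', '2023'])    # 0.005
print(ratio.loc['河北省', '2024'])    # 0.003
```
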
<br>
<p>到目前为止， 已经将一坨文本，转化为结构化的面板数据， 其实现在就可以保存起来啦。</p>
<br>
<h3 id="25-保存结果">2.5 保存结果</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">concept_word_ratio_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;concept_word_ratio.csv&#39;</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="三可视化">三、可视化</h2>
<h3 id="31-稍作解释">3.1 稍作解释</h3>
<p>可视化 plot_line 函数内部没有进行过多的数据变换，仅对面板数据做了转置和日期格式调整。本小节只稍作解释，马上进入后续的三个可视化案例。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">concept_word_ratio_df_T = concept_word_ratio_df.T
concept_word_ratio_df_T
</code></pre></div><p><img loading="lazy" src="img/10-T.png" alt=""  />
</p>
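<p>为什么要转置？DataFrame.plot 默认以行索引为 x 轴、每一列画一条折线。面板数据的行是省份、列是年份，转置后行索引变为年份，正好得到"x 轴是年份、每省一条线"的效果。小例子(假设数据)：</p>

```python
import pandas as pd

ratio = pd.DataFrame({'2023': [0.005, 0.003], '2024': [0.004, 0.006]},
                     index=['山东省', '河北省'])

ratio_T = ratio.T              # 转置：行变年份，列变省份
print(list(ratio_T.index))     # ['2023', '2024'] -> x 轴
print(list(ratio_T.columns))   # ['山东省', '河北省'] -> 两条折线
```
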
<br>
<h3 id="32-三农折线图">3.2 「三农」折线图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">selected_provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;河北省&#39;</span><span class="p">,</span> <span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;北京市&#39;</span><span class="p">,</span> <span class="s1">&#39;上海市&#39;</span><span class="p">,</span> <span class="s1">&#39;广东省&#39;</span><span class="p">,</span> <span class="s1">&#39;浙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;黑龙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南省&#39;</span><span class="p">]</span>
<span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;农村&#39;</span><span class="p">,</span> <span class="s1">&#39;农业&#39;</span><span class="p">,</span> <span class="s1">&#39;农民&#39;</span><span class="p">]</span>
<span class="n">tri_agri_panel_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span> 
                                             <span class="n">concept_words</span> <span class="o">=</span><span class="n">concept_words</span><span class="p">,</span> 
                                             <span class="n">selected_provs</span> <span class="o">=</span> <span class="n">selected_provs</span><span class="p">)</span>


<span class="n">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">tri_agri_panel_df</span><span class="p">,</span> 
          <span class="n">title</span><span class="o">=</span><span class="s1">&#39;8省市2002-2024年「三农」词频趋势&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/12-tri-agri.png" alt=""  />
</p>
<p>从上图中，可以看出</p>
<ul>
<li>2005年提及三农词占比最高的是湖南省，为20多年来8省市中的最高纪录</li>
<li>大多数省份在2007年达到峰值</li>
<li>2007年之前，工作报告中提及三农词的占比趋势是<strong>上升的</strong></li>
<li>2007年之后，工作报告中提及三农词的占比趋势是<strong>下降的</strong>。</li>
</ul>
<br>
<h3 id="33-创新折线图">3.3 「创新」折线图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">selected_provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;河北省&#39;</span><span class="p">,</span> <span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;北京市&#39;</span><span class="p">,</span> <span class="s1">&#39;上海市&#39;</span><span class="p">,</span> <span class="s1">&#39;广东省&#39;</span><span class="p">,</span> <span class="s1">&#39;浙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;黑龙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南省&#39;</span><span class="p">]</span>
<span class="n">concept_words</span> <span class="o">=</span>  <span class="p">[</span><span class="s1">&#39;科学&#39;</span><span class="p">,</span> <span class="s1">&#39;技术&#39;</span><span class="p">,</span> <span class="s1">&#39;创新&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;科技&#39;</span><span class="p">]</span>

<span class="n">inovation_panel_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span> 
                                              <span class="n">concept_words</span> <span class="o">=</span><span class="n">concept_words</span><span class="p">,</span> 
                                              <span class="n">selected_provs</span> <span class="o">=</span> <span class="n">selected_provs</span><span class="p">)</span>


<span class="n">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">inovation_panel_df</span><span class="p">,</span> 
          <span class="n">title</span><span class="o">=</span><span class="s1">&#39;8省市2002-2024年「创新」词频趋势&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/13-inovation-plot.png" alt=""  />
</p>
<p>从上图中，可以看出</p>
<ul>
<li>
<p>整体看，2002年以来八省市工作报告中提及科创相关词的比例是比较稳定的。</p>
</li>
<li>
<p>2010年之后， <strong>黑龙江</strong>是八省市中提及科创概念词最少的省份。</p>
</li>
<li>
<p>河北省2020年支棱起来了，提及科创概念词的占比不仅当年最高，而且是八省市所有年份中的最高值！</p>
</li>
</ul>
<br>
<h3 id="34-环保折线图">3.4 「环保」折线图</h3>
<p>参考 <code>陈诗一,陈登科.雾霾污染、政府治理与经济高质量发展[J].经济研究,2018,53(02):20-34.</code></p>
<p>该文选取省级政府工作报告中与环境相关词汇出现频数及其比重来度量<strong>政府环境治理政策</strong>（Chen et al., 2016）。该指标不仅较全面地度量了地方政府环境治理的力度，而且由于地方政府工作报告一般发布在年初，当年的经济发展无法反向影响事先已经确定的政府工作报告，从而可以缓解采用已有度量指标所产生的内生性问题。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">selected_provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;河北省&#39;</span><span class="p">,</span> <span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;北京市&#39;</span><span class="p">,</span> <span class="s1">&#39;上海市&#39;</span><span class="p">,</span> <span class="s1">&#39;广东省&#39;</span><span class="p">,</span> <span class="s1">&#39;浙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;黑龙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南省&#39;</span><span class="p">]</span>

<span class="c1">#词语来自 {陈诗一,陈登科.雾霾污染、政府治理与经济高质量发展[J].经济研究,2018,53(02):20-34.}</span>
<span class="n">concept_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;环境保护&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;污染&#39;</span><span class="p">,</span> <span class="s1">&#39;能耗&#39;</span><span class="p">,</span> <span class="s1">&#39;减排&#39;</span><span class="p">,</span> <span class="s1">&#39;排污&#39;</span><span class="p">,</span> 
                 <span class="s1">&#39;生态&#39;</span><span class="p">,</span> <span class="s1">&#39;绿色&#39;</span><span class="p">,</span> <span class="s1">&#39;低碳&#39;</span><span class="p">,</span> <span class="s1">&#39;空气&#39;</span><span class="p">,</span> <span class="s1">&#39;化学需氧量&#39;</span><span class="p">,</span> 
                 <span class="s1">&#39;二氧化硫&#39;</span><span class="p">,</span> <span class="s1">&#39;二氧化碳&#39;</span><span class="p">,</span> <span class="s1">&#39;pm10&#39;</span><span class="p">,</span> <span class="s1">&#39;pm2.5&#39;</span><span class="p">]</span>


<span class="n">environment_panel_df</span> <span class="o">=</span> <span class="n">generate_prov_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">,</span> 
                                                <span class="n">concept_words</span> <span class="o">=</span><span class="n">concept_words</span><span class="p">,</span> 
                                                <span class="n">selected_provs</span> <span class="o">=</span> <span class="n">selected_provs</span><span class="p">)</span>


<span class="n">plot_line</span><span class="p">(</span><span class="n">panel_df</span> <span class="o">=</span> <span class="n">environment_panel_df</span><span class="p">,</span> 
          <span class="n">title</span><span class="o">=</span><span class="s1">&#39;8省市2002-2024年「环保」词频趋势&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/14-enviroment-plot.png" alt=""  />
</p>
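<p>上文调用的 <code>generate_prov_panel_data</code> 与 <code>plot_line</code> 在前文已有完整实现。为方便理解其输入输出，这里给一个极简示意版（假设 csv 含 province、year、doc 三列；为使示例自足，分母用字符数近似，与正文基于 jieba 分词的口径略有差异）：</p>

```python
# 示意实现：生成「省份-年份-比例」面板数据并画折线
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # 非交互环境下也能出图
import matplotlib.pyplot as plt

def generate_prov_panel_data(csvf, concept_words, selected_provs):
    df = pd.read_csv(csvf)
    df = df[df['province'].isin(selected_provs)].copy()
    df['doc'] = df['doc'].fillna('')
    pattern = '|'.join(concept_words)            # 概念词拼成正则，「|」表示或
    df['ratio'] = df['doc'].str.count(pattern) / df['doc'].str.len().clip(lower=1)
    return df[['province', 'year', 'ratio']]

def plot_line(panel_df, title):
    plt.figure(figsize=(12, 6))
    for prov, sub in panel_df.groupby('province'):
        sub = sub.sort_values('year')
        plt.plot(sub['year'], sub['ratio'], label=prov)
    plt.legend()
    plt.title(title)
    plt.savefig('line.png')
```

<p>返回的 panel_df 即面板数据，可直接交给 plot_line 绘制各省折线。</p>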
<br>
<br>
<h2 id="四相关内容">四、相关内容</h2>
<h3 id="41-相关代码">4.1 相关代码</h3>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-27-measure-gov-digitalization/">代码 | 使用gov工作报告生成数字化词频「面板数据」</a></li>
</ul>
<blockquote>
<p>之前看到一篇论文研究人民网留言板问答中的政府回复行为， 控制变量使用的是政府数字化程度。</p>
<p>论文使用政府工作报告数字化词语提及次数， 用来测量政府的数字化程度。</p>
<p><strong>但从今天的实验看，用数字化词频测量政府数字化程度，不怎么准，  要慎重使用</strong>。</p>
</blockquote>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/">代码 | 使用「新闻数据」构造概念词提及量「面板数据」</a></li>
<li><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/">数据(付费) | 使用cctv新闻联播文稿构造面板数据</a></li>
</ul>
<br>
<h3 id="42-相关文献">4.2 相关文献</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]陈诗一,陈登科.雾霾污染、政府治理与经济高质量发展[J].经济研究,2018,53(02):20-34.
</code></pre></div><br>
<br>
<h2 id="五获取数据集">五、获取数据集</h2>
<p><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">数据集| 国、省、市三级政府工作报告文本</a></p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 国、省、市三级政府工作报告文本(1954-2024)</title>
      <link>https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/</link>
      <pubDate>Sat, 11 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-17-gov-anual-report-dataset/</guid>
      <description>&lt;h2 id=&#34;相关代码&#34;&gt;相关代码&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/&#34;&gt;代码 | 使用地方gov工作报告生成某类概念词词频面板数据&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一数据集&#34;&gt;一、数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-数据简介&#34;&gt;1.1 数据简介&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;国级(guo wu yuan)工作报告1954-2024, 记录数71

省级zf工作报告2002-2024, 记录数744

市级zf工作报告2003-2024, 记录数6204
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-说明&#34;&gt;1.2 说明&lt;/h3&gt;
&lt;p&gt;本文内容仅为科研分享， 不代表本人的政治立场。如有问题， 加微信 372335839，  备注「姓名-学校-专业-政府工作报告」。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-文件树目录&#34;&gt;1.3 文件树目录&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;|- 代码.ipynb

|- GovReportData
    |-nation
        |-1954.txt
        |-1955.txt
        |-...
        |-2023.txt
        |-2024.txt
        
    |-prov
        |-安徽省2001.txt
        |-...
        |-安徽省2024.txt
        |-...
        |-浙江省2024.txt
        
    |-city
        |-安康市2003.txt
        |-...
        |-安庆市2003.txt
        |-...
        |-安庆市2024.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-国级报告&#34;&gt;2.1 国级报告&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/nation.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-nation-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-省级报告&#34;&gt;2.2 省级报告&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/province.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-prov-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-市级报告&#34;&gt;2.3 市级报告&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/city.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-city-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三-实验-文本分析&#34;&gt;三、 实验-文本分析&lt;/h2&gt;
&lt;h3 id=&#34;31-国-词频&#34;&gt;3.1 国-词频&lt;/h3&gt;
&lt;p&gt;计算总词语数与某类概念词出现的次数，进而得到历年国级报告提及【环保】的频率。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;doc&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;doc&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;环保|环境|污染|青山|绿水&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-nation-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
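&lt;p&gt;上面的 env_num 用 Series.str.count 统计概念词出现的次数。传入的字符串会按正则表达式解释，其中「|」表示“或”。一个小例子：&lt;/p&gt;

```python
# Series.str.count 默认把 pattern 当作正则，统计每行命中概念词的总次数
import pandas as pd

s = pd.Series(['加强环保，治理污染', '发展经济'])
print(s.str.count('环保|环境|污染').tolist())  # [2, 0]
```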
&lt;br&gt;
&lt;h3 id=&#34;32-可视化&#34;&gt;3.2 可视化&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;



&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scatter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;国级报告中“环保概念词”提及频率折线图(1954-2024)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/gov-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;大家应该都学过正态分布：数据中大多数的记录会落在「均值±标准差」范围内。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/mean&amp;#43;std.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;这里计算 &lt;strong&gt;top_nation_mask&lt;/strong&gt;、&lt;strong&gt;bottom_nation_mask&lt;/strong&gt; 两个阈值，分别用于筛选最重视环保的年份和最忽视环保的年份。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;top_nation_mask&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;std&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;bottom_nation_mask&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;std&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;最重视环保的年份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;top_nation_mask&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;最忽视环保的年份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ndf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bottom_nation_mask&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;最重视环保的年份
[2001 2003 2005 2006 2007 2015 2016 2017 2019 2021 2023]

最忽视环保的年份
[1954 1955 1956 1957 1958 1959 1960 1964 1975 1978 1979 1980 1981 1983
 1985 1987]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;可以看到，进入21世纪后，国家对环保的重视从报告中就能看出；而在早期，由于生存是首要解决的问题，对环境保护的认识是不足的。&lt;/p&gt;
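&lt;p&gt;「均值±标准差」的筛选逻辑可以用一个小例子说明（为避免符号转义问题，这里用 .gt/.lt 方法代替比较运算符，效果相同）：&lt;/p&gt;

```python
# 用 均值+标准差 / 均值-标准差 作为上下阈值，筛出明显偏高或偏低的年份
import pandas as pd

s = pd.Series([1, 1, 1, 1, 10], index=[2001, 2002, 2003, 2004, 2005])
top = s.mean() + s.std()       # 约 6.82
bottom = s.mean() - s.std()    # 约 -1.22
print(s[s.gt(top)].index.tolist())     # [2005]，明显偏高的年份
print(s[s.lt(bottom)].index.tolist())  # []，没有明显偏低的年份
```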
&lt;br&gt;
&lt;h3 id=&#34;33-省-词频&#34;&gt;3.3 省-词频&lt;/h3&gt;
&lt;p&gt;计算总词语数与某类概念词出现的次数，进而得到各省历年报告提及【环保】的频率。省级报告共有744条记录，这里把筛选阈值设得更严格：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;top = mean + 3*std
bottom = mean - 2*std
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;大家可以自己设置条件的严格程度&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;doc&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;doc&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;环保|环境|污染|青山|绿水&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word_num&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;top_prov_mask&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;std&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;bottom_prov_mask&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;std&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;最重视环保的省(年份)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;env_ratio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;top_prov_mask&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;province&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;重视环保的结果挺合理：某人曾在浙江任职过，比较重视环保，浙江近年来也确实执行得早、环保搞得很好；而笔者家乡河北上榜，则主要与钢铁产业关停并转、守卫di都蓝天有很大关系。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;更多内容可在大邓博客 textdata.cn 中寻找相关代码。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="相关代码">相关代码</h2>
<p><a href="https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/">代码 | 使用地方gov工作报告生成某类概念词词频面板数据</a></p>
<p><br><br></p>
<h2 id="一数据集">一、数据集</h2>
<h3 id="11-数据简介">1.1 数据简介</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">国级(guo wu yuan)工作报告1954-2024, 记录数71

省级zf工作报告2002-2024, 记录数744

市级zf工作报告2003-2024, 记录数6204
</code></pre></div><br>
<h3 id="12-说明">1.2 说明</h3>
<p>本文内容仅为科研分享， 不代表本人的政治立场。如有问题， 加微信 372335839，  备注「姓名-学校-专业-政府工作报告」。</p>
<br>
<h3 id="13-文件树目录">1.3 文件树目录</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|- 代码.ipynb

|- GovReportData
    |-nation
        |-1954.txt
        |-1955.txt
        |-...
        |-2023.txt
        |-2024.txt
        
    |-prov
        |-安徽省2001.txt
        |-...
        |-安徽省2024.txt
        |-...
        |-浙江省2024.txt
        
    |-city
        |-安康市2003.txt
        |-...
        |-安庆市2003.txt
        |-...
        |-安庆市2024.txt
</code></pre></div><br>
<br>
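<p>正文直接读取的是整理好的 nation.csv。若手头只有 1.3 目录中的 txt 文件，也可以自行汇总生成（下面是一个示意，假设文件名即年份，字段名 year、doc 与后文一致）：</p>

```python
# 示意：把 GovReportData/nation 下的 txt 汇总为 DataFrame，再另存为 csv
from pathlib import Path
import pandas as pd

def build_nation_df(root='GovReportData/nation'):
    records = []
    for f in sorted(Path(root).glob('*.txt')):
        records.append({'year': int(f.stem),             # 文件名即年份，如 1954.txt
                        'doc': f.read_text(encoding='utf-8')})
    return pd.DataFrame(records)

# build_nation_df().to_csv('GovReportData/nation.csv', index=False)
```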
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-国级报告">2.1 国级报告</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">ndf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/nation.csv&#39;</span><span class="p">)</span>
<span class="n">ndf</span>
</code></pre></div><p><img loading="lazy" src="img/01-nation-df.png" alt=""  />
</p>
<br>
<h3 id="22-省级报告">2.2 Provincial reports</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">pdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/province.csv&#39;</span><span class="p">)</span>
<span class="n">pdf</span>
</code></pre></div><p><img loading="lazy" src="img/02-prov-df.png" alt=""  />
</p>
<br>
<h3 id="23-市级报告">2.3 Municipal reports</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/city.csv&#39;</span><span class="p">)</span>
<span class="n">cdf</span>
</code></pre></div><p><img loading="lazy" src="img/03-city-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三-实验-文本分析">3. Experiment: text analysis</h2>
<h3 id="31-国-词频">3.1 National word frequency</h3>
<p>Count the total number of words and the occurrences of environment-related terms, then compute how frequently each report mentions environmental protection.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">jieba</span>

<span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">text</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;环保|环境|污染|青山|绿水&#39;</span><span class="p">)</span>
<span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>
<span class="n">ndf</span>
</code></pre></div><p><img loading="lazy" src="img/04-nation-df.png" alt=""  />
</p>
<br>
<h3 id="32-可视化">3.2 Visualization</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">jieba</span>

<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># detect the operating system</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># set the global font</span>



<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">ndf</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">],</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">],</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;国级报告中“环保概念词”提及频率折线图(1954-2024)&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/gov-plot.png" alt=""  />
</p>
<br>
<p>Recall from the normal distribution that most observations in a dataset fall within one standard deviation of the mean.</p>
<p><img loading="lazy" src="img/mean&#43;std.png" alt=""  />
</p>
<p>Here we set two cutoffs, <strong>top_nation_mask</strong> and <strong>bottom_nation_mask</strong> (despite the names, these are scalar thresholds rather than boolean masks), to flag the years that pay the most and the least attention to environmental protection.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">top_nation_mask</span> <span class="o">=</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">+</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
<span class="n">bottom_nation_mask</span> <span class="o">=</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">-</span> <span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;最重视环保的年份&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ndf</span><span class="p">[</span><span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">&gt;</span><span class="n">top_nation_mask</span><span class="p">]</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>

<span class="nb">print</span><span class="p">()</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;最忽视环保的年份&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ndf</span><span class="p">[</span><span class="n">ndf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">&lt;</span><span class="n">bottom_nation_mask</span><span class="p">][</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">最重视环保的年份
[2001 2003 2005 2006 2007 2015 2016 2017 2019 2021 2023]

最忽视环保的年份
[1954 1955 1956 1957 1958 1959 1960 1964 1975 1978 1979 1980 1981 1983
 1985 1987]
</code></pre></div><br>
<p>It is clear from the reports alone that the state's attention to environmental protection rises sharply after the turn of the 21st century. In the earlier decades, survival came first, and awareness of environmental protection was still lacking.</p>
<br>
<h3 id="32-省-词频">3.3 Provincial word frequency</h3>
<p>Count the total words and the environment-related mentions, then compute each province's mention frequency. Because there are over seven hundred provincial records, we now tighten the cutoffs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">top    = mean + 3*std
bottom = mean - 2*std
</code></pre></div><p>You can adjust how strict these cutoffs are yourself.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">text</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;doc&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;环保|环境|污染|青山|绿水&#39;</span><span class="p">)</span>
<span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_num&#39;</span><span class="p">]</span><span class="o">/</span><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span>

<span class="n">top_prov_mask</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">+</span> <span class="mi">3</span><span class="o">*</span><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
<span class="n">bottom_prov_mask</span> <span class="o">=</span> <span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">-</span> <span class="mi">2</span><span class="o">*</span><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;最重视环保的省(年份)&#39;</span><span class="p">)</span>
<span class="n">pdf</span><span class="p">[</span><span class="n">pdf</span><span class="p">[</span><span class="s1">&#39;env_ratio&#39;</span><span class="p">]</span><span class="o">&gt;</span><span class="n">top_prov_mask</span><span class="p">][[</span><span class="s1">&#39;province&#39;</span><span class="p">,</span> <span class="s1">&#39;year&#39;</span><span class="p">]]</span>
</code></pre></div><p><img loading="lazy" src="img/07-df.png" alt=""  />
</p>
<br>
<p>The environment-focused results look quite reasonable. A certain leader once served in Zhejiang and valued environmental protection, and in recent years Zhejiang has indeed enforced it early and done it well. Hebei, the author's home province, shows up mainly because its steel industry was shut down or restructured to defend the capital's blue skies.</p>
<br>
<br>
<p>More related code can be found on Dadeng's blog at textdata.cn.</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 2006年-2023年A股企业社会责任报告/环境报告书/可持续发展报告</title>
      <link>https://textdata.cn/blog/2023-08-11-china-a-market-corporate-social-responsibility-dataste/</link>
      <pubDate>Wed, 08 May 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-11-china-a-market-corporate-social-responsibility-dataste/</guid>
<description>Corporate social responsibility (CSR) has become a hot topic in academic research worldwide.</description>
<content:encoded><![CDATA[<p>CSR data are mostly unstructured text, suitable for text-analysis tasks such as word-frequency statistics, sentiment analysis and topic modeling. Today we present an A-share CSR dataset. <strong>Readers interested in text analysis are welcome to sign up for the video course 「Python实证指标构建与文本分析」</strong>. This post only showcases the dataset and runs a simple analysis.</p>
<br>
<h2 id="一csr数据集">1. The CSR dataset</h2>
<p>This is currently the most complete collection of such raw data available; it has been compiled into a compressed CSV file (308 MB).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">「A股企业社会责任报告数据集」 at a glance
- 14845 records
- 2383 listed companies (Shanghai and Shenzhen)
- fiscal years 2006-2023
- publication dates 2007-03-14 to 2024-06-22
- txt, pdf and csv formats
</code></pre></div><p><img loading="lazy" src="img/cover.png" alt=""  />
</p>
<br>
<h3 id="声明">Disclaimer</h3>
<p>For research use only. For questions, add WeChat 372335839 with the note 「姓名-学校-专业」 (name-school-major).</p>
<p><br><br></p>
<h2 id="二相关文献">2. Related literature</h2>
<p>In recent years, corporate social responsibility (CSR) has become a hot topic in academic research worldwide:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]解学梅,朱琪玮.企业绿色创新实践如何破解“和谐共生”难题？[J].管理世界,2021,37(01):128-149+9.
[2]谢红军,吕雪.负责任的国际投资：ESG与中国OFDI[J].经济研究,2022,57(03):83-99.
[3]Schaefer, Sarah Desirée, Ralf Terlutter, and Sandra Diehl. &#34;Is my company really doing good? Factors influencing employees&#39; evaluation of the authenticity of their company&#39;s corporate social responsibility engagement.&#34; Journal of business research 101 (2019): 128-143.
</code></pre></div><br>
<br>
<h2 id="三实验">3. Experiment</h2>
<h3 id="31-读取数据">3.1 Loading the data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;CSR2006-2023.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<h3 id="32-字段">3.2 Fields</h3>
<p><em><strong>CSR2006-2023.csv.gz</strong></em> contains the following fields:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- code      stock code
- name      company short name
- year      fiscal year
- pub_date  publication date
- type      report type:
    - corporate social responsibility (CSR)
    - environment, social and governance (ESG)
    - sustainable development (SD)
    - environmental report (ENV)
    A report may be of one type or a combination of several.
</code></pre></div><p>Count the records by report type:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">type
#CSR           11900
#ESG            1982
#SD              447
#CSR#ESG         232
#ENV             211
#ESG#SD           42
#CSR#SD           28
#SD#ESG            2
#CSR#ESG#SD        1
Name: count, dtype: int64
</code></pre></div><br>
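<p>Because one report can carry several &#39;#&#39;-delimited tags (e.g. #CSR#ESG), counting each type on its own requires splitting the tag string first. A small sketch; the helper name <code>count_types</code> is our own, assuming the <code>type</code> column shown above:</p>

```python
# Tally each individual report type: a combined tag such as '#CSR#ESG'
# contributes one count to CSR and one to ESG.
import pandas as pd

def count_types(type_col):
    """Split '#'-delimited tags and count every individual type."""
    return (type_col.str.strip('#')   # drop the leading '#'
                    .str.split('#')   # '#CSR#ESG' -> ['CSR', 'ESG']
                    .explode()        # one row per tag
                    .value_counts())

# count_types(df['type'])
```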
<h3 id="33-记录数">3.3 Record counts</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># number of reports in the dataset</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<pre><code>14845
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># number of companies that published reports</span>
<span class="n">df</span><span class="o">.</span><span class="n">code</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<pre><code>2383
</code></pre>
<br>
<h3 id="34-会计年度">3.4 Fiscal years</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># years covered by the reports</span>

<span class="c1">#sorted(df[&#39;year&#39;].unique())</span>
<span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<pre><code>[2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021,
 2022,
 2023]
</code></pre>
<br>
<h3 id="35-发布日期">3.5 Publication dates</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2007-03-14 00:00:00
2024-06-22 00:00:00
</code></pre></div><p><br><br></p>
<h2 id="四esg年度发布量">4. Reports published per year</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1"># 文泉驿微米黑.ttf (WenQuanYi Micro Hei) sits in the same folder as this code</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">())</span>
<span class="n">data</span><span class="o">.</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">]</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_col</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;volume&#39;</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;A股企业社会责任报告数(2006~2023)&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;报告数&#39;</span><span class="p">)</span>
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="五沪深发布量">5. Publication volume by exchange</h2>
<p>Dadeng recalls that most Shenzhen-listed stock codes start with 0 and most Shanghai-listed codes start with 6, so the leading digit gives a rough split of publication volume between the two exchanges. Note that codes in this dataset carry an 'A' prefix, so the relevant digit sits at string index 1.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># slice: take the second character of the stock code string</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">stop</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<pre><code>code
6    8339
0    5193
3    1265
8      19
9      17
2      10
4       2
Name: count, dtype: int64
</code></pre>
<br>
<p>The output shows that, besides 0 and 6, the digits 2, 3, 4, 8 and 9 also appear. In summary, by leading digit the codes break down as:</p>
<ul>
<li>
<p>0: Shenzhen Stock Exchange (SZSE)</p>
</li>
<li>
<p>3: ChiNext board</p>
</li>
<li>
<p>6: Shanghai Stock Exchange (SSE)</p>
</li>
<li>
<p>others</p>
</li>
</ul>
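<p>The digit rules above can be wrapped in a tiny helper. A sketch with our own function name, assuming the 'A'-prefixed codes used in this dataset:</p>

```python
# Map an 'A'-prefixed stock code to its board by the first digit,
# following the digit rules listed above.
BOARD = {'0': 'SZSE', '3': 'ChiNext', '6': 'SSE'}

def board_of(code):
    """Return the board for codes like 'A600519'; unknown digits -> 'other'."""
    return BOARD.get(code[1], 'other')

# df['board'] = df['code'].apply(board_of)
```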
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;A6&#39;</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;A0&#39;</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># stocks whose leading digit is 2 or 9</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">&#39;A2|A9&#39;</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/df4.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>金融研究 | 使用Python测量关键审计事项的「信息含量」</title>
      <link>https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/</link>
      <pubDate>Tue, 30 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-01-13-information-content-of-critical-audit/</guid>
<description>Critical audit matters convey information from the auditor's perspective, and the idiosyncratic information they contain is essential to realizing their communicative value. This paper measures the idiosyncratic information content of critical audit matters by text similarity computed with text-analysis methods, and examines its effect on corporate bond issuance pricing. The results show that higher information content, proxied by lower text similarity, lowers corporate bond issuance pricing; stronger auditor professional competence and independence strengthen this effect; and the easing of information asymmetry is the channel through which it works. By type, the effect is stronger for critical audit matters concerning related-party transactions. The conclusions can inform future improvements to disclosure requirements for critical audit matters.</description>
<content:encoded><![CDATA[<p>Today I share a second algorithm for measuring 「信息含量」 (information content). Unlike the one in the earlier post <a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | MD&amp;A信息含量指标构建代码实现</a>, today's algorithm is easier to understand and runs faster.</p>
<p><br><br></p>
<h2 id="一信息含量">1. Information content</h2>
<h3 id="11-文献">1.1 Literature</h3>
<p>宋建波,冯晓晴.关键审计事项信息含量与公司债券发行定价——基于文本相似度视角[J].会计研究,2022,(03):174-191.
<img loading="lazy" src="img/key-audit_cover.png" alt=""  />
</p>
<br>
<h3 id="12-信息的分类">1.2 Two types of information</h3>
<ul>
<li><strong>Standard information</strong>: content in the critical-audit-matters section that repeats or resembles that of other firms in the same industry, defined as carrying no information content.</li>
<li><strong>Idiosyncratic information</strong>: content that distinguishes the firm from its industry peers, defined as the truly informative part. Compared with standard information, it is the idiosyncratic information that eases information asymmetry between the company and investors.</li>
</ul>
<p><br><br></p>
<h2 id="二算法">2. Algorithm</h2>
<p>Based on the <strong>vector space model</strong> (VSM), the paper measures the idiosyncratic information content of a firm's critical audit matters by the cosine similarity between the firm's text and the texts of the other firms in its industry, with lower similarity indicating higher information content.</p>
<p>Roughly, the measure is computed as follows:</p>
<ul>
<li>
<p><strong>Vectorize the texts.</strong></p>
<ul>
<li>Use TF-IDF to turn firm i's audit text into a vector <em><strong>Corp_Vec_it</strong></em></li>
<li>Average the industry peers' vectors <em><strong>Corp_Vec_jt</strong></em> into the mean vector <em><strong>Industry_Vec_t</strong></em>. Note that the focal firm itself must be excluded when computing this mean.</li>
</ul>
</li>
<li>
<p><strong>Compute the cosine similarity cosine(Corp_Vec_it, Industry_Vec_t)</strong></p>
</li>
<li>
<p><code>信息含量 = -cosine(Corp_Vec_it, Industry_Vec_t)</code></p>
</li>
</ul>
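<p>The steps above can be sketched end to end. This is a minimal illustration rather than the paper's exact code: <code>docs</code> holds space-tokenized texts of all firms in one industry-year, the TF-IDF weights are built by hand to keep the sketch dependency-light, and the focal firm is excluded from the peer mean.</p>

```python
# info_content(firm i) = -cosine(vec_i, mean of peer vectors), where peers
# are the other firms in the same industry-year (leave-one-out mean).
import math

import numpy as np

def info_content(docs):
    """docs: space-tokenized texts of every firm in one industry-year."""
    tokens = [d.split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    n = len(docs)
    # document frequency -> idf (the "+1" keeps common words from vanishing)
    idf = np.array([math.log(n / sum(w in set(t) for t in tokens)) + 1
                    for w in vocab])
    tf = np.array([[t.count(w) for w in vocab] for t in tokens], dtype=float)
    mat = tf * idf                      # TF-IDF matrix, one row per firm
    scores = []
    for i in range(n):
        peers = np.delete(mat, i, axis=0).mean(axis=0)   # exclude firm i
        cos = mat[i] @ peers / (np.linalg.norm(mat[i]) * np.linalg.norm(peers))
        scores.append(-cos)             # lower similarity = more information
    return scores
```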
<p><br><br></p>
<h2 id="三代码实现">3. Implementation</h2>
<h3 id="31-文件结构">3.1 File layout</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 金融研究2023信息含量文件夹
    - 代码.ipynb                                    # notebook with the code

    - data                                         # data folder
       - mda01-23.csv.gz                           # MD&amp;A texts
       - 上市公司基本信息2000-2023.xlsx               # stock industry info

    - 关键审计-信息含量01-23.csv                      # computed results
</code></pre></div><br>
<h3 id="32-读取数据">3.2 Loading the data</h3>
<p>The paper describes its data as follows:</p>
<blockquote>
<p>For all A-share companies, the new standard first required critical audit matters in the audit reports for fiscal year 2017. Since those reports were released in 2018, bond investors could only obtain FY2017 critical audit matters in 2018 and factor them into bond investment decisions from then on. This paper therefore examines how the critical audit matters in FY2017-2018 audit reports affect the issuance pricing of 357 corporate bonds issued by non-financial listed companies in 2018-2019. The information-content measure is computed by text analysis in Python; data on restrictive bond covenants are hand-collected; other data come from the CSMAR database. All continuous variables are winsorized at the 1% and 99% percentiles.</p>
</blockquote>
<br>
<p>Dadeng does not have an 「<strong>audit report text</strong>」 dataset at hand, so the 「<strong>management discussion and analysis (MD&amp;A)</strong>」 texts are used instead.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># load the MD&amp;A texts</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


<span class="c1">#上市公司行业信息</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.xlsx&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">])</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[</span><span class="n">ind_info_df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_info_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="n">date</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">:</span><span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">:</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">]]</span>

<span class="c1">#合并数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>

<span class="c1"># 剔除金融行业处理</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;J&#34;</span><span class="p">)]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>

<span class="c1">#行业内企业数量过少，会导致行业向量与某个或某几个企业向量相关性增大，极端情况下，一个企业就是一个行业。剔除掉企业数较少的行业，这里只保留大于20的行业。</span>
<span class="n">ind_codes</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">ind_codes</span> <span class="o">=</span> <span class="n">ind_codes</span><span class="p">[</span><span class="n">ind_codes</span><span class="o">&gt;</span><span class="mi">20</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">ind_codes</span><span class="p">)]</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="33-文本向量化">3.3 文本向量化</h3>
<p>使用 sklearn，将企业文本(本文以管理层讨论与分析代替审计报告文本)转为 TF-IDF 企业向量。步骤如下：</p>
<ol>
<li>分词整理</li>
<li><code>tf-idf</code>文本向量化</li>
<li>合并多个字段为新的df</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#cntext1.9.2</span>
<span class="c1">#stopwords = ct.load_pkl_dict(&#39;STOPWORDS.pkl&#39;)[&#39;STOPWORDS&#39;][&#39;chinese&#39;]</span>

<span class="c1">##cntext2.1.7</span>
<span class="n">stopwords</span><span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;enzh_common_StopWords.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#只保留md&amp;a中的中文内容</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="c1">#剔除停用词</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="c1">#整理为用空格间隔的字符串(类西方语言文本格式)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>



<span class="n">df</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">transform</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 54min 3s, sys: 56.4 s, total: 54min 59s
Wall time: 55min 16s
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>

<span class="n">cv</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">min_df</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> 
<span class="c1"># 生成稀疏bow矩阵</span>
<span class="c1">#dtm 文档-词频-矩阵</span>
<span class="n">dtm_df</span> <span class="o">=</span> <span class="n">cv</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">])</span> 
<span class="c1">#保证新生成的dtm_df.index 与 df.index 完全相同</span>
<span class="n">dtm_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm_df</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
<span class="n">dtm_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 1min 2s, sys: 1.5 s, total: 1min 4s
Wall time: 1min 4s
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="36-小实验">3.6 小实验</h3>
<p>指定某年份，某公司，某行业， 尝试着分别得到公司向量、行业向量、信息含量。</p>
<ul>
<li>
<p>使用TF-IDF将公司审计文本向量化 <em><strong>Corp_Vec_it</strong></em></p>
</li>
<li>
<p>计算公司所在行业内众多公司向量 <em><strong>Corp_Vec_jt</strong></em> 的均值，得到行业向量 <em><strong>Industry_Vec_t</strong></em>。注意计算均值向量时要剔除该公司自身。</p>
</li>
<li>
<p><strong>余弦相似度cosine(Corp_Vec_it, Industry_Vec_t)</strong></p>
</li>
<li>
<p><code>信息含量 = -cosine(Corp_Vec_it, Industry_Vec_t)</code></p>
</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics.pairwise</span> <span class="kn">import</span> <span class="n">cosine_similarity</span>

<span class="c1">#小实验</span>
<span class="n">year</span> <span class="o">=</span> <span class="s1">&#39;2023&#39;</span>
<span class="n">ind</span> <span class="o">=</span> <span class="s1">&#39;K70&#39;</span>
<span class="n">code</span> <span class="o">=</span> <span class="s1">&#39;A000002&#39;</span>

<span class="c1">#筛选条件</span>
<span class="n">year_mask</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">year</span>
<span class="n">ind_mask</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">ind</span>
<span class="n">corp_mask</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span>

<span class="c1">#提取公司向量</span>
<span class="n">selected_corp_index</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">year_mask</span> <span class="o">&amp;</span> <span class="n">ind_mask</span> <span class="o">&amp;</span> <span class="n">corp_mask</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
<span class="n">corp_vec</span> <span class="o">=</span> <span class="n">dtm_df</span><span class="p">[</span><span class="n">dtm_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_corp_index</span><span class="p">)]</span><span class="o">.</span><span class="n">values</span>
<span class="n">corp_arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">corp_vec</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;公司向量: &#39;</span><span class="p">,</span> <span class="n">corp_arr</span><span class="p">)</span>

<span class="c1">#计算行业均值向量</span>
<span class="n">selected_ind_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">ind_mask</span> <span class="o">&amp;</span> <span class="n">year_mask</span><span class="p">]</span>
<span class="n">selected_indexs</span> <span class="o">=</span> <span class="n">selected_ind_df</span><span class="p">[</span><span class="n">selected_ind_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
<span class="n">ind_vec</span> <span class="o">=</span> <span class="n">dtm_df</span><span class="p">[</span><span class="n">dtm_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indexs</span><span class="p">)]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">values</span>
<span class="n">ind_arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">ind_vec</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;行业向量: &#39;</span><span class="p">,</span> <span class="n">ind_arr</span><span class="p">)</span>

<span class="c1">#计算信息含量</span>
<span class="n">special_info</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">corp_arr</span><span class="p">,</span> <span class="n">ind_arr</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;信息含量: &#39;</span><span class="p">,</span> <span class="n">special_info</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">公司向量:  [[0.         0.01495101 0.00455808 ... 0.         0.         0.        ]]
行业向量:  [[...]]
信息含量:  -0.5683186993629404
</code></pre></div><br>
<h3 id="25-批量计算信息含量">3.7 批量计算信息含量</h3>
<ol>
<li>新建 <em><strong>信息含量.csv</strong></em> ， 含字段 <code>['股票代码', '会计年度', '行业代码', '信息含量']</code></li>
<li>先按年份对 <em><strong>df</strong></em> 进行分组，得到多个 <em><strong>y_df</strong></em>；每个 <em><strong>y_df</strong></em> 含某一年的多条企业mda记录</li>
<li>用双层 <em><strong>for</strong></em> 循环遍历每年(<em><strong>y_df</strong></em>)内的每条企业mda记录，构建公司向量、行业向量，计算信息含量</li>
<li>将相关计算结果写入到csv中。</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics.pairwise</span> <span class="kn">import</span> <span class="n">cosine_similarity</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>



<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;关键审计-信息含量01-23.csv&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvf</span><span class="p">:</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;信息含量&#39;</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    
    <span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">y_df</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">),</span> <span class="n">desc</span><span class="o">=</span><span class="s1">&#39;分析进度&#39;</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">y_df</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
                <span class="n">data</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
                <span class="n">code</span> <span class="o">=</span> <span class="n">y_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
                <span class="n">data</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">code</span>
                <span class="n">industry</span> <span class="o">=</span> <span class="n">y_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]</span>
                <span class="n">data</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">industry</span>

                <span class="c1">#筛选条件mask</span>
                <span class="n">ind_mask</span> <span class="o">=</span> <span class="n">y_df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">industry</span><span class="si">}</span><span class="s1">&#39;</span>
                <span class="n">corp_mask</span> <span class="o">=</span> <span class="n">y_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">code</span><span class="si">}</span><span class="s1">&#39;</span>
                <span class="n">year_mask</span> <span class="o">=</span> <span class="n">y_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">&#39;</span>
                
                
                <span class="c1">#某年某公司a</span>
                <span class="n">selected_corp_index</span> <span class="o">=</span> <span class="n">y_df</span><span class="p">[</span><span class="n">ind_mask</span> <span class="o">&amp;</span> <span class="n">corp_mask</span> <span class="o">&amp;</span> <span class="n">year_mask</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
                <span class="n">corp_vec</span> <span class="o">=</span> <span class="n">dtm_df</span><span class="p">[</span><span class="n">dtm_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_corp_index</span><span class="p">)]</span><span class="o">.</span><span class="n">values</span>
                <span class="n">corp_arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">corp_vec</span><span class="p">)</span>

                <span class="c1">#某year，某行业(排除公司a)</span>
                <span class="n">selected_ind_df</span> <span class="o">=</span> <span class="n">y_df</span><span class="p">[</span><span class="n">ind_mask</span> <span class="o">&amp;</span> <span class="n">year_mask</span><span class="p">]</span>
                <span class="n">selected_indexs</span> <span class="o">=</span> <span class="n">selected_ind_df</span><span class="p">[</span><span class="n">selected_ind_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
                <span class="n">ind_vec</span> <span class="o">=</span> <span class="n">dtm_df</span><span class="p">[</span><span class="n">dtm_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indexs</span><span class="p">)]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">values</span>
                <span class="n">ind_arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">ind_vec</span><span class="p">])</span>

                <span class="c1">#信息含量</span>
                <span class="n">special_info</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">corp_arr</span><span class="p">,</span> <span class="n">ind_arr</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
                <span class="n">data</span><span class="p">[</span><span class="s1">&#39;信息含量&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">special_info</span>
                <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
            <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
                <span class="k">pass</span>
            
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">分析进度: 100%|█████████████████████████████████| 22/22 [01:58&lt;00:00,  5.37s/it]
CPU times: user 1min 55s, sys: 2.91 s, total: 1min 57s
Wall time: 1min 58s
</code></pre></div><p><br><br></p>
<h2 id="四查看结果">四、查看结果</h2>
<p>欣赏一下计算结果 <em><strong>关键审计-信息含量01-23.csv</strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">idf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;关键审计-信息含量01-23.csv&#39;</span><span class="p">)</span>
<span class="n">idf</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<br>
<h2 id="五相关内容">五、相关内容</h2>
<p>最近陆续分享了几篇<strong>文本相似度</strong>、<strong>信息含量</strong>的论文</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]姜富伟,胡逸驰,黄楠.央行货币政策报告文本信息、宏观经济与股票市场[J].金融研究,2021,(06):95-113.
[2]宋建波,冯晓晴.关键审计事项信息含量与公司债券发行定价——基于文本相似度视角[J].会计研究,2022,(03):174-191.
[3]孟庆斌,杨俊华,鲁冰.管理层讨论与分析披露的信息含量与股价崩盘风险——基于文本向量化方法的研究[J].中国工业经济,2017,(12):132-150. 
</code></pre></div><br>
<p>比较一下，三者均先使用文本向量化，将文本数据转为向量。每篇论文的算法如下：</p>
<br>
<table>
<thead>
<tr>
<th>论文</th>
<th>指标</th>
<th>算法</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td><a href="https://textdata.cn/blog/2023-01-10-similarity_of_cental_bank_monetary_policy/">文本相似度</a></td>
<td>将央行货币政策报告向量化， 临近的两个报告文本向量计算相似度，相似度越高，金融市场波动性越小。</td>
</tr>
<tr>
<td>[2]</td>
<td>信息含量（本文)</td>
<td>将同行业内所有企业向量Corp求均值得到行业向量Ind，求Corp与Ind的余弦相似度，并将结果乘以(-1)，所得结果定义为信息含量。</td>
</tr>
<tr>
<td>[3]</td>
<td><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">信息含量</a></td>
<td>文本向量化+计量建模，认为md&amp;a中的信息向量Norm可以由市场Norm_Market、行业Norm_Industry、企业异质性μ三种信息向量组成，通过估计 <br><code>Norm = a0 + a1*Norm_Industry +  a2*Norm_Market + μ</code> <br>，将μ向量的绝对值之和作为信息含量，而将a1+a2视为标准信息。</td>
</tr>
</tbody>
</table>
<br>
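论文[3]的分解思路可以用模拟数据做一个最小演示。以下数据与系数均为假设，仅示意"最小二乘分解 + μ绝对值求和"的计算过程：

```python
import numpy as np

# 模拟数据(假设)：行业信息向量、市场信息向量与公司信息向量Norm
rng = np.random.default_rng(0)
norm_industry = rng.random(50)
norm_market = rng.random(50)
norm = 0.3 * norm_industry + 0.5 * norm_market + rng.normal(0, 0.01, 50)

# 最小二乘估计 Norm = a0 + a1*Norm_Industry + a2*Norm_Market + μ
X = np.column_stack([np.ones(50), norm_industry, norm_market])
coef, *_ = np.linalg.lstsq(X, norm, rcond=None)

# 残差μ即企业异质性信息，μ向量绝对值之和作为信息含量
mu = norm - X @ coef
info_content = np.abs(mu).sum()
```

估计出的 coef 约为 [0, 0.3, 0.5]，残差μ越大，说明公司信息中无法被行业、市场解释的特质性成分越多。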
<p>从中可以看到，两个向量的余弦相似度在不同场景下的解读含义是不同的。</p>
<ul>
<li>在货币政策场景中，相似度越高，表示政策越稳定，金融市场波动性越小。</li>
<li>而在关键审计场景中，特质性信息是缓解公司与投资者信息不对称的关键，公司向量Corp与行业向量Ind相似度越高，表示公司审计报告文本特质性信息越少。</li>
</ul>
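余弦相似度及取负的计算本身很简单，下面用纯Python写一个最小示例。其中 corp、ind 是假设的向量片段，仅示意计算过程：

```python
import math

def cosine(u, v):
    # 余弦相似度 = 点积 / (两个向量模长的乘积)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

corp = [0.2, 0.0, 0.5, 0.3]    # 假设的公司TF-IDF向量片段
ind = [0.25, 0.1, 0.45, 0.2]   # 假设的行业均值向量片段
info = -cosine(corp, ind)      # 信息含量：相似度越高，特质性信息越少
```

方向完全相同的两个向量相似度为1，对应的信息含量为-1；正交向量相似度为0。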
<p><br><br></p>
<h2 id="六资料获取">六、资料获取</h2>
<p>数据&amp;代码创作不易， 200元， 如果需要源代码和数据， 加微信372335839， 备注「姓名-学校-专业」</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">打包价 100元
  1. 管理层讨论与分析(mda01-23.csv.gz)、上市公司基本信息2000-2023.xlsx
  2. 计算结果(关键审计-信息含量01-23.csv)
  
零卖价
- 100元  管理层讨论与分析(mda01-23.csv.gz)、上市公司基本信息2000-2023.xlsx
- 50元   计算结果(关键审计-信息含量01-23.csv)
</code></pre></div><br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>管理世界2024 | 使用管理层讨论与分析测量「企业人工智能指标」</title>
      <link>https://textdata.cn/blog/2024-04-19-ai-improve-firm-productivity/</link>
      <pubDate>Mon, 29 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-19-ai-improve-firm-productivity/</guid>
      <description>&lt;h2 id=&#34;一案例&#34;&gt;一、案例&lt;/h2&gt;
&lt;h3 id=&#34;11-文献&#34;&gt;1.1 文献&lt;/h3&gt;
&lt;p&gt;姚加权, 张锟澎, 郭李鹏, 冯绪. 人工智能如何提升企业生产效率？——基于劳动力技能结构调整的视角[J]. 管理世界, 2024, 40 (02): 101-116+133+117-122.&lt;/p&gt;
&lt;p&gt;摘要:人工智能技术对实现经济的高质量发展具有重要意义。现有研究多聚焦于人工智能对宏观经济的影响，本文从企业层面考察了人工智能技术如何影响生产效率和劳动力技能结构。&lt;strong&gt;本文运用机器学习方法生成了「人工智能词典」，并对上市公司的年报和专利进行「文本分析」，进而构建了企业层面的「人工智能指标」&lt;/strong&gt;。研究发现，人工智能显著提升了中国上市公司的生产率，并且该结论在一系列稳健性检验后依旧成立。在影响机制方面，人工智能通过促使企业减少常规低技能劳动力需求、增加非常规高技能劳动力需求的方式提升企业的生产率，这体现了企业劳动力技能结构的调整。异质性分析表明，产权性质、人才获得方式、劳动力保障、治理结构等企业层面因素对人工智能的生产率效应有较大影响。此外，企业所处的行业和地区层面因素也影响了人工智能的生产率效应。最后，本文发现人工智能提高了企业价值。本文加深了对微观企业层面人工智能在生产过程中所扮演角色的认知和理解，并为在微观企业层面推动人工智能技术发展提供了建议。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-指标构建步骤&#34;&gt;1.2 指标构建步骤&lt;/h3&gt;
&lt;p&gt;下图是论文中「人工智能指标」构建的流程图&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-steps.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;我们将步骤分成四步&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Step1. 训练Word2Vec模型构建「人工智能AI词典」, 共54个词&lt;/li&gt;
&lt;li&gt;Step2. 统计上市公司 「年报」中AI词词频m，采用自然对数处理得到指标Ln(m+1)&lt;/li&gt;
&lt;li&gt;Step3. 统计上市公司「MD&amp;amp;A」数据中AI词词频n，采用自然对数处理得到指标Ln(n+1)&lt;/li&gt;
&lt;li&gt;Step4. 根据上市公司申请专利的名称和摘要是否含AI词，统计上市公司AI专利申请数量p，采用自然对数处理得到指标Ln(p+1)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;企业申请的人工智能专利代表企业已经拥有的人工智能技术，反映了企业人工智能技术的产出情况，能够与年报相互印证企业的人工智能技术水平&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;为了减轻阅读压力，也为了减轻制作本文的工作量， &lt;strong&gt;本文使用MD&amp;amp;A数据， 实现 Step1 、Step3(Step2、Step3算法相同)， 覆盖截图中的红色框范围内的计算方法&lt;/strong&gt;。&lt;/p&gt;
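Step2、Step3 中的词频指标可以用一个最小示例说明。其中 AI_WORDS 仅是假设的词典片段(论文词典共54个词)：

```python
import math

# 假设的「人工智能词典」片段(论文词典共54个词，此处仅作示意)
AI_WORDS = ['人工智能', '机器学习', '深度学习', '神经网络', '图像识别']

def ai_index(text):
    # 统计文本中AI词出现的总次数 m，返回 Ln(m+1)
    m = sum(text.count(w) for w in AI_WORDS)
    return math.log(m + 1)
```

文本中不含任何AI词时指标为0；加1取对数既避免了 log(0)，也压缩了词频的量纲。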
&lt;br&gt;
&lt;h3 id=&#34;13-项目结构&#34;&gt;1.3 项目结构&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 管理世界2024企业人工智能文件夹
    - 代码.ipynb                                    #代码文件
    
    - data                                         #数据文件夹
       - A01-23.csv.gz                             #年报
       - mda01-23.csv.gz                           #md&amp;amp;a
       - 上市公司基本信息2000-2023.csv                #基本信息
       
    - A股人工智能指标2001-2023(mda).xlsx              #计算结果
    
    - Word2Vec                                     #模型文件夹
       - mda01-23.200.6.bin
       - mda01-23.200.6.bin.syn1neg.npy
       - mda01-23.200.6.bin.wv.vectors.npy
       - 1000w专利摘要文本.100.6.bin
       - 1000w专利摘要文本.100.6.bin.syn1neg.npy
       - 1000w专利摘要文本.100.6.bin.wv.vectors.npy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二准备ai词典&#34;&gt;二、准备AI词典&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;构造专利摘要语料、管理层讨论与分析语料，分别训练Word2Vec模型&lt;/li&gt;
&lt;li&gt;构建人工智能种子词， 使用Word2Vec模型扩展并构建「人工智能词典」&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;h3 id=&#34;21-训练word2vec模型&#34;&gt;2.1 训练Word2Vec模型&lt;/h3&gt;
&lt;p&gt;刚好之前分享过使用cntext库(2.0以上版本)训练Word2Vec， 相关推文&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;词向量 | 使用MD&amp;amp;A2001-2023语料训练Word2Vec模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/&#34;&gt;词向量 | 使用1985年-2025年专利申请摘要训练word2vec模型&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;更多词向量资源，可点击 &lt;a href=&#34;https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings&#34;&gt;cntext库训练出的免费公开词向量&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-导入word2vec&#34;&gt;2.2 导入Word2Vec&lt;/h3&gt;
&lt;p&gt;以 mda01-23.200.6.bin 为例， 使用cntext2读取模型， cntext安装和使用请参考 &lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/&#34;&gt;文本分析库cntext2.x使用说明文档&lt;/a&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#pip3 install cntext  --upgrade&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#查看cntext版本&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;__version__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#导入管理层讨论与分析的Word2Vec模型&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_w2v_m&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word2Vec/mda01-23.200.6.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#导入专利摘要Word2Vec模型&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#pat_w2v_m = ct.load_w2v(&amp;#39;Word2Vec/1000w专利摘要文本.100.6.bin&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;mda_w2v_m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2.1.7

Loading word2vec model...
&amp;lt;gensim.models.word2vec.Word2Vec at 0x7dbf9afd0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;查看某个词的词向量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;mda_w2v_m.wv.get_vector(&amp;#39;人工智能&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([-3.8744571 , -0.5923845 , -1.8126943 ,  1.660894  ,  1.4194168 ,
        1.0365077 , -0.21333796, -0.60481924,  1.5012817 , -0.24060927,
       -1.7463511 , -2.1997519 , -0.66537315, -1.2665682 ,  0.14333063,
       -0.1268099 ,  2.005481  , -1.4638793 ,  3.7950375 ,  0.20866613,
        1.0281029 , -1.5495429 , -0.2518896 ,  1.4159175 ,  3.178865  ,
        .............................#省略展示..........................
       -1.2206184 ,  1.6766415 , -0.1082068 ,  0.62580353,  1.4639648 ,
        2.2743094 , -0.48386717,  1.3510187 ,  1.1698194 ,  0.72390413,
       -0.4855997 ,  1.0688399 ,  0.77217335, -1.4559731 ,  1.4391305 ,
        0.8412411 ,  2.359447  , -1.1504242 ,  1.3677332 , -0.92123735,
        1.281644  ,  0.67157453,  2.159804  ,  1.7593136 , -0.53061306,
       -0.77395666,  0.5912517 ,  1.9448034 ,  0.13023153,  0.6798518 ],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-扩展词典&#34;&gt;2.3 扩展词典&lt;/h3&gt;
&lt;p&gt;我们每个人对人工智能都有所了解，脑海里首先能想到的词可以当作「初始种子词」，例如 &lt;code&gt;人工智能|人机对话&lt;/code&gt; 等。本部分主要展示Word2Vec模型的近义词联想能力。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mda_w2v_m&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;most_similar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;人机对话&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;自然语言处理&amp;#39;, 0.8055953979492188),
 (&amp;#39;AI&amp;#39;, 0.8050345778465271),
 (&amp;#39;语音识别&amp;#39;, 0.804234504699707),
 (&amp;#39;NLP&amp;#39;, 0.7967724800109863),
 (&amp;#39;交互技术&amp;#39;, 0.7902386784553528),
 (&amp;#39;智能语音&amp;#39;, 0.7870553731918335),
 ..........#省略展示..........
 (&amp;#39;智能识别&amp;#39;, 0.6703209280967712),
 (&amp;#39;结合人工智能&amp;#39;, 0.6701650619506836),
 (&amp;#39;VR技术&amp;#39;, 0.6699633002281189),
 (&amp;#39;人工智能芯片&amp;#39;, 0.6690542101860046),
 (&amp;#39;人工智能数据分析&amp;#39;, 0.6689168214797974),
 (&amp;#39;AR技术&amp;#39;, 0.6688560843467712)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;之后可以用Word2Vec根据初始种子词进行扩充，再经过人工检查，最终构建「&lt;strong&gt;人工智能词典&lt;/strong&gt;」(论文附表3截图)，我将其整理为 &lt;em&gt;&lt;strong&gt;AI-Words&lt;/strong&gt;&lt;/em&gt;。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-ai-words.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
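词典扩展这一步可以概括为「按相似度阈值初筛 + 人工复核」。下面给出一个极简示意，其中函数名 filter_candidates 与阈值 0.67 均为本文演示假设，并非论文原代码：

```python
def filter_candidates(pairs, threshold=0.67):
    """pairs 为 most_similar 返回的 (词, 相似度) 列表；
    按假设阈值 threshold 初筛候选词，筛后的结果仍需人工检查。"""
    return [word for word, sim in pairs if sim >= threshold]

# 用上文 most_similar 输出的片段做演示
pairs = [('自然语言处理', 0.8056), ('AI', 0.8050), ('AR技术', 0.6689)]
print(filter_candidates(pairs))  # ['自然语言处理', 'AI']
```

阈值只是粗筛手段，最终进入词典的词以人工检查为准。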
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;AI_Words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;机器翻译|机器学习|计算机视觉|人机交互|深度学习|神经网络|生物识别|数据挖掘|特征识别|语音合成|语音识别|知识图谱|智慧银行|智能保险|人机协同|智能监管|智能教育|智能客服|智能零售|智能农业|智能投顾|增强现实|虚拟现实|智能医疗|智能语音|智能政务|自动驾驶|智能运输|卷积神经网络|声纹识别|特征提取|无人驾驶|人脸识别|商业智能|循环神经网络|大数据营销|大数据分析|大数据处理|支持向量机|长短期记忆|机器人流程|自然语言|分布式计算|可穿戴产品|大数据管理|智能传感器|模式识别|边缘计算|大数据平台|语音交互|智能环保|人机对话|深度神经网络|大数据运营&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;AI_Words&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三准备数据&#34;&gt;三、准备数据&lt;/h2&gt;
&lt;p&gt;为了保证数据质量，论文对样本进行了如下筛选操作：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1. 剔除金融行业公司；
2. 剔除信息传输、软件和信息技术服务业以及科学研究和技术服务行业，原因在于这些行业天生使用云计算、大数据以及人工智能技术并披露相关信息，可能无法清楚判断这些企业应用人工智能技术对其生产效率的影响；
3. 剔除当年处于ST和*ST状态的样本；
4. 剔除数据缺失的样本
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;大邓这里有几个数据文件，经过一些操作(字段名统一、整理会计年度、合并多源数据)，就能实现论文中的样本筛选。&lt;strong&gt;文末有数据获取方式&lt;/strong&gt;。&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;数据&lt;/th&gt;
&lt;th&gt;文件名&lt;/th&gt;
&lt;th&gt;所含字段&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;2001-2023年A股上市公司年报&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;A01-23.csv.gz&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;仅含&lt;em&gt;&lt;strong&gt;code&lt;/strong&gt;&lt;/em&gt; 、 &lt;em&gt;&lt;strong&gt;year&lt;/strong&gt;&lt;/em&gt; 、 &lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt; 三个字段&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;2001-2023年A股上市公司管理层讨论与分析&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;mda01-23.csv.gz&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;仅含&lt;em&gt;&lt;strong&gt;code&lt;/strong&gt;&lt;/em&gt; 、 &lt;em&gt;&lt;strong&gt;year&lt;/strong&gt;&lt;/em&gt; 、 &lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt; 三个字段&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;2000-2023年A股上市公司基本信息&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&lt;strong&gt;上市公司基本信息2000-2023.csv&lt;/strong&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;含&lt;em&gt;&lt;strong&gt;Symbol&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;FullName&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;ShortName&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;IndustryName&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;EndDate&lt;/strong&gt;&lt;/em&gt;等 39 个字段。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;字段含义&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[年报、管理层讨论与分析数据]
- year 会计年度
- text 年报文本 或 管理层讨论与分析文本
- code 股票代码

[A股基本信息]
- Symbol 股票代码
- ShortName 股票简称， 一般ST字符会出现在这里
- FullName 中文全称
- EndDate 统计截止日期
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h3 id=&#34;31-读取数据&#34;&gt;3.1 读取数据&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;2001-2023年A股上市公司管理层讨论与分析&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#读取数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/mda01-23.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#将year更改为字符串格式&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-mda-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;2000-2023年A股上市公司基本信息&lt;/a&gt; 含行业信息、公司简称(ST标记一般出现在简称里)等信息，可按条件筛选记录。同时，也要构造出 year、code 字段，方便后续与 mda_df 做交集并表。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/上市公司基本信息2000-2023.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Symbol&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;!=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-ind_df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-筛选样本&#34;&gt;3.2 筛选样本&lt;/h3&gt;
&lt;p&gt;为了保证数据质量，论文对样本进行了如下筛选操作：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1. 剔除金融行业公司；
2. 剔除信息传输、软件和信息技术服务业以及科学研究和技术服务行业，原因在于这些行业天生使用云计算、大数据以及人工智能技术并披露相关信息，可能无法清楚判断这些企业应用人工智能技术对其生产效率的影响；
3. 剔除当年处于ST和*ST状态的样本；
4. 剔除数据缺失的样本
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;筛选记录的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#行业筛选条件&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mask1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IndustryNameC&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;金融|信息|科学研究|技术服务&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#公司名筛选条件&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mask2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ShortName&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ST&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#剔除行业为金融、信息、科学研究、技术服务等上市公司&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#或&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#公司名含ST、*ST&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;~&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mask1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mask2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#将ind_df中年份、股票代码相关字段改名为【year】【code】，方便与 mda_df并表&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Symbol&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EndDate&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;FullName&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-ind_df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;以 &lt;em&gt;&lt;strong&gt;交集(inner)&lt;/strong&gt;&lt;/em&gt; 方式合并 &lt;em&gt;&lt;strong&gt;mda_df&lt;/strong&gt;&lt;/em&gt;  和  &lt;em&gt;&lt;strong&gt;ind_df&lt;/strong&gt;&lt;/em&gt;，  相当于剔除了mda数据中金融、信息、科学研究、技术服务、ST、&lt;code&gt;*ST&lt;/code&gt; 公司&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;mda_df2 = pd.merge(mda_df, ind_df, on=[&amp;#39;code&amp;#39;, &amp;#39;year&amp;#39;], how=&amp;#39;inner&amp;#39;)
mda_df2 = mda_df2[[&amp;#39;FullName&amp;#39;, &amp;#39;year&amp;#39;, &amp;#39;code&amp;#39;, &amp;#39;text&amp;#39;]]
mda_df2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-mda-df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;四测量ai指标&#34;&gt;四、测量AI指标&lt;/h2&gt;
&lt;p&gt;测量人工智能指标的代码比较简单：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;选中 &lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt;字段, 利用字符串属性 &lt;em&gt;&lt;strong&gt;.str.count()&lt;/strong&gt;&lt;/em&gt; 测量 &lt;em&gt;&lt;strong&gt;AI-Words&lt;/strong&gt;&lt;/em&gt; 出现次数，&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;np.log&lt;/strong&gt;&lt;/em&gt; 自然对数处理&lt;/li&gt;
&lt;li&gt;选择必要的字段&lt;em&gt;&lt;strong&gt;year&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;code&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/em&gt; 进行保存和展示&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#测量企业人工智能指数&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#计算结果保存为字段AI&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;AI&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;log&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mda_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;AI_Words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df3&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mda_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;AI&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#保存为csv/xlsx&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;A股人工智能指标2001-2023(mda).csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;A股人工智能指标2001-2023(mda).xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#展示结果&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-ai-index.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;五获取资料&#34;&gt;五、获取资料&lt;/h2&gt;
&lt;h3 id=&#34;51-免费说明&#34;&gt;5.1 免费说明&lt;/h3&gt;
&lt;p&gt;阅读是免费的，推文内的相关模型、安装包、数据需付费获取。&lt;/p&gt;
&lt;p&gt;今日推文最核心的Python代码只有2行，看到就赚到！本文要计算的是「&lt;em&gt;&lt;strong&gt;企业人工智能指数&lt;/strong&gt;&lt;/em&gt;」。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#AI相关词&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;AI_Words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;机器翻译|机器学习|计算机视觉|人机交互|深度学习|神经网络|生物识别|数据挖掘|特征识别|语音合成|语音识别|知识图谱|智慧银行|智能保险|人机协同|智能监管|智能教育|智能客服|智能零售|智能农业|智能投顾|增强现实|虚拟现实|智能医疗|智能语音|智能政务|自动驾驶|智能运输|卷积神经网络|声纹识别|特征提取|无人驾驶|人脸识别|商业智能|循环神经网络|大数据营销|大数据分析|大数据处理|支持向量机|长短期记忆|机器人流程|自然语言|分布式计算|可穿戴产品|大数据管理|智能传感器|模式识别|边缘计算|大数据平台|语音交互|智能环保|人机对话|深度神经网络|大数据运营&amp;#39;&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#企业人工智能指数，保存为字段AI&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;mda_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;AI&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;log&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mda_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;AI_Words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
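这两行的计算逻辑可以用一小段虚构文本自查。下面是一个极简示意，示例文本为虚构、词典仅取节选，只用来验证 str.count + np.log 的组合：

```python
import numpy as np
import pandas as pd

# 示例文本为虚构，词典仅取节选，仅用于验证计算逻辑
demo = pd.DataFrame({'text': ['公司布局机器学习与深度学习平台', '本年度无相关业务']})
ai_pat = '机器学习|深度学习|数据挖掘'
# 第一行文本命中2个AI词，指标为 log(2+1)；第二行命中0个，指标为 log(0+1)=0
demo['AI'] = np.log(demo['text'].str.count(ai_pat) + 1)
print(demo['AI'].round(4).tolist())  # [1.0986, 0.0]
```

str.count 按正则统计命中次数，所以词典各词之间用「|」分隔即可。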
&lt;h3 id=&#34;52-付费说明&#34;&gt;5.2 付费说明&lt;/h3&gt;
&lt;p&gt;内容整理不易， 想尽快复现本文的同学可以购买对应的数据、安装包、Word2Vec模型。加 &lt;em&gt;&lt;strong&gt;WeChat: 372335839&lt;/strong&gt;&lt;/em&gt; ， 备注 「&lt;strong&gt;姓名-学校-专业&lt;/strong&gt;」。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 100元
   - 管理层讨论与分析(mda01-23.csv.gz)、年报(A01-23.csv.gz)
   - 上市公司基本信息2000-2023.csv
   - A股人工智能指标2001-2023(mda).xlsx     #使用MD&amp;amp;A的计算结果
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;项目结构&#34;&gt;项目结构&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 管理世界2024企业人工智能文件夹
    - 代码.ipynb                                    #代码文件
    
    - data                                         #数据文件夹
       - A01-23.csv.gz                             #年报
       - mda01-23.csv.gz                           #md&amp;amp;a
       - 上市公司基本信息2000-2023.csv                #基本信息
       
    - A股人工智能指标2001-2023(mda).xlsx              #使用MD&amp;amp;A的计算结果
    
    - Word2Vec                                     #模型文件夹
       - mda01-23.200.6.bin
       - mda01-23.200.6.bin.syn1neg.npy
       - mda01-23.200.6.bin.wv.vectors.npy
       - 1000w专利摘要文本.100.6.bin
       - 1000w专利摘要文本.100.6.bin.syn1neg.npy
       - 1000w专利摘要文本.100.6.bin.wv.vectors.npy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容请阅读&#34;&gt;相关内容请阅读&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/&#34;&gt;文本分析库cntext2.x使用说明文档&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;数据集 | 2001-2023年A股上市公司年报&amp;amp;管理层讨论与分析&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/&#34;&gt;数据集 | 2000-2023年A股上市公司基本信息&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;词向量 | 使用MD&amp;amp;A2001-2022语料训练Word2Vec模型&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/&#34;&gt;词向量 | 使用1985年-2022年专利申请摘要训练word2vec模型&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一案例">一、案例</h2>
<h3 id="11-文献">1.1 文献</h3>
<p>姚加权, 张锟澎, 郭李鹏, 冯绪. 人工智能如何提升企业生产效率？——基于劳动力技能结构调整的视角[J]. 管理世界, 2024, 40 (02): 101-116+133+117-122.</p>
<p>摘要:人工智能技术对实现经济的高质量发展具有重要意义。现有研究多聚焦于人工智能对宏观经济的影响，本文从企业层面考察了人工智能技术如何影响生产效率和劳动力技能结构。<strong>本文运用机器学习方法生成了「人工智能词典」，并对上市公司的年报和专利进行「文本分析」，进而构建了企业层面的「人工智能指标」</strong>。研究发现，人工智能显著提升了中国上市公司的生产率，并且该结论在一系列稳健性检验后依旧成立。在影响机制方面，人工智能通过促使企业减少常规低技能劳动力需求、增加非常规高技能劳动力需求的方式提升企业的生产率，这体现了企业劳动力技能结构的调整。异质性分析表明，产权性质、人才获得方式、劳动力保障、治理结构等企业层面因素对人工智能的生产率效应有较大影响。此外，企业所处的行业和地区层面因素也影响了人工智能的生产率效应。最后，本文发现人工智能提高了企业价值。本文加深了对微观企业层面人工智能在生产过程中所扮演角色的认知和理解，并为在微观企业层面推动人工智能技术发展提供了建议。</p>
<br>
<h3 id="12-指标构建步骤">1.2 指标构建步骤</h3>
<p>下图是论文中「人工智能指标」构建的流程图</p>
<p><img loading="lazy" src="img/01-steps.png" alt=""  />
</p>
<p>我们将步骤分成四步</p>
<ul>
<li>Step1. 训练Word2Vec模型构建「人工智能AI词典」, 共54个词</li>
<li>Step2. 统计上市公司 「年报」中AI词词频m，采用自然对数处理得到指标Ln(m+1)</li>
<li>Step3. 统计上市公司「MD&amp;A」数据中AI词词频n，采用自然对数处理得到指标Ln(n+1)</li>
<li>Step4. 根据上市公司申请专利的名称和摘要是否含AI词，统计上市公司AI专利申请数量p，采用自然对数处理得到指标Ln(p+1)</li>
</ul>
<blockquote>
<p>企业申请的人工智能专利代表企业已经拥有的人工智能技术，反映了企业人工智能技术的产出情况，能够与年报相互印证企业的人工智能技术水平</p>
</blockquote>
<p>为了减轻阅读压力，也为了减轻制作本文的工作量，<strong>本文使用MD&amp;A数据，实现Step1、Step3(Step2、Step3算法相同)，覆盖截图中红色框范围内的计算方法</strong>。</p>
<br>
<h3 id="13-项目结构">1.3 项目结构</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 管理世界2024企业人工智能文件夹
    - 代码.ipynb                                    #代码文件
    
    - data                                         #数据文件夹
       - A01-23.csv.gz                             #年报
       - mda01-23.csv.gz                           #md&amp;a
       - 上市公司基本信息2000-2023.csv                #基本信息
       
    - A股人工智能指标2001-2023(mda).xlsx              #计算结果
    
    - Word2Vec                                     #模型文件夹
       - mda01-23.200.6.bin
       - mda01-23.200.6.bin.syn1neg.npy
       - mda01-23.200.6.bin.wv.vectors.npy
       - 1000w专利摘要文本.100.6.bin
       - 1000w专利摘要文本.100.6.bin.syn1neg.npy
       - 1000w专利摘要文本.100.6.bin.wv.vectors.npy
</code></pre></div><p><br><br></p>
<h2 id="二准备ai词典">二、准备AI词典</h2>
<ol>
<li>构造专利摘要语料、管理层讨论与分析语料，分别训练Word2Vec模型</li>
<li>构建人工智能种子词， 使用Word2Vec模型扩展并构建「人工智能词典」</li>
</ol>
<br>
<h3 id="21-训练word2vec模型">2.1 训练Word2Vec模型</h3>
<p>刚好之前分享过使用cntext库(2.0以上版本)训练Word2Vec， 相关推文</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用MD&amp;A2001-2023语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">词向量 | 使用1985年-2025年专利申请摘要训练word2vec模型</a></li>
</ul>
<p>更多词向量资源，可点击 <a href="https://github.com/hiDaDeng/Chinese-Pretrained-Word-Embeddings">cntext库训练出的免费公开词向量</a></p>
<br>
<h3 id="22-导入word2vec">2.2 导入Word2Vec</h3>
<p>以 mda01-23.200.6.bin 为例，使用cntext2读取模型。cntext安装和使用请参考 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/">文本分析库cntext2.x使用说明文档</a>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext  --upgrade</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1">#查看cntext版本</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>

<span class="c1">#导入管理层讨论与分析的Word2Vec模型</span>
<span class="n">mda_w2v_m</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;Word2Vec/mda01-23.200.6.bin&#39;</span><span class="p">)</span>
<span class="c1">#导入专利摘要Word2Vec模型</span>
<span class="c1">#pat_w2v_m = ct.load_w2v(&#39;Word2Vec/1000w专利摘要文本.100.6.bin&#39;)</span>

<span class="n">mda_w2v_m</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.7

Loading word2vec model...
&lt;gensim.models.word2vec.Word2Vec at 0x7dbf9afd0&gt;
</code></pre></div><br>
<p>查看某个词的词向量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mda_w2v_m.wv.get_vector(&#39;人工智能&#39;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-3.8744571 , -0.5923845 , -1.8126943 ,  1.660894  ,  1.4194168 ,
        1.0365077 , -0.21333796, -0.60481924,  1.5012817 , -0.24060927,
       -1.7463511 , -2.1997519 , -0.66537315, -1.2665682 ,  0.14333063,
       -0.1268099 ,  2.005481  , -1.4638793 ,  3.7950375 ,  0.20866613,
        1.0281029 , -1.5495429 , -0.2518896 ,  1.4159175 ,  3.178865  ,
        .............................#省略展示..........................
       -1.2206184 ,  1.6766415 , -0.1082068 ,  0.62580353,  1.4639648 ,
        2.2743094 , -0.48386717,  1.3510187 ,  1.1698194 ,  0.72390413,
       -0.4855997 ,  1.0688399 ,  0.77217335, -1.4559731 ,  1.4391305 ,
        0.8412411 ,  2.359447  , -1.1504242 ,  1.3677332 , -0.92123735,
        1.281644  ,  0.67157453,  2.159804  ,  1.7593136 , -0.53061306,
       -0.77395666,  0.5912517 ,  1.9448034 ,  0.13023153,  0.6798518 ],
      dtype=float32)
</code></pre></div><br>
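<p><em><strong>wv.get_vector</strong></em> 返回的是 200 维 float32 向量，词语之间的相似度本质上是向量夹角的余弦值。下面用 numpy 演示余弦相似度的计算（示意代码：v1、v2 为随机向量，并非真实模型输出）：</p>

```python
import numpy as np

def cosine_similarity(a, b):
    # 余弦相似度 = 点积 / (两个向量模长的乘积)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v1 = rng.normal(size=200).astype(np.float32)  # 模拟某个词的词向量
v2 = rng.normal(size=200).astype(np.float32)  # 模拟另一个词的词向量

print(round(cosine_similarity(v1, v1), 4))  # 自身相似度为 1.0
```

<p>gensim 的 <em><strong>wv.similarity</strong></em> 与 <em><strong>wv.most_similar</strong></em> 内部计算的就是这种余弦相似度。</p>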
<h3 id="23-扩展词典">2.3 扩展词典</h3>
<p>我们每个人对人工智能都有所了解，脑海里首先想到的词可以当做「初始种子词」，例如 <code>人工智能|人机对话</code> 等。本部分主要展示 Word2Vec 模型的近义词联想能力。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mda_w2v_m</span><span class="o">.</span><span class="n">wv</span><span class="o">.</span><span class="n">most_similar</span><span class="p">([</span><span class="s1">&#39;人工智能&#39;</span><span class="p">,</span> <span class="s1">&#39;人机对话&#39;</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;自然语言处理&#39;, 0.8055953979492188),
 (&#39;AI&#39;, 0.8050345778465271),
 (&#39;语音识别&#39;, 0.804234504699707),
 (&#39;NLP&#39;, 0.7967724800109863),
 (&#39;交互技术&#39;, 0.7902386784553528),
 (&#39;智能语音&#39;, 0.7870553731918335),
 ..........#省略展示..........
 (&#39;智能识别&#39;, 0.6703209280967712),
 (&#39;结合人工智能&#39;, 0.6701650619506836),
 (&#39;VR技术&#39;, 0.6699633002281189),
 (&#39;人工智能芯片&#39;, 0.6690542101860046),
 (&#39;人工智能数据分析&#39;, 0.6689168214797974),
 (&#39;AR技术&#39;, 0.6688560843467712)]
</code></pre></div><p><br>之后Word2Vec可以根据初始种子词进行扩充，再经过人工检查，最终构建「<strong>人工智能词典</strong>」(论文附表3截图), 我将其整理为 <em><strong>AI-Words</strong></em></p>
<p><img loading="lazy" src="img/02-ai-words.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">AI_Words</span> <span class="o">=</span> <span class="s1">&#39;机器翻译|机器学习|计算机视觉|人机交互|深度学习|神经网络|生物识别|数据挖掘|特征识别|语音合成|语音识别|知识图谱|智慧银行|智能保险|人机协同|智能监管|智能教育|智能客服|智能零售|智能农业|智能投顾|增强现实|虚拟现实|智能医疗|智能语音|智能政务|自动驾驶|智能运输|卷积神经网络|声纹识别|特征提取|无人驾驶|人脸识别|商业智能|循环神经网络|大数据营销|大数据分析|大数据处理|支持向量机|长短期记忆|机器人流程|自然语言|分布式计算|可穿戴产品|大数据管理|智能传感器|模式识别|边缘计算|大数据平台|语音交互|智能环保|人机对话|深度神经网络|大数据运营&#39;</span>
<span class="n">AI_Words</span>
</code></pre></div><p><br><br></p>
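<p>从 <em><strong>most_similar</strong></em> 的返回结果扩展词典，本质上是按相似度阈值筛选候选词、再与种子词合并去重。下面是一个小示意（sims 为假设的返回片段，0.67 为假设的阈值，实际应结合人工检查调整）：</p>

```python
# 假设的 most_similar 返回片段：(词语, 相似度) 列表
sims = [('自然语言处理', 0.806), ('AI', 0.805), ('语音识别', 0.804),
        ('AR技术', 0.669), ('体育赛事', 0.310)]

seed_words = ['人工智能', '人机对话']
threshold = 0.67  # 假设的相似度阈值

# 按阈值筛掉弱相关词，再与种子词合并去重，拼成正则模式
candidates = [w for w, s in sims if s >= threshold]
ai_pattern = '|'.join(dict.fromkeys(seed_words + candidates))
print(ai_pattern)  # 人工智能|人机对话|自然语言处理|AI|语音识别
```

<p>人工检查这一步不可省略：模型联想出的词可能与研究概念无关，需要逐个确认后才能进入最终词典。</p>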
<h2 id="三准备数据">三、准备数据</h2>
<p>为了保证数据质量，论文对样本进行了以下筛选操作：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1. 剔除金融行业公司；
2. 剔除信息传输、软件和信息技术服务业以及科学研究和技术服务行业，原因在于这些行业天生使用云计算、大数据以及人工智能技术并披露相关信息，可能无法清楚判断这些企业应用人工智能技术对其生产效率的影响；
3. 剔除当年处于 ST 和 *ST 状态的样本；
4. 剔除数据缺失的样本。
</code></pre></div><br>
<p>大邓这里有几个数据文件，经过一些操作(字段名统一、整理会计年度、合并多源数据)，就能实现论文中的样本筛选。<strong>文末有数据获取方式</strong>。</p>
<table>
<thead>
<tr>
<th>数据</th>
<th>文件名</th>
<th>所含字段</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">2001-2023年A股上市公司年报</a></td>
<td><em><strong>A01-23.csv.gz</strong></em></td>
<td>仅含<em><strong>code</strong></em> 、 <em><strong>year</strong></em> 、 <em><strong>text</strong></em> 三个字段</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">2001-2023年A股上市公司管理层讨论与分析</a></td>
<td><em><strong>mda01-23.csv.gz</strong></em></td>
<td>仅含<em><strong>code</strong></em> 、 <em><strong>year</strong></em> 、 <em><strong>text</strong></em> 三个字段</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/">2000-2023年A股上市公司基本信息</a></td>
<td><em><strong>上市公司基本信息2000-2023.csv</strong></em></td>
<td>含<em><strong>Symbol</strong></em>、<em><strong>FullName</strong></em>、<em><strong>ShortName</strong></em>、<em><strong>IndustryName</strong></em>、<em><strong>EndDate</strong></em>等 39 个字段。</td>
</tr>
</tbody>
</table>
<p>字段含义</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[年报、管理层讨论与分析数据]
- year 会计年度
- text 年报文本 或 管理层讨论与分析文本
- code 股票代码

[A股基本信息]
- Symbol 股票代码
- ShortName 股票简称， 一般ST字符会出现在这里
- FullName 中文全称
- EndDate 统计截止日期
</code></pre></div><br>
<br>
<h3 id="31-读取数据">3.1 读取数据</h3>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">2001-2023年A股上市公司管理层讨论与分析</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#读取数据</span>
<span class="n">mda_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="c1">#将year更改为字符串格式</span>
<span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="n">mda_df</span>
</code></pre></div><p><img loading="lazy" src="img/03-mda-df.png" alt=""  />
</p>
<br>
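<p>并表能否对得上，关键是 year、code 两个键的类型和格式一致。若 read_csv 把 code 解析成整数，前导 0 会丢失（如 000001 变成 1）。下面用虚构的小数据演示统一格式的做法：</p>

```python
import pandas as pd

# 示意数据：code 被解析为整数后丢失了前导 0
df = pd.DataFrame({'code': [1, 600519], 'year': [2021, 2021]})

# 统一为 6 位零填充字符串，year 统一为字符串，保证并表键类型一致
df['code'] = df['code'].astype(str).str.zfill(6)
df['year'] = df['year'].astype(str)

print(df['code'].tolist())  # ['000001', '600519']
```

<p>两张表都做这样的规范化之后，再用 <em><strong>pd.merge</strong></em> 并表就不会因为键类型不同而丢失记录。</p>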
<p><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/">2000-2023年A股上市公司基本信息</a> 含行业信息、公司简称(ST 标记一般出现在这里)等信息，可以按条件筛选记录。同时，也要构造出 year、code 字段，方便后续与 mda_df 做交集并表。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ind_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.csv&#39;</span><span class="p">)</span>
<span class="n">ind_df</span> <span class="o">=</span> <span class="n">ind_df</span><span class="p">[</span><span class="n">ind_df</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">]</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_df</span>
</code></pre></div><p><img loading="lazy" src="img/04-ind_df.png" alt=""  />
</p>
<br>
<h3 id="32-筛选样本">3.2 筛选样本</h3>
<p>为了保证数据质量，论文对样本进行了以下筛选操作：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1. 剔除金融行业公司；
2. 剔除信息传输、软件和信息技术服务业以及科学研究和技术服务行业，原因在于这些行业天生使用云计算、大数据以及人工智能技术并披露相关信息，可能无法清楚判断这些企业应用人工智能技术对其生产效率的影响；
3. 剔除当年处于 ST 和 ``*ST`` 状态的样本；
4. 剔除数据缺失的样本。
</code></pre></div><br>
<p>筛选记录的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#行业筛选条件</span>
<span class="n">mask1</span> <span class="o">=</span> <span class="n">ind_df</span><span class="o">.</span><span class="n">IndustryName</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;金融|信息|科学研究|技术服务&#39;</span><span class="p">)</span>
<span class="c1">#公司名筛选条件</span>
<span class="n">mask2</span> <span class="o">=</span> <span class="n">ind_df</span><span class="o">.</span><span class="n">ShortName</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;ST&#39;</span><span class="p">)</span>

<span class="c1">#剔除行业为金融、信息、科学研究、技术服务等上市公司</span>
<span class="c1">#或</span>
<span class="c1">#公司名含ST、*ST</span>
<span class="n">ind_df</span> <span class="o">=</span> <span class="n">ind_df</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">mask1</span> <span class="o">|</span> <span class="n">mask2</span><span class="p">)]</span>

<span class="c1">#将ind_df中年份、股票代码相关字段改名为【year】【code】，方便与 mda_df并表</span>
<span class="n">ind_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;code&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">date</span><span class="p">)[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_df</span> <span class="o">=</span> <span class="n">ind_df</span><span class="p">[[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;code&#39;</span><span class="p">,</span> <span class="s1">&#39;FullName&#39;</span><span class="p">]]</span>
<span class="n">ind_df</span>
</code></pre></div><p><img loading="lazy" src="img/05-ind_df.png" alt=""  />
</p>
<br>
<p>以 <em><strong>交集(inner)</strong></em> 方式合并 <em><strong>mda_df</strong></em>  和  <em><strong>ind_df</strong></em>，  相当于剔除了mda数据中金融、信息、科学研究、技术服务、ST、<code>*ST</code> 公司</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mda_df2 = pd.merge(mda_df, ind_df, on=[&#39;code&#39;, &#39;year&#39;], how=&#39;inner&#39;)
mda_df2 = mda_df2[[&#39;FullName&#39;, &#39;year&#39;, &#39;code&#39;, &#39;text&#39;]]
mda_df2
</code></pre></div><p><img loading="lazy" src="img/06-mda-df2.png" alt=""  />
</p>
<br>
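<p>并表前可以先用 how=&#39;left&#39; 加 indicator=True 做一次「体检」，看看有多少 mda 记录在基本信息表中找不到匹配，这些记录在 inner 并表时会被剔除（示意数据为虚构）：</p>

```python
import pandas as pd

mda = pd.DataFrame({'code': ['000001', '000002'],
                    'year': ['2021', '2021'],
                    'text': ['文本A', '文本B']})
ind = pd.DataFrame({'code': ['000001'],
                    'year': ['2021'],
                    'FullName': ['某公司']})

# indicator=True 会生成 _merge 字段，标记每行的匹配情况
check = pd.merge(mda, ind, on=['code', 'year'], how='left', indicator=True)
print(check['_merge'].tolist())  # ['both', 'left_only']
```

<p>left_only 的行数就是 inner 并表会损失的样本量，便于事先评估样本筛选的影响。</p>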
<h2 id="四测量ai指标">四、测量AI指标</h2>
<p>测量人工智能指标的代码比较简单：</p>
<ol>
<li>选中 <em><strong>text</strong></em>字段, 利用字符串属性 <em><strong>.str.count()</strong></em> 测量 <em><strong>AI-Words</strong></em> 出现次数，</li>
<li><em><strong>np.log</strong></em> 自然对数处理</li>
<li>选择必要的字段<em><strong>year</strong></em>、<em><strong>code</strong></em>、<em><strong>AI</strong></em> 进行保存和展示</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1">#测量企业人工智能指数</span>
<span class="c1">#计算结果保存为字段AI</span>
<span class="n">mda_df2</span><span class="p">[</span><span class="s1">&#39;AI&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">mda_df2</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">AI_Words</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">mda_df3</span> <span class="o">=</span> <span class="n">mda_df2</span><span class="p">[[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;code&#39;</span><span class="p">,</span> <span class="s1">&#39;AI&#39;</span><span class="p">]]</span>

<span class="c1">#保存为csv/xlsx</span>
<span class="n">mda_df3</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;A股人工智能指标2001-2023(mda).csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">mda_df3</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;A股人工智能指标2001-2023(mda).xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="c1">#展示结果</span>
<span class="n">mda_df3</span>
</code></pre></div><p><img loading="lazy" src="img/07-ai-index.png" alt=""  />
</p>
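<p>得到 mda_df3 这样的面板数据后，可以按年度汇总，粗看企业人工智能指数的整体走势（示意数据为虚构）：</p>

```python
import pandas as pd

mda_df3 = pd.DataFrame({'year': ['2021', '2021', '2022'],
                        'code': ['000001', '000002', '000001'],
                        'AI': [0.0, 1.10, 2.30]})

# 按年度求均值，观察 AI 指数的时间趋势
yearly = mda_df3.groupby('year')['AI'].mean()
print(yearly.round(2).to_dict())  # {'2021': 0.55, '2022': 2.3}
```

<p>真实数据上，这条年度均值曲线通常能直观反映上市公司人工智能披露的逐年升温。</p>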
<br>
<br>
<h2 id="五获取资料">五、获取资料</h2>
<h3 id="51-免费说明">5.1 免费说明</h3>
<p>阅读是免费的；推文内的相关模型、安装包、数据需付费获取。</p>
<p>今日推文要计算「<em><strong>企业人工智能指数</strong></em>」，最核心的 Python 代码只有 2 行，看到就赚到！</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#AI相关词</span>
<span class="n">AI_Words</span> <span class="o">=</span> <span class="s1">&#39;机器翻译|机器学习|计算机视觉|人机交互|深度学习|神经网络|生物识别|数据挖掘|特征识别|语音合成|语音识别|知识图谱|智慧银行|智能保险|人机协同|智能监管|智能教育|智能客服|智能零售|智能农业|智能投顾|增强现实|虚拟现实|智能医疗|智能语音|智能政务|自动驾驶|智能运输|卷积神经网络|声纹识别|特征提取|无人驾驶|人脸识别|商业智能|循环神经网络|大数据营销|大数据分析|大数据处理|支持向量机|长短期记忆|机器人流程|自然语言|分布式计算|可穿戴产品|大数据管理|智能传感器|模式识别|边缘计算|大数据平台|语音交互|智能环保|人机对话|深度神经网络|大数据运营&#39;</span>

<span class="c1">#企业人工智能指数，保存为字段AI</span>
<span class="n">mda_df2</span><span class="p">[</span><span class="s1">&#39;AI&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">mda_df2</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">AI_Words</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div><br>
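<p>这两行代码的效果可以在玩具数据上验证：<em><strong>.str.count</strong></em> 把 AI_Words 当作正则模式统计命中次数，<em><strong>np.log(x+1)</strong></em> 避免对 0 取对数（以下只取词典片段做演示）：</p>

```python
import numpy as np
import pandas as pd

AI_Words = '机器学习|深度学习|人脸识别'  # 仅取词典片段做演示

texts = pd.Series(['公司布局机器学习与深度学习', '传统制造业务'])
counts = texts.str.count(AI_Words)   # 每条文本命中 AI 词的次数
ai_index = np.log(counts + 1)        # 加 1 避免 log(0)

print(counts.tolist())  # [2, 0]
print(ai_index.round(4).tolist())  # [1.0986, 0.0]
```

<p>没有命中任何 AI 词的文本，指数为 log(1)=0，符合「未披露即为 0」的直觉。</p>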
<h3 id="52-付费说明">5.2 付费说明</h3>
<p>内容整理不易， 想尽快复现本文的同学可以购买对应的数据、安装包、Word2Vec模型。加 <em><strong>WeChat: 372335839</strong></em> ， 备注 「<strong>姓名-学校-专业</strong>」。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 100元
   - 管理层讨论与分析(mda01-23.csv.gz)、年报(A01-23.csv.gz)
   - 上市公司基本信息2000-2023.csv
   - A股人工智能指标2001-2023(mda).xlsx     #使用MD&amp;A的计算结果
</code></pre></div><br>
<h3 id="项目结构">项目结构</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 管理世界2024企业人工智能文件夹
    - 代码.ipynb                                    #代码文件
    
    - data                                         #数据文件夹
       - A01-23.csv.gz                             #年报
       - mda01-23.csv.gz                           #md&amp;a
       - 上市公司基本信息2000-2023.csv                #基本信息
       
    - A股人工智能指标2001-2023(mda).xlsx              #使用MD&amp;A的计算结果
    
    - Word2Vec                                     #模型文件夹
       - mda01-23.200.6.bin
       - mda01-23.200.6.bin.syn1neg.npy
       - mda01-23.200.6.bin.wv.vectors.npy
       - 1000w专利摘要文本.100.6.bin
       - 1000w专利摘要文本.100.6.bin.syn1neg.npy
       - 1000w专利摘要文本.100.6.bin.wv.vectors.npy
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="相关内容请阅读">相关内容请阅读</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-tutorial/">文本分析库cntext2.x使用说明文档</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001-2023年A股上市公司年报&amp;管理层讨论与分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/">数据集 | 2000-2023年A股上市公司基本信息</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用MD&amp;A2001-2022语料训练Word2Vec模型</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-10-training-word2vec-model-using-china-3751w-patent-application-dataset/">词向量 | 使用1985年-2022年专利申请摘要训练word2vec模型</a></p>
</li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用 MD&amp;A文本测量「企业不确定性感知FEPU」</title>
      <link>https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/</link>
      <pubDate>Thu, 25 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-25-firm-economic-policy-uncertainty/</guid>
      <description>&lt;p&gt;本文使用的缩写&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;EPU&lt;/strong&gt;&lt;/em&gt;   经济政策不确定性(Economic Policy Uncertainty)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;FEPU&lt;/strong&gt;&lt;/em&gt; 企业不确定性感知( Subjective perception of economic policy uncertainty)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一背景&#34;&gt;一、背景&lt;/h2&gt;
&lt;p&gt;「&lt;em&gt;&lt;strong&gt;经济政策不确定性&lt;/strong&gt;&lt;/em&gt;(EPU)」 通常是用来衡量经济中政策不确定性水平的一种度量方式。企业作为一个理性的经济主体， 需要根据未来的期望成本和收益进行决策 。政府的经济政策会在很大程度上影响企业的预期成本和收益 ， 如果经济政策频繁变化 ， 会给企业带来困扰 。现有文献经济政策不确定性测量思路大概有&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;股票市场隐含波动率VIX衡量宏观层面经济不确定性。&lt;/li&gt;
&lt;li&gt;利用外生变量，并结合企业对这些外生变量的依赖程度衡量企业面临的不确定性 。如政治事件、能源价格、汇率波动、贸易协定签订。&lt;/li&gt;
&lt;li&gt;利用新闻文本测量的经济不确定性。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;但 经济政策不确定性指标(EPU)存在两个问题&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;EPU是宏观指标， 同期所有企业的EPU有且仅有一个观测值。&lt;/li&gt;
&lt;li&gt;EPU默认所有企业是同质， 对经济政策不确定性的感知是相同的。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;本推文参考聂辉华等(2020)内的算法,  实现利用 &lt;em&gt;&lt;strong&gt;经营讨论与分析(MD&amp;amp;A)文本数据&lt;/strong&gt;&lt;/em&gt;  测量企业「&lt;em&gt;&lt;strong&gt;企业不确定性感知FEPU&lt;/strong&gt;&lt;/em&gt;」(FEPU,  Subjective perception of economic policy uncertainty) 。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二epufepu&#34;&gt;二、EPU&amp;amp;FEPU&lt;/h2&gt;
&lt;h3 id=&#34;21-epu&#34;&gt;2.1 EPU&lt;/h3&gt;
&lt;p&gt;在复现「&lt;em&gt;&lt;strong&gt;企业不确定性感知FEPU&lt;/strong&gt;&lt;/em&gt;」前，我们先了解利用新闻数据测量 &lt;em&gt;&lt;strong&gt;EPU&lt;/strong&gt;&lt;/em&gt; 的算法，这样更容易理解 &lt;em&gt;&lt;strong&gt;FEPU&lt;/strong&gt;&lt;/em&gt; 的原理。参考 Huang, Yun, and Paul Luk（2020），大邓在前段时间分享了一个代码教程 &lt;a href=&#34;https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/&#34;&gt;代码 | 使用「新闻数据」计算 「经济政策不确定性」指数&lt;/a&gt; 。 &lt;br&gt;&lt;/p&gt;
&lt;p&gt;新闻数据计算EPU的算法&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Step-1. 选择了114家中国大陆的报纸，其中包括北京、上海、广州和天津等主要城市的报纸。
Step-2. 对于每家报纸，搜索包含以下三个关键词之一的文章：经济、不确定性和政策。这些关键词的中文和英文对照可以在论文的表格1中找到。
Step-3. 将每个月的文章数量按照满足第一个关键词的文章数量进行缩放。
Step-4. 将时间序列标准化，使其在2000年1月至2011年12月期间的标准差为1。 保证所有媒体计算得到的epu是可比的。
Step-5. 对十家报纸的月度序列进行简单平均，并将指标归一化，使其在2000年1月至2011年12月期间的平均值为100。
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;文献中的算法描述较长、结构化不足，理解起来需要一些脑力。大邓换一种描述方式：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;EPU_t = m/n

- m  时期 t 同时含经济Economic、政策Policy、不确定Uncertainty三类词的新闻条数m
- n  时期 t 总的新闻条数n
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
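上面的 EPU_t = m/n 可以用几行 pandas 代码示意。以下为示意代码：news 是虚构的小样本，patterns 只是 E、P、U 三类词表的片段：

```python
import re
import pandas as pd

# 虚构的新闻小样本：date 为月份，text 为新闻正文
news = pd.DataFrame({
    'date': ['2023-01', '2023-01', '2023-02'],
    'text': ['经济政策走向存在不确定性', '本地体育赛事报道', '经济增长平稳'],
})

# E、P、U 三类词表片段
patterns = ['经济', '政策', '不确定']

def hits_all(text):
    # 同时命中 E、P、U 三类词的新闻才计入分子 m
    return all(re.search(p, text) for p in patterns)

news['epu_hit'] = news['text'].apply(hits_all)
# 每个时期的 EPU = m / n：命中条数除以该期新闻总条数
epu = news.groupby('date')['epu_hit'].mean()
print(epu.to_dict())  # {'2023-01': 0.5, '2023-02': 0.0}
```

对布尔列按时期求均值，得到的正是 m/n 的时间序列。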
&lt;h3 id=&#34;22-fepu&#34;&gt;2.2 FEPU&lt;/h3&gt;
&lt;p&gt;理解了 EPU， 就能类比理解「&lt;em&gt;&lt;strong&gt;企业不确定性感知FEPU&lt;/strong&gt;&lt;/em&gt;」的算法。&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;算法&lt;/th&gt;
&lt;th&gt;数据&lt;/th&gt;
&lt;th&gt;层次&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;m&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EPU&lt;/td&gt;
&lt;td&gt;新闻媒体文本&lt;/td&gt;
&lt;td&gt;新闻&lt;/td&gt;
&lt;td&gt;时期t新闻总条数n&lt;/td&gt;
&lt;td&gt;时期t同时存在E、P、U三类词的新闻条数m&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FEPU(word)&lt;/td&gt;
&lt;td&gt;管理层讨论与分析(md&amp;amp;a)&lt;/td&gt;
&lt;td&gt;词语&lt;/td&gt;
&lt;td&gt;将时期t的企业i的 md&amp;amp;a 文本词语个数n。&lt;/td&gt;
&lt;td&gt;1. 对md&amp;amp;a进行分句&lt;br/&gt;2. 同时含EP、U两类词的句子中， 统计这些句子中 U 的词语出现次数之和m&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FEPU(sentence)&lt;/td&gt;
&lt;td&gt;管理层讨论与分析(md&amp;amp;a)&lt;/td&gt;
&lt;td&gt;句子&lt;/td&gt;
&lt;td&gt;将时期t的企业i的 md&amp;amp;a 文本进行分句，得到句子个数n&lt;/td&gt;
&lt;td&gt;1. 对md&amp;amp;a进行分句&lt;br/&gt;2. 同时含EP、U两类词的句子中， 统计这类句子个数m&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三-准备cntext&#34;&gt;三、 准备cntext&lt;/h2&gt;
&lt;p&gt;EPU 和 FEPU 于今日刚刚封装到 cntext 中， 再计算这两个指数， 就变得容易多了。&lt;/p&gt;
&lt;h3 id=&#34;31-安装cntext&#34;&gt;3.1 安装cntext&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install pdfdocx
pip3 install distinctiveness
pip3 install pandarallel
pip3 install cntext
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;32-内置词典&#34;&gt;3.2 内置词典&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;内置文件&lt;/th&gt;
&lt;th&gt;词典&lt;/th&gt;
&lt;th&gt;参考文献&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;zh_common_EPU.yaml&lt;/td&gt;
&lt;td&gt;经济E、政策P、不确定U&lt;/td&gt;
&lt;td&gt;Huang, Yun, and Paul Luk（2020）&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zh_common_FEPU.yaml&lt;/td&gt;
&lt;td&gt;经济政策EP、不确定性U&lt;/td&gt;
&lt;td&gt;聂辉华, 阮睿&amp;amp;沈吉（2020）&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h4 id=&#34;31-查看内置词典&#34;&gt;3.2.1 查看内置词典&lt;/h4&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;__version__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_dict_list&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2.1.2

[&amp;#39;zh_common_NTUSD.yaml&amp;#39;,
 &amp;#39;zh_common_DUTIR.yaml&amp;#39;,
 &amp;#39;enzh_common_StopWords.yaml&amp;#39;,
 &amp;#39;en_valence_Concreteness.yaml&amp;#39;,
 &amp;#39;en_common_LoughranMcDonald.yaml&amp;#39;,
 &amp;#39;zh_common_FinanceSenti.yaml&amp;#39;,
 &amp;#39;zh_common_TsinghuaPraiseDegrade.yaml&amp;#39;,
 &amp;#39;zh_common_FEPU.yaml&amp;#39;,    聂辉华, 阮睿&amp;amp;沈吉（2020）
 &amp;#39;en_common_ANEW.yaml&amp;#39;,
 &amp;#39;en_common_NRC.yaml&amp;#39;,
 &amp;#39;zh_valence_ChineseEmoBank.yaml&amp;#39;,
 &amp;#39;zh_valence_SixSemanticDimensionDatabase.yaml&amp;#39;,
 &amp;#39;zh_common_FinacialFormalUnformal.yaml&amp;#39;,
 &amp;#39;zh_common_LoughranMcDonald.yaml&amp;#39;,
 &amp;#39;enzh_common_AdvConj.yaml&amp;#39;,
 &amp;#39;en_common_SentiWS.yaml&amp;#39;,
 &amp;#39;zh_common_Digitalization.yaml&amp;#39;,
 &amp;#39;en_common_LSD2015.yaml&amp;#39;,
 &amp;#39;zh_common_HowNet.yaml&amp;#39;,
 &amp;#39;zh_common_EPU.yaml&amp;#39;]      #Huang, Yun, and Paul Luk（2020）
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h4 id=&#34;312-导入词典&#34;&gt;3.2.2 导入词典&lt;/h4&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;FEPU_infos&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_yaml_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;zh_common_FEPU.yaml&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FEPU_infos&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;Name&amp;#39;: &amp;#39;中文经济政策不确定性词典&amp;#39;, 
&amp;#39;Desc&amp;#39;: &amp;#39;中文经济政策不确定性词典, 含经济政策EconomicPolicy、不确定性Uncertainty两个词表&amp;#39;, 
&amp;#39;Refer&amp;#39;: &amp;#39;聂辉华, 阮睿, 沈吉. 企业不确定性感知、投资决策和金融资产配置[J]. 世界经济, 2020, 43 (06): 77-98.&amp;#39;, 
&amp;#39;Category&amp;#39;: [&amp;#39;经济政策&amp;#39;, &amp;#39;不确定&amp;#39;], 
&amp;#39;Dictionary&amp;#39;: 
    {&amp;#39;经济政策&amp;#39;: [&amp;#39;市政&amp;#39;, &amp;#39;政策&amp;#39;, &amp;#39;货币政策&amp;#39;, &amp;#39;政策鼓励&amp;#39;, &amp;#39;国家&amp;#39;, &amp;#39;扩内需&amp;#39;, &amp;#39;保增长&amp;#39;, &amp;#39;促发展&amp;#39;, &amp;#39;产业发展&amp;#39;, &amp;#39;法律&amp;#39;, &amp;#39;法规&amp;#39;, &amp;#39;行业政策&amp;#39;, &amp;#39;产业政策&amp;#39;, &amp;#39;宏观政策&amp;#39;, &amp;#39;国民经济&amp;#39;, &amp;#39;有关部门&amp;#39;, &amp;#39;产业结构调整&amp;#39;, &amp;#39;产业结构&amp;#39;, &amp;#39;当地政府&amp;#39;, &amp;#39;政府&amp;#39;, &amp;#39;经济政策&amp;#39;, &amp;#39;经济走势&amp;#39;, &amp;#39;所得税&amp;#39;, &amp;#39;税收减免&amp;#39;, &amp;#39;刺激政策&amp;#39;, &amp;#39;限贷令&amp;#39;, &amp;#39;限购令&amp;#39;, &amp;#39;保障房&amp;#39;, &amp;#39;宏观调控&amp;#39;, &amp;#39;产业发展&amp;#39;, &amp;#39;证监会&amp;#39;, &amp;#39;国家政策&amp;#39;, &amp;#39;政治&amp;#39;, &amp;#39;军事&amp;#39;, &amp;#39;政策环境&amp;#39;, &amp;#39;宏观&amp;#39;, &amp;#39;政府补助政策&amp;#39;, &amp;#39;调控政策&amp;#39;, &amp;#39;税收政策&amp;#39;, &amp;#39;政策扶持&amp;#39;], 
    &amp;#39;不确定&amp;#39;: [&amp;#39;风险&amp;#39;, &amp;#39;经营风险&amp;#39;, &amp;#39;市场风险&amp;#39;, &amp;#39;信用风险&amp;#39;, &amp;#39;不确定&amp;#39;, &amp;#39;波动&amp;#39;, &amp;#39;变化&amp;#39;, &amp;#39;改变&amp;#39;, &amp;#39;徘徊&amp;#39;, &amp;#39;不稳&amp;#39;, &amp;#39;不稳定&amp;#39;, &amp;#39;不寻常&amp;#39;, &amp;#39;错综复杂&amp;#39;, &amp;#39;非常复杂&amp;#39;, &amp;#39;纷繁复杂&amp;#39;, &amp;#39;纷纭复杂&amp;#39;, &amp;#39;十分复杂&amp;#39;, &amp;#39;变得复杂&amp;#39;, &amp;#39;风云突变&amp;#39;, &amp;#39;矛盾突出&amp;#39;, &amp;#39;突变&amp;#39;, &amp;#39;复杂多变&amp;#39;, &amp;#39;诡谲多变&amp;#39;, &amp;#39;阵痛&amp;#39;, &amp;#39;过渡&amp;#39;, &amp;#39;问责&amp;#39;, &amp;#39;整顿&amp;#39;, &amp;#39;危险&amp;#39;, &amp;#39;动荡&amp;#39;, &amp;#39;多变性&amp;#39;, &amp;#39;震荡&amp;#39;, &amp;#39;难以确定&amp;#39;, &amp;#39;难以预测&amp;#39;, &amp;#39;难以语料&amp;#39;, &amp;#39;难以琢磨&amp;#39;, &amp;#39;难以捉摸&amp;#39;, &amp;#39;接受考验&amp;#39;, &amp;#39;混乱&amp;#39;, &amp;#39;时而&amp;#39;, &amp;#39;随机&amp;#39;]}
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;33-内置函数&#34;&gt;3.3 内置函数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ct.epu(df,  freq=&amp;#39;Y&amp;#39;,e_pattern=&amp;#39;&amp;#39;, p_pattern=&amp;#39;&amp;#39;, u_pattern=&amp;#39;&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;df&lt;/strong&gt;&lt;/em&gt;  新闻DataFrame；  DataFrame必须含date和text两个字段；每行一条记录，含所有时期所有的新闻。&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;freq&lt;/strong&gt;&lt;/em&gt; 字符串；决定EPU的时间粒度， 年Y、月M、天D， 默认freq=&amp;lsquo;Y&amp;rsquo;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;e_pattern&lt;/strong&gt;&lt;/em&gt;  字符串；经济类词典，用&lt;code&gt;|&lt;/code&gt;间隔词语，形如 &lt;strong&gt;e_pattern = &amp;lsquo;经济|金融&amp;rsquo;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;p_pattern&lt;/strong&gt;&lt;/em&gt;  字符串；政策词典，用&lt;code&gt;|&lt;/code&gt;间隔词语，形如 &lt;strong&gt;p_pattern = &amp;lsquo;政策|治理|行政&amp;rsquo;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;u_pattern&lt;/strong&gt;&lt;/em&gt; 字符串；不确定性词典，用&lt;code&gt;|&lt;/code&gt;间隔词语，形如 &lt;strong&gt;u_pattern = &amp;lsquo;风险|危机|难以预测&amp;rsquo;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;返回epu时间序列数据，格式为DataFrame&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;ct.fepu(text,  ep_pattern=&amp;#39;&amp;#39;, u_pattern=&amp;#39;&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt;  ；某时期t某企业i的管理层讨论与分析md&amp;amp;a文本&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;ep_pattern&lt;/strong&gt;&lt;/em&gt;  字符串；经济政策类词典，用&lt;code&gt;|&lt;/code&gt;间隔词语，形如 &lt;strong&gt;ep_pattern = &amp;lsquo;经济|金融|政策|治理|行政&amp;rsquo;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;u_pattern&lt;/strong&gt;&lt;/em&gt; 字符串；不确定性词典，用&lt;code&gt;|&lt;/code&gt;间隔词语，形如 &lt;strong&gt;u_pattern = &amp;lsquo;风险|危机|难以预测&amp;rsquo;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
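按上表 FEPU(sentence) 的定义可以写一个最小示意：先分句，再统计同时命中经济政策词表和不确定词表的句子占比。以下词表只是片段，分句规则也做了简化：

```python
import re

# 词表片段，完整词表见 zh_common_FEPU.yaml
ep_words = ['政策', '宏观调控', '货币政策']
u_words = ['不确定', '风险', '波动']

def fepu_sentence(text):
    # 简化分句：按中文句号、问号、叹号切分
    sentences = [s for s in re.split('[。！？]', text) if s.strip()]
    n = len(sentences)
    if n == 0:
        return 0.0
    # 分子 m：同时含经济政策词和不确定词的句子数
    m = sum(
        1 for s in sentences
        if any(w in s for w in ep_words) and any(w in s for w in u_words)
    )
    return m / n

mda_text = '宏观调控政策存在不确定性。公司经营稳健。原材料价格波动带来风险。'
print(round(fepu_sentence(mda_text), 3))  # 0.333
```

ct.fepu 封装的正是这类计算，并同时支持词语层面和句子层面两种口径。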
&lt;h2 id=&#34;四测量fepu&#34;&gt;四、测量FEPU&lt;/h2&gt;
&lt;h3 id=&#34;41-读取数据&#34;&gt;4.1 读取数据&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;mda01-23.csv.gz&lt;/strong&gt;&lt;/em&gt;   管理层讨论与分析2001-2023文本数据&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;行业代码00-23.xlsx&lt;/strong&gt;&lt;/em&gt;  含股票名称、股票代码、行业等字段。&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-23.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;经营讨论与分析内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#上市公司行业信息&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ind_info_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;行业代码00-23.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#合并数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;merge&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ind_info_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;on&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;how&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;inner&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#剔除ST和金融类企业&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;~&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票简称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ST&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;~&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;行业代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;J&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ignore_index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;42-批量计算fepu&#34;&gt;4.2 批量计算FEPU&lt;/h3&gt;
&lt;p&gt;选中字段 「&lt;em&gt;&lt;strong&gt;经营讨论与分析内容&lt;/strong&gt;&lt;/em&gt;」， 对该字段 .apply 运行函数 &lt;em&gt;&lt;strong&gt;ct.fepu&lt;/strong&gt;&lt;/em&gt; ，得到企业感知经济政策不确定性FEPU(含词语FEPUw和句子FEPUs两个口径)。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#常规速度代码&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#import cntext as ct&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#fepu_df = df[&amp;#39;经营讨论与分析内容&amp;#39;].apply(ct.fepu)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#res_df = pd.concat([df[[&amp;#39;会计年度&amp;#39;, &amp;#39;股票代码&amp;#39;]], fepu_df],   axis=1)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#res_df.to_csv(&amp;#39;result.csv&amp;#39;, index=False)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#res_df&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#加速版代码&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandarallel&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;initialize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;fepu_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;经营讨论与分析内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parallel_apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fepu&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;res_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;concat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fepu_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;res_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;企业感知不确定性FEPU指数2001-2023.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;res_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 1.35 s, sys: 1.2 s, total: 2.54 s
Wall time: 5min 9s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;43-可视化&#34;&gt;4.3 可视化&lt;/h3&gt;
&lt;p&gt;根据 FEPUw 和 FEPUs 的年度均值， 绘制2001-2023期间的经济政策不确定性变化折线图&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2001&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;FEPUw_s&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;FEPUs_s&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;FEPUw_s&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;FEPUw&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;FEPUs_s&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;FEPUs&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
    
    
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FEPUw_s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FEPUs_s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scatter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FEPUw_s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;FEPUw&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scatter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FEPUs_s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;FEPUs&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;感知经济政策不确定性FEPU年度均值&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;年份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;FEPU均值&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五参考文献&#34;&gt;五、参考文献&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[1]聂辉华, 阮睿, 沈吉. 企业不确定性感知、投资决策和金融资产配置[J]. 世界经济, 2020, 43 (06): 77-98.
[2]Li, Jing, Huihua Nie, Rui Ruan, and Xinyi Shen. &amp;#34;Subjective perception of economic policy uncertainty and corporate social responsibility: Evidence from China.&amp;#34; International Review of Financial Analysis 91 (2024): 103022.
[3]Huang, Yun, and Paul Luk. &amp;#34;Measuring economic policy uncertainty in China.&amp;#34; China Economic Review 59 (2020): 101367.
[4]Caldara, Dario, Matteo Iacoviello, Patrick Molligo, Andrea Prestipino, and Andrea Raffo. &amp;#34;The economic effects of trade policy uncertainty.&amp;#34; Journal of Monetary Economics 109 (2020): 38-59.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;六获取资料&#34;&gt;六、获取资料&lt;/h2&gt;
&lt;p&gt;内容原创不易，以下资料有偿提供。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 100元 
   - mda01-23.csv.gz
   - A01-23.csv.gz 
   - 企业感知不确定性FEPU指数
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;加微信 &lt;strong&gt;372335839&lt;/strong&gt;， 备注「姓名-学校-专业」。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;精选内容&#34;&gt;精选内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/datasets_available_for_management_science/&#34;&gt;LIST | 可供社科(经管)领域使用的数据集汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/the_text_analysis_list_about_ms/&#34;&gt;LIST | 社科(经管)数据挖掘文献资料汇总&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;推荐 | 文本分析库 cntext 使用手册&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course/&#34;&gt;付费视频课 | Python实证指标构建与文本分析&lt;/a&gt;
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<p>本文使用的缩写</p>
<ul>
<li><em><strong>EPU</strong></em>   经济政策不确定性(Economic Policy Uncertainty)</li>
<li><em><strong>FEPU</strong></em> 企业不确定性感知( Subjective perception of economic policy uncertainty)</li>
</ul>
<p><br><br></p>
<h2 id="一背景">一、背景</h2>
<p>「<em><strong>经济政策不确定性</strong></em>(EPU)」通常用来衡量经济中政策不确定性的水平。企业作为理性的经济主体，需要根据未来的期望成本和收益进行决策。政府的经济政策会在很大程度上影响企业的预期成本和收益，如果经济政策频繁变化，会给企业带来困扰。现有文献中经济政策不确定性的测量思路大致有：</p>
<ol>
<li>股票市场隐含波动率VIX衡量宏观层面经济不确定性。</li>
<li>利用外生变量，并结合企业对这些外生变量的依赖程度衡量企业面临的不确定性 。如政治事件、能源价格、汇率波动、贸易协定签订。</li>
<li>利用新闻文本测量的经济不确定性。</li>
</ol>
<p>但经济政策不确定性指标(EPU)存在两个问题：</p>
<ol>
<li>EPU是宏观指标， 同期所有企业的EPU有且仅有一个观测值。</li>
<li>EPU默认所有企业是同质， 对经济政策不确定性的感知是相同的。</li>
</ol>
<p>本推文参考聂辉华等(2020)中的算法，利用 <em><strong>经营讨论与分析(MD&amp;A)文本数据</strong></em> 测量「<em><strong>企业不确定性感知</strong></em>」(FEPU, Subjective perception of economic policy uncertainty)。</p>
<p><br><br></p>
<h2 id="二epufepu">二、EPU&amp;FEPU</h2>
<h3 id="21-epu">2.1 EPU</h3>
<p>在复现「<em><strong>企业不确定性感知FEPU</strong></em>」前，我们先了解利用新闻数据测量 <em><strong>EPU</strong></em> 的算法，这样更容易理解 <em><strong>FEPU</strong></em> 的原理。参考Huang、Yun&amp; Paul(2020)，大邓在前段时间分享了一个代码教程 <a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」计算 「经济政策不确定性」指数</a> 。 <br></p>
<p>新闻数据计算EPU的算法</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Step-1. 选择了114家中国大陆的报纸，其中包括北京、上海、广州和天津等主要城市的报纸。
Step-2. 对于每家报纸，检索同时包含经济、政策、不确定性三类关键词的文章。这些关键词的中文和英文对照可以在论文的表格1中找到。
Step-3. 将每个月的文章数量按照满足第一个关键词的文章数量进行缩放。
Step-4. 将时间序列标准化，使其在2000年1月至2011年12月期间的标准差为1。 保证所有媒体计算得到的epu是可比的。
Step-5. 对十家报纸的月度序列进行简单平均，并将指标归一化，使其在2000年1月至2011年12月期间的平均值为100。
</code></pre></div><br>
<p>文献中的算法描述较长、结构化不足，理解起来需要一些脑力。大邓换一种方式描述：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">EPU_t = m/n

- m  时期 t 内同时含经济Economic、政策Policy、不确定Uncertainty三类词的新闻条数
- n  时期 t 内新闻总条数
</code></pre></div><p><br><br></p>
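<p>上面的 EPU_t = m/n 可以用几行 pandas 代码演示。以下为最小示意，词表与新闻数据均为虚构，仅用于说明计算逻辑：</p>

```python
import pandas as pd

# 示意词表(虚构, 仅为演示; 真实词表见 cntext 内置词典 zh_common_EPU.yaml)
e_pattern = '经济|金融'
p_pattern = '政策|监管'
u_pattern = '不确定|风险'

# 新闻 DataFrame: date、text 两个字段
news = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-05', '2020-01-20',
                            '2020-02-10', '2020-02-15']),
    'text': ['经济政策不确定性上升', '市场平稳运行',
             '金融监管政策存在风险', '企业经营正常'],
})

# 一条新闻同时命中 E、P、U 三类词才计入分子 m
hit = (news['text'].str.contains(e_pattern)
       & news['text'].str.contains(p_pattern)
       & news['text'].str.contains(u_pattern))

# 按月分组, EPU_t = m / n (当期命中条数 / 当期新闻总条数)
epu = hit.groupby(news['date'].dt.to_period('M')).mean()
print(epu)
```

每月两条新闻中各有一条同时命中 E、P、U 三类词，因此两个月的 EPU 均为 0.5。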
<h3 id="22-fepu">2.2 FEPU</h3>
<p>理解了 EPU， 就能类比理解「<em><strong>企业不确定性感知FEPU</strong></em>」的算法。</p>
<table>
<thead>
<tr>
<th>算法</th>
<th>数据</th>
<th>层次</th>
<th>n</th>
<th>m</th>
</tr>
</thead>
<tbody>
<tr>
<td>EPU</td>
<td>新闻媒体文本</td>
<td>新闻</td>
<td>时期t新闻总条数n</td>
<td>时期t同时存在E、P、U三类词的新闻条数m</td>
</tr>
<tr>
<td>FEPU(word)</td>
<td>管理层讨论与分析(md&amp;a)</td>
<td>词语</td>
<td>时期t企业i的 md&amp;a 文本的词语总数n</td>
<td>1. 对md&amp;a进行分句<br/>2. 在同时含EP、U两类词的句子中， 统计 U 类词语出现次数之和m</td>
</tr>
<tr>
<td>FEPU(sentence)</td>
<td>管理层讨论与分析(md&amp;a)</td>
<td>句子</td>
<td>将时期t的企业i的 md&amp;a 文本进行分句，得到句子个数n</td>
<td>1. 对md&amp;a进行分句<br/>2. 统计同时含EP、U两类词的句子个数m</td>
</tr>
</tbody>
</table>
<p><br><br></p>
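<p>上表中 FEPUw、FEPUs 的定义可以用一个最小示意函数表达。词表为虚构节选；分母的词语数此处用字符数近似，真实实现应先分词：</p>

```python
import re

# 示意词表(虚构节选; 真实词表见 cntext 内置词典 zh_common_FEPU.yaml)
ep_pattern = '政策|政府|宏观'
u_pattern = '不确定|风险|波动'

def fepu_sketch(text):
    """按上表定义的最小实现, 返回 (FEPUw, FEPUs)"""
    # 1. 对 md&a 文本分句
    sentences = [s for s in re.split(r'[。！？!?]', text) if s]
    # 2. 找出同时含 EP、U 两类词的句子
    hits = [s for s in sentences
            if re.search(ep_pattern, s) and re.search(u_pattern, s)]
    # FEPUs: 命中句子数 m / 句子总数 n
    fepu_s = len(hits) / len(sentences)
    # FEPUw: 命中句子中 U 类词出现次数之和 m / 全文长度 n
    # (此处分母用字符数近似, 真实实现应使用分词后的词语总数)
    u_count = sum(len(re.findall(u_pattern, s)) for s in hits)
    fepu_w = u_count / len(text)
    return fepu_w, fepu_s

text = '宏观政策存在不确定性。公司经营稳定。政府调控带来风险和波动。'
print(fepu_sketch(text))
```

三个句子中有两句同时命中 EP、U 两类词，故 FEPUs = 2/3；两句命中句子里 U 类词共出现 3 次。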
<h2 id="三-准备cntext">三、 准备cntext</h2>
<p>EPU 和 FEPU 已封装到 cntext 中，计算这两个指数变得容易多了。</p>
<h3 id="31-安装cntext">3.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install pdfdocx
pip3 install distinctiveness
pip3 install pandarallel
pip3 install cntext
</code></pre></div><p><br><br></p>
<h3 id="32-内置词典">3.2 内置词典</h3>
<table>
<thead>
<tr>
<th>内置文件</th>
<th>词典</th>
<th>参考文献</th>
</tr>
</thead>
<tbody>
<tr>
<td>zh_common_EPU.yaml</td>
<td>经济E、政策P、不确定U</td>
<td>Huang, Yun, and Paul Luk（2020）</td>
</tr>
<tr>
<td>zh_common_FEPU.yaml</td>
<td>经济政策EP、不确定性U</td>
<td>聂辉华, 阮睿&amp;沈吉（2020）</td>
</tr>
</tbody>
</table>
<br>
<h4 id="31-查看内置词典">3.2.1 查看内置词典</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">ct</span><span class="o">.</span><span class="n">get_dict_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.2

[&#39;zh_common_NTUSD.yaml&#39;,
 &#39;zh_common_DUTIR.yaml&#39;,
 &#39;enzh_common_StopWords.yaml&#39;,
 &#39;en_valence_Concreteness.yaml&#39;,
 &#39;en_common_LoughranMcDonald.yaml&#39;,
 &#39;zh_common_FinanceSenti.yaml&#39;,
 &#39;zh_common_TsinghuaPraiseDegrade.yaml&#39;,
 &#39;zh_common_FEPU.yaml&#39;,    # 聂辉华, 阮睿&amp;沈吉（2020）
 &#39;en_common_ANEW.yaml&#39;,
 &#39;en_common_NRC.yaml&#39;,
 &#39;zh_valence_ChineseEmoBank.yaml&#39;,
 &#39;zh_valence_SixSemanticDimensionDatabase.yaml&#39;,
 &#39;zh_common_FinacialFormalUnformal.yaml&#39;,
 &#39;zh_common_LoughranMcDonald.yaml&#39;,
 &#39;enzh_common_AdvConj.yaml&#39;,
 &#39;en_common_SentiWS.yaml&#39;,
 &#39;zh_common_Digitalization.yaml&#39;,
 &#39;en_common_LSD2015.yaml&#39;,
 &#39;zh_common_HowNet.yaml&#39;,
 &#39;zh_common_EPU.yaml&#39;]      #Huang, Yun, and Paul Luk（2020）
</code></pre></div><br>
<h4 id="312-导入词典">3.2.2 导入词典</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">FEPU_infos</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_FEPU.yaml&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">FEPU_infos</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;Name&#39;: &#39;中文经济政策不确定性词典&#39;, 
&#39;Desc&#39;: &#39;中文经济政策不确定性词典, 含经济政策EconomicPolicy、不确定性Uncertainty两个词表&#39;, 
&#39;Refer&#39;: &#39;聂辉华, 阮睿, 沈吉. 企业不确定性感知、投资决策和金融资产配置[J]. 世界经济, 2020, 43 (06): 77-98.&#39;, 
&#39;Category&#39;: [&#39;经济政策&#39;, &#39;不确定&#39;], 
&#39;Dictionary&#39;: 
    {&#39;经济政策&#39;: [&#39;市政&#39;, &#39;政策&#39;, &#39;货币政策&#39;, &#39;政策鼓励&#39;, &#39;国家&#39;, &#39;扩内需&#39;, &#39;保增长&#39;, &#39;促发展&#39;, &#39;产业发展&#39;, &#39;法律&#39;, &#39;法规&#39;, &#39;行业政策&#39;, &#39;产业政策&#39;, &#39;宏观政策&#39;, &#39;国民经济&#39;, &#39;有关部门&#39;, &#39;产业结构调整&#39;, &#39;产业结构&#39;, &#39;当地政府&#39;, &#39;政府&#39;, &#39;经济政策&#39;, &#39;经济走势&#39;, &#39;所得税&#39;, &#39;税收减免&#39;, &#39;刺激政策&#39;, &#39;限贷令&#39;, &#39;限购令&#39;, &#39;保障房&#39;, &#39;宏观调控&#39;, &#39;产业发展&#39;, &#39;证监会&#39;, &#39;国家政策&#39;, &#39;政治&#39;, &#39;军事&#39;, &#39;政策环境&#39;, &#39;宏观&#39;, &#39;政府补助政策&#39;, &#39;调控政策&#39;, &#39;税收政策&#39;, &#39;政策扶持&#39;], 
    &#39;不确定&#39;: [&#39;风险&#39;, &#39;经营风险&#39;, &#39;市场风险&#39;, &#39;信用风险&#39;, &#39;不确定&#39;, &#39;波动&#39;, &#39;变化&#39;, &#39;改变&#39;, &#39;徘徊&#39;, &#39;不稳&#39;, &#39;不稳定&#39;, &#39;不寻常&#39;, &#39;错综复杂&#39;, &#39;非常复杂&#39;, &#39;纷繁复杂&#39;, &#39;纷纭复杂&#39;, &#39;十分复杂&#39;, &#39;变得复杂&#39;, &#39;风云突变&#39;, &#39;矛盾突出&#39;, &#39;突变&#39;, &#39;复杂多变&#39;, &#39;诡谲多变&#39;, &#39;阵痛&#39;, &#39;过渡&#39;, &#39;问责&#39;, &#39;整顿&#39;, &#39;危险&#39;, &#39;动荡&#39;, &#39;多变性&#39;, &#39;震荡&#39;, &#39;难以确定&#39;, &#39;难以预测&#39;, &#39;难以语料&#39;, &#39;难以琢磨&#39;, &#39;难以捉摸&#39;, &#39;接受考验&#39;, &#39;混乱&#39;, &#39;时而&#39;, &#39;随机&#39;]}
    }
</code></pre></div><p><br><br></p>
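<p>后文 ct.fepu 的 ep_pattern、u_pattern 参数即是把词表用 <code>|</code> 连接得到的正则模式。下面用词典结构的一个节选演示（FEPU_infos 的结构同上文 ct.read_yaml_dict 的返回值，此处为手工构造的节选）：</p>

```python
# FEPU_infos 的结构同 ct.read_yaml_dict('zh_common_FEPU.yaml') 的返回值, 此处仅取节选演示
FEPU_infos = {
    'Dictionary': {
        '经济政策': ['政策', '货币政策', '宏观调控'],
        '不确定': ['风险', '不确定', '波动'],
    }
}

# 用 | 把词表拼成正则模式, 可作为 ct.fepu 的 ep_pattern / u_pattern 参数
ep_pattern = '|'.join(FEPU_infos['Dictionary']['经济政策'])
u_pattern = '|'.join(FEPU_infos['Dictionary']['不确定'])
print(ep_pattern)  # 政策|货币政策|宏观调控
print(u_pattern)   # 风险|不确定|波动
```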
<h3 id="33-内置函数">3.3 内置函数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.epu(df, freq=&#39;Y&#39;, e_pattern=&#39;&#39;, p_pattern=&#39;&#39;, u_pattern=&#39;&#39;)
</code></pre></div><ul>
<li><em><strong>df</strong></em>  新闻DataFrame；  DataFrame必须含date和text两个字段；每行一条记录，含所有时期所有的新闻。</li>
<li><em><strong>freq</strong></em> 字符串；决定EPU的时间粒度， 年Y、月M、天D， 默认freq=&lsquo;Y&rsquo;</li>
<li><em><strong>e_pattern</strong></em>  字符串；经济类词典，用<code>|</code>间隔词语，形如 <strong>e_pattern = &lsquo;经济|金融&rsquo;</strong></li>
<li><em><strong>p_pattern</strong></em>  字符串；政策词典，用<code>|</code>间隔词语，形如 <strong>p_pattern = &lsquo;政策|治理|行政&rsquo;</strong></li>
<li><em><strong>u_pattern</strong></em> 字符串；不确定性词典，用<code>|</code>间隔词语，形如 <strong>u_pattern = &lsquo;风险|危机|难以预测&rsquo;</strong></li>
</ul>
<p>返回epu时间序列数据，格式为DataFrame</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">ct.fepu(text,  ep_pattern=&#39;&#39;, u_pattern=&#39;&#39;)
</code></pre></div><ul>
<li><em><strong>text</strong></em>  ；某时期t某企业i的管理层讨论与分析md&amp;a文本</li>
<li><em><strong>ep_pattern</strong></em>  字符串；经济政策类词典，用<code>|</code>间隔词语，形如 <strong>ep_pattern = &lsquo;经济|金融|政策|治理|行政&rsquo;</strong></li>
<li><em><strong>u_pattern</strong></em> 字符串；不确定性词典，用<code>|</code>间隔词语，形如 <strong>u_pattern = &lsquo;风险|危机|难以预测&rsquo;</strong></li>
</ul>
<p><br><br></p>
<h2 id="四测量fepu">四、测量FEPU</h2>
<h3 id="41-读取数据">4.1 读取数据</h3>
<ul>
<li><em><strong>mda01-23.csv.gz</strong></em>   管理层讨论与分析2001-2023文本数据</li>
<li><em><strong>行业代码00-23.xlsx</strong></em>  含股票名称、股票代码、行业等字段。</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>

<span class="c1">#上市公司行业信息</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;行业代码00-23.xlsx&#39;</span><span class="p">)</span>

<span class="c1">#合并数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>

<span class="c1">#剔除ST和金融类企业</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票简称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;ST&#39;</span><span class="p">))</span> <span class="o">&amp;</span> <span class="p">(</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;J&#39;</span><span class="p">))]</span>
<span class="n">df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="42-批量计算fepu">4.2 批量计算FEPU</h3>
<p>选中字段 「<em><strong>经营讨论与分析内容</strong></em>」， 对该字段 .apply 运行函数 <em><strong>ct.fepu</strong></em> ，得到企业感知经济政策不确定性指标 FEPU（含词语层面 FEPUw 和句子层面 FEPUs 两个指标）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="c1">#常规速度代码</span>
<span class="c1">#import cntext as ct</span>
<span class="c1">#fepu_df = df[&#39;经营讨论与分析内容&#39;].apply(ct.fepu)</span>
<span class="c1">#res_df = pd.concat([df[[&#39;会计年度&#39;, &#39;股票代码&#39;]], fepu_df],   axis=1)</span>
<span class="c1">#res_df.to_csv(&#39;result.csv&#39;, index=False)</span>
<span class="c1">#res_df</span>


<span class="c1">#加速版代码</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">from</span> <span class="nn">pandarallel</span> <span class="kn">import</span> <span class="n">pandarallel</span>
<span class="n">pandarallel</span><span class="o">.</span><span class="n">initialize</span><span class="p">()</span>
<span class="n">fepu_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">parallel_apply</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">fepu</span><span class="p">)</span>
<span class="n">res_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">[[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]],</span> <span class="n">fepu_df</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">res_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;企业感知不确定性FEPU指数2001-2023.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">res_df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 1.35 s, sys: 1.2 s, total: 2.54 s
Wall time: 5min 9s
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="43-可视化">4.3 可视化</h3>
<p>根据 FEPUw 和 FEPUs 的年度均值， 绘制2001-2023期间的感知经济政策不确定性变化折线图</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>
<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>


<span class="n">years</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2001</span><span class="p">,</span> <span class="mi">2024</span><span class="p">)</span>
<span class="n">FEPUw_s</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">FEPUs_s</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">res_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">):</span>
    <span class="n">FEPUw_s</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;FEPUw&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
    <span class="n">FEPUs_s</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;FEPUs&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
    
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">years</span><span class="p">,</span> <span class="n">FEPUw_s</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">years</span><span class="p">,</span> <span class="n">FEPUs_s</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">years</span><span class="p">,</span> <span class="n">FEPUw_s</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;SEPUw&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">years</span><span class="p">,</span> <span class="n">FEPUs_s</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;SEPUs&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;感知经济政策不确定性FEPU年度均值&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;FEPU均值&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="五参考文献">五、参考文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]聂辉华, 阮睿, 沈吉. 企业不确定性感知、投资决策和金融资产配置[J]. 世界经济, 2020, 43 (06): 77-98.
[2]Li, Jing, Huihua Nie, Rui Ruan, and Xinyi Shen. &#34;Subjective perception of economic policy uncertainty and corporate social responsibility: Evidence from China.&#34; International Review of Financial Analysis 91 (2024): 103022.
[3]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 10136
[4]Caldara, Dario, Matteo Iacoviello, Patrick Molligo, Andrea Prestipino, and Andrea Raffo. &#34;The economic effects of trade policy uncertainty.&#34; Journal of Monetary Economics 109 (2020): 38-59.
</code></pre></div><p><br><br></p>
<h2 id="六获取资料">六、获取资料</h2>
<p>内容原创不易，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 100元 
   - mda01-23.csv.gz
   - A01-23.csv.gz 
   - 企业感知不确定性FEPU指数
</code></pre></div><p>加微信 <strong>372335839</strong>， 备注「姓名-学校-专业」。</p>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库 cntext 使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
<title>管理世界 | 使用md&amp;a数据计算「企业融资约束指标」</title>
      <link>https://textdata.cn/blog/2024-12-31-using-regex-to-compute-the-financial_constraints/</link>
      <pubDate>Wed, 24 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-12-31-using-regex-to-compute-the-financial_constraints/</guid>
<description>本文采用文本分析方法构建了融资约束指标，在此基础上，实证检验了多个大股东对企业融资约束的影响以及相应的作用机理。我们发现，多个大股东的公司有着较低的融资约束水平。该结论在控制内生性情况下依然成立。中介效应模型的检验结果表明，其他大股东通过抑制控股股东的掏空行为降低了企业融资约束。进一步的研究结果表明，在其他大股东具有较强的监督动机和监督能力（大股东数量更多、持股数量之和更大、大股东之间不容易合谋）、及更好的外部环境（信息环境、法律环境）时，公司的融资约束水平更低，这些发现在逻辑上为其他大股东的监督假说提供证据支持的同时，也表明大股东发挥监督作用降低企业融资约束需要一定条件。本文为完善中国情景下的融资约束指标构建、更好度量中国企业融资约束提供了有益参考；同时，为股权结构安排的经济后果提供了新的证据支持。</description>
      <content:encoded><![CDATA[<h2 id="技术路线">技术路线</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[工作量]
  1. 代码130+行
  2. 调试时间 3 小时， 运行时间 20 小时
  
  
[内容]
  1. 设计正则表达式， 识别企业融资约束
  2. 构建企业管理层讨论与分析文本向量(标准化) Vec_it
  3. 构建板块(沪、深)文本向量(标准化)BoardVec_bt
  4. 构建行业文本向量(标准化) IndustryVec_it
  5. 构建融资约束样本集的文本均值向量(标准化) ConstrainedVec_t
  6. 基于前面几个变量，计算得到
     - BoardScore_bt 、 IndustryScore_it
     - 得到5w多个csv文件(中间运算结果), 存储在 fin_constrain_output/{year}/{code}.csv
     
  7. [融资约束FC指标计量建模]
    - ConstrainedScore_it = β0 + β1 * BoardScore_bt + β2 * IndustryScore_it + E_it
    - BoardScore_bt  交易所引发的融资约束相似度
    - IndustryScore_it  行业特征引发的融资约束相似度
    - E_it  残差就是本文要计算的[融资约束指标FC]
</code></pre></div><p><br><br></p>
<h2 id="一识别融资约束样本">一、识别融资约束样本</h2>
<p><strong>在获取 MD&amp;A 的基础上，采用正则表达式（Regular Expression）检索出隐含融资约束信息的文本，并把相应的 MD&amp;A 进行标记，纳入对应年度的融资约束文本集中</strong>。 其中，在检索并标记融资约束文本的过程中，本文参考 Hoberg 和 Maksimovic（2015）、Buehlmaier 和 Whited（2016）的研究方法。</p>
<p>Hoberg 和 Maksimovic（2015）认为，融资约束体现为<strong>投资计划、项目的推迟、搁置乃至放弃</strong>，因此，他们构造了两组“<strong>推迟投资</strong>”词语列表：一组是有推迟、延期、搁置含义的动词词表；另一组是与投资、项目、计划等意思相近的名词词表。 若在待识别文本中，动词词表和名词词表中的词语、词组同时出现，且相隔不超过 12 词，则将其判定为有推迟投资含义的融资约束文本。</p>
<p>Buehlmaier 和 Whited（2016）在构建股权融资约束文本集的过程中，直接引用了前者的“推迟投资”词表，同时，为了确定投资的推迟确实是由股权融资方面的问题引起的，还计算了距“推迟投资”语句 12 词以内股权融资相关词语出现的频率，最终只把频率排行前 250 的观测加入股权融资约束文本集。</p>
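<p>上述“推迟动词 + 12 词以内 + 投资名词”的窗口匹配思路，可以用正则表达式勾勒如下（词表仅为示意，并非两篇论文的完整词表）：</p>

```python
import re

# 示意：英文语境下“推迟动词 + 12词以内 + 投资名词”的窗口匹配（词表为假设示例）
delay_verbs = r"(?:delay|postpone|suspend|abandon)"
invest_nouns = r"(?:investment|project|plan|expansion)"

# 动词与名词之间最多相隔 12 个词
pattern = re.compile(
    rf"\b{delay_verbs}[a-z]*\b(?:\W+\w+){{0,12}}?\W+{invest_nouns}[a-z]*\b",
    re.IGNORECASE,
)

print(bool(pattern.search("We decided to postpone the planned expansion.")))  # True
print(bool(pattern.search("The investment plan proceeded on schedule.")))     # False
```

<p>窗口长度由 <code>{0,12}</code> 控制，调小该上限即可收紧匹配条件。</p>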
<br>
<h3 id="11-前人不足">1.1 前人不足</h3>
<p>需要说明的是，尽管本文采用的方法借鉴了 Hoberg 和 Maksimovic（2015）和 Buehlmaier 和 Whited （2016）的做法，但与其存在着两个方面的差异。</p>
<ul>
<li>第一，<strong>本文没有通过“推迟投资”界定融资约束，而是通过公司对资金状况的描述去识别，相较而言这一做法更为直接</strong>。 例如，若公司明确表明融资能力有限，资金紧张，则被视为融资约束样本。</li>
<li>第二，<strong>我们认为，即便“推迟投资”词表中的动词和名词在相隔 12 词以内出现，两个词之间也未必有关联，12词的窗口长度容易引起大量误判</strong>。 尤其考虑到汉语使用较为灵活，不同公司在表述上也存在着较大的差异，因此，本文使用了可覆盖更多表述形式、更加灵活的正则表达式进行检索，并根据数次检索结果排除了很多容易导致误判的情形，查准率较高。</li>
</ul>
<br>
<h3 id="12-本文完善">1.2 本文完善</h3>
<p>具体地，为了在 MD&amp;A 文本集中检索出融资约束文本，我们在设计正则表达式时将能显示公司有融资约束的各种文字表达，以词语组合的形式进行提炼。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">regex1</span> <span class="o">=</span> <span class="s2">&#34;[^。]*?(融资|资金|筹资)[^。]{0, 6}?(难以|不能|无法|不足以)[^。]*&#34;</span>
<span class="c1">#能在 MD&amp;A 文本中匹配出以下形式的句子：（除句 号以外的任意长度字符串）+融资/资金/筹资+（六个 字符长度以内的任意字符串）+难以/不能/无法满足/不足以+（除句号以外的任意长度字符串）；</span>

<span class="n">regex2</span> <span class="o">=</span> <span class="s2">&#34;[^。]*?(融资|资金|筹资)[^。]{0, 6}?(成本|压力|难度)[^。]{0, 4}?(升|增|高|大)[^。]*&#34;</span>
<span class="c1">#可在句号以外的任意长度字符串）+融资/资金/筹资+（六 个字符长度以内的任意字符串）+成本/压力/难度+ （4 个字符长度以内的任意字符串）+升/高/增/大+ （除句号以外的任意长度字符串）。</span>
</code></pre></div><p>仅仅考虑融资约束文本的各种可能表述是不够的，会出现大量误判，例如，机械地将“资金”之后 4 个字符以内出现“不足”的语句识别为融资约束语句，非常容易造成误判，因为部分 MD&amp;A 提及公司“资金管理水平不足”，而资金管理水平反映的是公司运营能力，和融资约束无直接关系。 诸如此类的匹配应视作误判而排除，因此我们利用正则表达式灵活的语法规则，同时构造了排除性条件。 <strong>在此基础上，将这些对应着不同判断逻辑的“规则字符串”合并至同一个正则表达式中</strong>。 如果难以合并，则利用程序语言的条件判断逻辑，对正则表达式组进行组合使用。 <strong>在具体操作中，本文就使用了正则表达式组。</strong></p>
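<p>排除性条件的一种常见实现方式是负向先行断言（negative lookahead）。下面是一个最小示意，排除词“管理”为假设示例，并非论文原始正则：</p>

```python
import re

# 示意：用负向先行断言排除“资金管理水平不足”这类误判（排除条件为假设示例）
pattern = r"(?:融资|资金|筹资)(?!管理)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*"

print(bool(re.search(pattern, "公司资金不足以支撑未来的项目。")))  # True
print(bool(re.search(pattern, "公司资金管理水平不足。")))          # False
```

<p>当排除条件彼此独立、难以并入同一个表达式时，再退回到“正则表达式组 + 条件判断”的组合使用方式。</p>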
<p><br><br></p>
<h2 id="二-构建中文融资约束样本识别代码">二、 构建中文融资约束样本识别代码</h2>
<p><strong>前面的样本识别都是论文原文，接下来是大邓对该论文的融资约束样本识别算法的复现</strong>。</p>
<h3 id="21-融资约束文本的场景">2.1 融资约束文本的场景</h3>
<p>这是一个相对复杂的需求，需要综合考虑多种情况， 对于每种情况，都构建一个单独的正则表达式，用于匹配对应的文本。可以使用“或”运算符， 合并为一个更大的正则表达式。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>


<span class="c1">#融资不足情况</span>
<span class="n">regex1</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*&#34;</span>
<span class="c1">#融资成本或压力过大情况</span>
<span class="n">regex2</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:成本|压力|难度)[^。]{0,4}?(?:升|增|高|大)[^。]*&#34;</span>

<span class="c1">#可以使用“或”运算符， 合并为一个更大的正则表达式</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(&#34;</span> <span class="o">+</span> <span class="n">regex1</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)|(&#34;</span> <span class="o">+</span> <span class="n">regex2</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)&#34;</span>


<span class="c1">#实验数据</span>
<span class="n">text1</span> <span class="o">=</span> <span class="s2">&#34;公司在过去几年中进行了大量的投资，导致资金短缺，难以支持公司未来的发展计划。&#34;</span>
<span class="n">text2</span> <span class="o">=</span> <span class="s2">&#34;公司在过去几年中进行了大量的投资计划，资金状况良好，没有融资压力。&#34;</span>

<span class="c1">#实验结果</span>
<span class="n">matches1</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text1</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">matches1</span><span class="p">)</span>
<span class="n">matches2</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">matches2</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    [(&#39;资金短缺，难以支持公司未来的发展计划&#39;, &#39;&#39;)]
    []
</code></pre></div><br>
<p>在上面的例子中，pattern能识别出文本是否含有融资约束。</p>
<ul>
<li>text1<strong>有融资约束</strong>，所以返回带 <strong>有内容</strong> 的 <strong>matches1</strong></li>
<li>text2<strong>没有融资约束</strong>，所以返回 <strong>没有内容</strong> 的 <strong>matches2</strong></li>
</ul>
<br>
<h3 id="22-识别中文融资约束样本的最终代码">2.2 识别中文融资约束样本的最终代码</h3>
<p>前面的内容都是算法逐步实现的过程，现在咱们合并为一个函数代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>

<span class="k">def</span> <span class="nf">is_financial_constraint</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#正则表达式组</span>
    <span class="n">regex1</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*&#34;</span>
    <span class="n">regex2</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:成本|压力|难度)[^。]{0,4}?(?:升|增|高|大)[^。]*&#34;</span>
    <span class="n">pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(&#34;</span> <span class="o">+</span> <span class="n">regex1</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)|(&#34;</span> <span class="o">+</span> <span class="n">regex2</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)&#34;</span>
    
    <span class="c1">#带内容的结果为融资约束，为True；反之，为False</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span><span class="o">&gt;=</span><span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">True</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">False</span>
    
    
<span class="c1">#实验数据</span>
<span class="n">text1</span> <span class="o">=</span> <span class="s2">&#34;公司在过去几年中进行了大量的投资，导致资金短缺，难以支持公司未来的发展计划。&#34;</span>
<span class="n">text2</span> <span class="o">=</span> <span class="s2">&#34;公司在过去几年中进行了大量的投资计划，资金状况良好，没有融资压力。&#34;</span>

<span class="c1">#实验结果</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;text1文本是否为融资约束: &#39;</span><span class="p">,</span> <span class="n">is_financial_constraint</span><span class="p">(</span><span class="n">text1</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;text2文本是否为融资约束: &#39;</span><span class="p">,</span> <span class="n">is_financial_constraint</span><span class="p">(</span><span class="n">text2</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    text1文本是否为融资约束:  True
    text2文本是否为融资约束:  False
</code></pre></div><p><br><br></p>
<h2 id="三批量识别融资约束样本">三、批量识别融资约束样本</h2>
<p>接下来对 <em><strong>data/mda01-23.csv.gz</strong></em> 数据集中所有md&amp;a进行识别。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#读取md&amp;a</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


<span class="c1">#上市公司行业信息</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.xlsx&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">])</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[</span><span class="n">ind_info_df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_info_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="n">date</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">:</span><span class="s1">&#39;行业代码&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]]</span>

<span class="c1">#合并数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">57545
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<p>新建板块字段： 上海证券交易所股票代码大多以 6、9 开头， 深圳证券交易所股票代码大多以 0、3 开头</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plate</span><span class="p">(</span><span class="n">code</span><span class="p">):</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A6&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A9&#39;</span><span class="p">):</span>
        <span class="k">return</span> <span class="s1">&#39;上海&#39;</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A0&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A3&#39;</span><span class="p">):</span>
        <span class="k">return</span> <span class="s1">&#39;深圳&#39;</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="s1">&#39;其他&#39;</span>
    
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">plate</span><span class="p">)</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>  
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;融资约束&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">is_financial_constraint</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
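<p>.apply 会逐行调用 Python 函数；也可以用 pandas 自带的 Series.str.contains 向量化地完成同样的判断（正则与上文函数一致，下面用两条示例文本演示）：</p>

```python
import pandas as pd

# 示意：用 Series.str.contains 向量化地判断融资约束（正则与 is_financial_constraint 一致）
pattern = (r"(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*"
           r"|(?:融资|资金|筹资)[^。]{0,6}?(?:成本|压力|难度)[^。]{0,4}?(?:升|增|高|大)[^。]*")

texts = pd.Series([
    "公司在过去几年中进行了大量的投资，导致资金短缺，难以支持公司未来的发展计划。",
    "公司在过去几年中进行了大量的投资计划，资金状况良好，没有融资压力。",
])
flags = texts.str.contains(pattern, regex=True)
print(flags.tolist())  # [True, False]
```

<p>str.contains 对每行做 re.search，返回布尔 Series，可直接赋给「融资约束」字段。</p>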
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#融资约束样本占比</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;融资约束&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><pre><code>0.10631679555130767
</code></pre>
<br>
<h3 id="注意">注意</h3>
<p>设计的函数 <em><strong>is_financial_constraint</strong></em> 应该经过人工抽样检查， 检查的目的是改良正则表达式组。 这里假定我们已经检查过，没有发现问题。</p>
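<p>人工检查可以从被标记为融资约束的样本中抽样，打印正则命中的句子逐条判断是否误判。下面是一个最小示意（demo 数据为虚构，仅用第一条正则演示）：</p>

```python
import re
import pandas as pd

# 示意：抽样打印命中的句子供人工检查（demo 数据为虚构）
pattern = r"(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*"

demo = pd.DataFrame({
    '经营讨论与分析内容': [
        '报告期内公司资金紧张，融资难以满足项目需求。',
        '公司资金状况良好。',
    ],
    '融资约束': [True, False],
})

# 只看被标记为融资约束的样本，打印正则命中的片段
for text in demo.loc[demo['融资约束'], '经营讨论与分析内容']:
    print(re.findall(pattern, text))
```

<p>对真实数据可改为 <code>df[df['融资约束']].sample(n)</code> 随机抽样，再根据误判情形补充排除性条件。</p>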
<p><br><br></p>
<h2 id="四构建融资约束指标">四、构建融资约束指标</h2>
<p>前面的融资约束样本识别，只是识别出融资约束是否存在，信息的颗粒度比较粗糙。<strong>这篇论文使用文本相似度算法，构建了每家企业的融资约束指标</strong>。</p>
<p>本文同样参照 Hoberg 和 Maksimovic（2015）的研究方法，我们认为，融资约束程度相近的公司，其在“管理层讨论与分析”中的用词和表述也会趋于一致。 因此，通过采用余弦相似度的方法，能够识别出全体样本的融资约束程度，并以连续变量的形式呈现。</p>
<p>具体实现算法步骤</p>
<ol>
<li>
<p>将每个 md&amp;a 文本转化为向量 <em><strong>Vec_it</strong></em></p>
</li>
<li>
<p>对当年所有属于融资约束样本的 <em><strong>Vec_it</strong></em> 求均值，得到 <em><strong>ConstrainedVec_t</strong></em></p>
</li>
<li>
<p>每家企业当年融资约束水平(程度)由 <em><strong>Vec_it</strong></em> 与 <em><strong>ConstrainedVec_t</strong></em> 之积，即 <em><strong>ConstrainedScore_it</strong></em> 所体现。</p>
</li>
<li>
<p>考虑到市场板块、行业性因素对融资约束的影响，不能直接使用 <em><strong>ConstrainedScore_it</strong></em>。</p>
<ul>
<li>对历年隶属于各个板块的公司 MD&amp;A，求标准化词频向量的均值并做标准化处理，记为 BoardVec_bt ，该向量反映了上市板 b 在 t 年的共同性信息披露内容。</li>
<li><em><strong>Vec_it</strong></em> 与对应板块 <em><strong>BoardVec_bt</strong></em> 之积，即为因 MD&amp;A 共性内容导致的相似度， 记作 <em><strong>BoardScore_bt</strong></em>。</li>
<li>利用相同方法，计算出因行业特征引发的相似度，记作 <em><strong>IndustryScore_it</strong></em> 。</li>
</ul>
</li>
<li>
<p><code>ConstrainedScore_it = β0 + β1 * BoardScore_bt + β2 * IndustryScore_it + E_it</code></p>
<ul>
<li><em><strong>BoardScore_bt</strong></em>  交易所引发的融资约束相似度</li>
<li><em><strong>IndustryScore_it</strong></em>  行业特征引发的融资约束相似度</li>
<li><em><strong>E_it</strong></em>  残差就是本文要计算的[融资约束指标FC]</li>
</ul>
</li>
</ol>
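<p>上述五个步骤可以用一个极简的数值草图串起来（随机数据，仅演示计算流程；变量名与正文一致，但并非论文原始实现）：</p>

```python
import numpy as np

# 极简数值草图：随机“词频”数据，仅演示步骤1~5的计算流程
rng = np.random.default_rng(0)
n_firms, n_words = 50, 200

# 步骤1：词频向量做 L2 标准化，得到 Vec_it
vecs = rng.random((n_firms, n_words))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# 步骤2：融资约束样本的均值向量，标准化后得到 ConstrainedVec_t
constrained_mask = np.arange(n_firms) % 10 == 0   # 假设每10家有1家为融资约束样本
cvec = vecs[constrained_mask].mean(axis=0)
cvec /= np.linalg.norm(cvec)

# 步骤3：Vec_it 与 ConstrainedVec_t 的内积（标准化向量的内积即余弦相似度）
ConstrainedScore = vecs @ cvec

# 步骤4：板块/行业均值向量与相应相似度（行业相似度此处用随机数代替）
bvec = vecs.mean(axis=0)
bvec /= np.linalg.norm(bvec)
BoardScore = vecs @ bvec
IndustryScore = rng.random(n_firms)

# 步骤5：OLS 回归取残差，残差即融资约束指标 FC
X = np.column_stack([np.ones(n_firms), BoardScore, IndustryScore])
beta, *_ = np.linalg.lstsq(X, ConstrainedScore, rcond=None)
FC = ConstrainedScore - X @ beta
print(FC.shape)   # (50,)
```

<p>含截距项的 OLS 残差均值为零，因此 FC 度量的是剔除板块、行业共性后各企业相对的融资约束程度。</p>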
<p><br><br></p>
<h3 id="41-计算2023年的vec_it">4.1 计算2023年的Vec_it</h3>
<p>计算量太大，先以2023为例写代码。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df_per_year</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;2023&#39;</span><span class="p">]</span>
<span class="n">df_per_year</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df_per_year</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df4.png" alt=""  />
</p>
<br>
<p>处理2023年的 「<em><strong>经营讨论与分析内容</strong></em>」字段内容，使其:</p>
<ol>
<li>只保留中文内容</li>
<li>剔除停用词</li>
<li>整理为用空格间隔的字符串(类西方语言文本格式)</li>
<li>将本文转为向量后，标准化。</li>
<li>合并一些需要的字段，如 <code>['股票代码', '会计年度', '板块', '行业代码', '融资约束']</code></li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">re</span>


<span class="c1">#cntext1.9.2</span>
<span class="c1">#stopwords = ct.load_pkl_dict(&#39;STOPWORDS.pkl&#39;)[&#39;STOPWORDS&#39;][&#39;chinese&#39;]</span>

<span class="c1">#cntext2.1.7</span>
<span class="n">stopwords</span><span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;enzh_common_StopWords.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#keep only the Chinese characters in the MD&amp;A text</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="c1">#remove stopwords</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="c1">#join tokens into a space-separated string (Western-language text format)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>


<span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">transform</span><span class="p">)</span>
<span class="n">cv</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">min_df</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> 
<span class="c1"># build the sparse bag-of-words matrix</span>
<span class="c1">#dtm: document-term matrix</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">cv</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">])</span> 
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span><span class="o">=</span><span class="n">df_per_year</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>

<span class="c1">#L1-normalize each row vector</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">row</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1">#merge in the needed metadata columns as a new df</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_per_year</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;融资约束&#39;</span><span class="p">]],</span> <span class="n">dtm_per_year</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">dtm_per_year</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    CPU times: user 5.88 s, sys: 901 ms, total: 6.78 s
    Wall time: 49.7 s
</code></pre></div><p><img loading="lazy" src="img/df5.png" alt=""  />
</p>
<br>
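<p>The normalization lambda above divides each row of the document-term matrix by its row sum, turning raw counts into within-document term frequencies. The same operation on a small made-up count matrix (toy data, not the MD&amp;A corpus):</p>

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (made-up values)
dtm = pd.DataFrame([[2, 1, 0, 1],
                    [0, 3, 1, 0],
                    [1, 1, 1, 1]],
                   columns=['融资', '资金', '成本', '压力'])

# L1-normalize each row: counts become within-document term frequencies
dtm = dtm.apply(lambda row: row / np.sum(row), axis=1)
print(dtm.sum(axis=1))  # every row now sums to 1.0
```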
<h3 id="42--2023年的板块评分行业评分">4.2 Board and industry scores for 2023</h3>
<p>Compute the <strong>board score (BoardScore)</strong> and <strong>industry score (IndustryScore)</strong> for every firm in 2023. This part of the code runs slowly, taking roughly 2 hours in total.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="n">year</span> <span class="o">=</span> <span class="mi">2023</span>

<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output&#39;</span><span class="p">):</span>
    <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output&#39;</span><span class="p">)</span>
        

<span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="p">)):</span>
    <span class="n">code</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
    <span class="n">ind</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]</span>
    <span class="n">year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">]</span>
    <span class="n">board</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">]</span>
    
    
    <span class="n">Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">5</span><span class="p">:]</span>
    <span class="n">Ind_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">ind</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">Ind_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">Ind_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">Ind_Vec</span><span class="p">))</span>
    <span class="n">FinConstrain_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;融资约束&#39;</span><span class="p">]</span><span class="o">==</span><span class="kc">True</span><span class="p">]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">FinConstrain_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">FinConstrain_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">FinConstrain_Vec</span><span class="p">))</span>
    <span class="n">Board_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">board</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">Board_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">Board_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">Board_Vec</span><span class="p">))</span>
    

    <span class="n">dtm_per_year_melted</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;融资约束&#39;</span><span class="p">],</span>
                                            <span class="n">var_name</span><span class="o">=</span><span class="s1">&#39;word_id&#39;</span><span class="p">,</span> 
                                            <span class="n">value_name</span><span class="o">=</span><span class="s1">&#39;word_freq&#39;</span><span class="p">)</span>
    

    <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;word_id&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                 <span class="s1">&#39;word_freq&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_freq&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                 <span class="s1">&#39;ind_freq&#39;</span><span class="p">:</span> <span class="n">Ind_Score</span><span class="p">,</span>
                                 <span class="s1">&#39;board_freq&#39;</span><span class="p">:</span> <span class="n">Board_Score</span><span class="p">,</span>
                                 <span class="s1">&#39;fin_constrain_freq&#39;</span><span class="p">:</span> <span class="n">FinConstrain_Score</span><span class="p">})</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">code</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">board</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
    
    <span class="n">corporate_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">corporate_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;word_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;ind_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;board_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;fin_constrain_freq&#39;</span><span class="p">]]</span>
    
    <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">)):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">))</span>
    
    <span class="n">corporate_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/</span><span class="si">{year}</span><span class="s1">/</span><span class="si">{code}</span><span class="s1">.csv&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">,</span> <span class="n">code</span><span class="o">=</span><span class="n">code</span><span class="p">),</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;w&#39;</span><span class="p">)</span>
  
</code></pre></div><br>
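<p>Each of the three scores in the loop above is an element-wise product between the firm's L1-normalized word vector <code>Vec</code> and the re-normalized mean vector of a reference group (industry peers, same-board firms, or constrained firms). A toy check of that arithmetic with made-up three-word vectors:</p>

```python
import numpy as np

vec = np.array([0.5, 0.3, 0.2])              # firm's L1-normalized word vector
peers = np.array([[0.4, 0.4, 0.2],           # other firms in the same industry
                  [0.2, 0.5, 0.3]])

ind_vec = peers.mean(axis=0)                  # leave-one-out industry average
ind_score = vec * (ind_vec / ind_vec.sum())   # per-word similarity contributions
print(ind_score.sum())
```

Summing the per-word terms gives an overlap-style similarity between the firm's wording and its peer group's.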
<h3 id="43-计算所有年份板块评分行业评分">4.3 Computing board and industry scores for all years</h3>
<p><strong>Running this section in full takes about 20 hours.</strong></p>
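<p>Before committing to a run that long, the regex tagger is worth sanity-checking on a couple of sentences. The two patterns below are copied verbatim from the <code>is_financial_constraint</code> function in this section; the test sentences are invented:</p>

```python
import re

regex1 = r"(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*"
regex2 = r"(?:融资|资金|筹资)[^。]{0,6}?(?:成本|压力|难度)[^。]{0,4}?(?:升|增|高|大)[^。]*"
pattern = "(" + regex1 + ")|(" + regex2 + ")"

def is_financial_constraint(text):
    # a non-empty match flags the text as financing-constrained
    return len(re.findall(pattern, text)) >= 1

print(is_financial_constraint('报告期内公司融资难度持续加大。'))  # True (融资…难度…大)
print(is_financial_constraint('公司经营状况良好，现金流充足。'))  # False (no trigger phrase)
```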
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">jieba</span>



<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output&#39;</span><span class="p">):</span>
    <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output&#39;</span><span class="p">)</span>
    
    
    
<span class="c1">#cntext1.x</span>
<span class="c1">#stopwords = ct.load_pkl_dict(&#39;STOPWORDS.pkl&#39;)[&#39;STOPWORDS&#39;][&#39;chinese&#39;]</span>
<span class="c1">#cntext2.x</span>
<span class="n">stopwords</span><span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;enzh_common_StopWords.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>



<span class="k">def</span> <span class="nf">is_financial_constraint</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#regular-expression patterns</span>
    <span class="n">regex1</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:难以|不能|无法|不足以)[^。]*&#34;</span>
    <span class="n">regex2</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(?:融资|资金|筹资)[^。]{0,6}?(?:成本|压力|难度)[^。]{0,4}?(?:升|增|高|大)[^。]*&#34;</span>
    <span class="n">pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;(&#34;</span> <span class="o">+</span> <span class="n">regex1</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)|(&#34;</span> <span class="o">+</span> <span class="n">regex2</span> <span class="o">+</span> <span class="sa">r</span><span class="s2">&#34;)&#34;</span>
    
    <span class="c1">#a non-empty match means financing-constrained (True); otherwise False</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span><span class="o">&gt;=</span><span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">True</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">False</span>
    


<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#keep only the Chinese characters in the MD&amp;A text</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="c1">#remove stopwords</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="c1">#join tokens into a space-separated string (Western-language text format)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>

    
<span class="k">def</span> <span class="nf">plate</span><span class="p">(</span><span class="n">code</span><span class="p">):</span>
    <span class="c1">#determine whether the stock is listed on the Shanghai or Shenzhen exchange</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A6&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A9&#39;</span><span class="p">):</span>
        <span class="k">return</span> <span class="s1">&#39;上海&#39;</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A0&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">code</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;A3&#39;</span><span class="p">):</span>
        <span class="k">return</span> <span class="s1">&#39;深圳&#39;</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="s1">&#39;其他&#39;</span>

    
    

<span class="c1">#load the MD&amp;A data</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


<span class="c1">#industry information for listed companies</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.xlsx&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">])</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[</span><span class="n">ind_info_df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_info_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="n">date</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">:</span><span class="s1">&#39;行业代码&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]]</span>

<span class="c1">#merge the datasets</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">plate</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">([</span><span class="s1">&#39;上海&#39;</span><span class="p">,</span> <span class="s1">&#39;深圳&#39;</span><span class="p">])]</span>


    
<span class="c1">#flag financing constraints</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;融资约束&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">is_financial_constraint</span><span class="p">)</span>




<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">df_per_year</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">year</span><span class="p">]</span>
    <span class="n">df_per_year</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">transform</span><span class="p">)</span>
    <span class="n">cv</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">min_df</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> 
    <span class="c1"># build the sparse bag-of-words matrix</span>
    <span class="c1">#dtm: document-term matrix</span>
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">cv</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">])</span> 
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span><span class="o">=</span><span class="n">df_per_year</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
    <span class="c1">#L1-normalize each row vector</span>
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">row</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># merge the id columns with the dtm into a new DataFrame</span>
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_per_year</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;融资约束&#39;</span><span class="p">]],</span> <span class="n">dtm_per_year</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    
    
    <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="p">)),</span> <span class="n">desc</span><span class="o">=</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s1">进度&#39;</span><span class="p">):</span>
        <span class="n">code</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
        <span class="n">ind</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]</span>
        <span class="n">year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">]</span>
        <span class="n">board</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">]</span>


        <span class="n">Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">5</span><span class="p">:]</span>
        <span class="n">Ind_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">ind</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">Ind_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">Ind_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">Ind_Vec</span><span class="p">))</span>
        <span class="n">FinConstrain_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;融资约束&#39;</span><span class="p">]</span><span class="o">==</span><span class="kc">True</span><span class="p">]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">FinConstrain_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">FinConstrain_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">FinConstrain_Vec</span><span class="p">))</span>
        <span class="n">Board_Vec</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">board</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">Board_Score</span> <span class="o">=</span> <span class="n">Vec</span> <span class="o">*</span> <span class="p">(</span><span class="n">Board_Vec</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">Board_Vec</span><span class="p">))</span>


        <span class="n">dtm_per_year_melted</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;融资约束&#39;</span><span class="p">],</span>
                                                <span class="n">var_name</span><span class="o">=</span><span class="s1">&#39;word_id&#39;</span><span class="p">,</span> 
                                                <span class="n">value_name</span><span class="o">=</span><span class="s1">&#39;word_freq&#39;</span><span class="p">)</span>


        <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;word_id&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                     <span class="s1">&#39;word_freq&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_freq&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                     <span class="s1">&#39;ind_freq&#39;</span><span class="p">:</span> <span class="n">Ind_Score</span><span class="p">,</span>
                                     <span class="s1">&#39;board_freq&#39;</span><span class="p">:</span> <span class="n">Board_Score</span><span class="p">,</span>
                                     <span class="s1">&#39;fin_constrain_freq&#39;</span><span class="p">:</span> <span class="n">FinConstrain_Score</span><span class="p">})</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">code</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;板块&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">board</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>

        <span class="n">corporate_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">corporate_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;word_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;ind_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;board_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;fin_constrain_freq&#39;</span><span class="p">]]</span>
        
        <span class="c1"># ensure the output directory exists (creates parent dirs too)</span>
        <span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">),</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        
        <span class="n">corporate_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/</span><span class="si">{year}</span><span class="s1">/</span><span class="si">{code}</span><span class="s1">.csv&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">,</span> <span class="n">code</span><span class="o">=</span><span class="n">code</span><span class="p">),</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;w&#39;</span><span class="p">)</span>
             
</code></pre></div><p><br><br></p>
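<p>The scoring step inside the loop above — multiplying a firm's normalized word-frequency vector elementwise by the re-normalized mean vector of its peers — can be sketched on toy data. This is an illustrative simplification (hypothetical firm codes, no id columns), not the full pipeline:</p>

```python
import numpy as np
import pandas as pd

# three firms' row-normalized word-frequency vectors (toy data)
dtm = pd.DataFrame([[0.5, 0.25, 0.25, 0.0],
                    [0.4, 0.40, 0.20, 0.0],
                    [0.1, 0.00, 0.00, 0.9]],
                   index=['A', 'B', 'C'])

code = 'A'
peers = dtm.drop(index=code)             # the other firms in the same group
peer_mean = peers.mean(axis=0)           # group-average vector (like Ind_Vec)
peer_mean = peer_mean / peer_mean.sum()  # re-normalize, as in the loop
score = dtm.loc[code] * peer_mean        # elementwise overlap score (like Ind_Score)
```

<p>A word contributes to the score only when both the firm and its peers use it, so the total of <code>score</code> is high for firms whose wording tracks the group average.</p>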
<h3 id="44-融资约束2023">4.4 Financing Constraints, 2023</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    - ConstrainedScore_it = β0 + β1 * BoardScore_bt + β2 * IndustryScore_it + E_it
    - BoardScore_bt   financing-constraint similarity induced by the exchange board
    - IndustryScore_it   financing-constraint similarity induced by industry characteristics
    - E_it   the residual, which is the financing-constraint measure [FC] computed in this article
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">csv_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/2023/A000001.csv&#39;</span><span class="p">,</span>  <span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;股票代码&#39;</span><span class="p">:</span> <span class="nb">str</span><span class="p">})</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df6.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># rename the columns</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;Vec&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryScore&#39;</span><span class="p">,</span> <span class="s1">&#39;BoardScore&#39;</span><span class="p">,</span> <span class="s1">&#39;ConstrainedScore&#39;</span><span class="p">]</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df7.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">smf</span>

<span class="c1"># dependent variable: ConstrainedScore</span>
<span class="c1"># explanatory variables: IndustryScore, BoardScore</span>
<span class="n">formula</span> <span class="o">=</span> <span class="s1">&#39;ConstrainedScore ~ IndustryScore + BoardScore&#39;</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">csv_df</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">OLS Regression Results                            
==============================================================================
Dep. Variable:       ConstrainedScore   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                 1.612e+05
Date:                Sat, 27 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:12:31   Log-Likelihood:                 64496.
No. Observations:                4703   AIC:                        -1.290e+05
Df Residuals:                    4700   BIC:                        -1.290e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P&gt;|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept     -1.534e-08   3.92e-09     -3.914      0.000    -2.3e-08   -7.65e-09
IndustryScore     0.1173      0.002     60.638      0.000       0.114       0.121
BoardScore        1.0034      0.007    139.246      0.000       0.989       1.018
==============================================================================
Omnibus:                     9389.385   Durbin-Watson:                   1.795
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         35835031.254
Skew:                         -15.930   Prob(JB):                         0.00
Kurtosis:                     429.445   Cond. No.                     1.90e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># financing constraint FC: sum of absolute residuals</span>
<span class="n">FC</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">resid</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;2023年 A000001融资约束指标 FC: </span><span class="si">{}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">FC</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2023年 A000001融资约束指标 FC: 0.00020066158329792454
</code></pre></div><br>
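<p>The same construction — regress the word-level constrained score on the industry and board scores, then sum the absolute OLS residuals — can also be reproduced with plain NumPy. A self-contained sketch on synthetic data (the coefficients 0.1 and 1.0 below are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
industry = rng.random(n)                  # stand-in for IndustryScore
board = rng.random(n)                     # stand-in for BoardScore
constrained = 0.1 * industry + 1.0 * board + rng.normal(scale=0.01, size=n)

# OLS with an intercept, via least squares
X = np.column_stack([np.ones(n), industry, board])
beta, *_ = np.linalg.lstsq(X, constrained, rcond=None)
resid = constrained - X @ beta

FC = np.abs(resid).sum()  # the financing-constraint measure: sum of |residuals|
```

<p>With statsmodels, <code>result.resid</code> plays the role of <code>resid</code> here; summing absolute values keeps positive and negative deviations from cancelling out.</p>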
<h3 id="45-融资约束2001-2023">4.5 Financing Constraints, 2001-2023</h3>
<p>Step 4.4 produced the 2023 financing-constraint measure FC; we now extend the computation to 2001-2023 and store the results in <em><strong>fin_constrain2001-2023.csv</strong></em>, a csv with three fields: <em><strong>code</strong></em>, <em><strong>year</strong></em>, and <em><strong>FC</strong></em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">smf</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;fin_constrain2001-2023.csv&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvf</span><span class="p">:</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">,</span> <span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;FC&#39;</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;fin_constrain_output/*/*.csv&#39;</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">df_</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
            <span class="n">df_</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;板块&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;Vec&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryScore&#39;</span><span class="p">,</span> <span class="s1">&#39;BoardScore&#39;</span><span class="p">,</span> <span class="s1">&#39;ConstrainedScore&#39;</span><span class="p">]</span>
            <span class="n">formula</span> <span class="o">=</span> <span class="s1">&#39;ConstrainedScore ~ IndustryScore + BoardScore&#39;</span>
            <span class="n">model</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">df_</span><span class="p">)</span>
            <span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
            <span class="n">FC</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">resid</span><span class="p">))</span>
            <span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
                    <span class="s1">&#39;code&#39;</span><span class="p">:</span> <span class="n">df_</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span>
                    <span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">df_</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span>
                    <span class="s1">&#39;FC&#39;</span><span class="p">:</span> <span class="n">FC</span>
                <span class="p">}</span>
            <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
            <span class="c1"># skip files that fail to load or fit</span>
            <span class="k">pass</span>


</code></pre></div><br>
<p>Finally, take a look at (and admire) the resulting financing-constraint data <em><strong>fin_constrain2001-2023.csv</strong></em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fc_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;fin_constrain2001-2023.csv&#39;</span><span class="p">)</span>
<span class="n">fc_df</span>
</code></pre></div><p><img loading="lazy" src="img/df8.png" alt=""  />
</p>
<br>
<br>
<h2 id="五获取资料">5. Obtaining the Materials</h2>
<p>Preparing the data &amp; code takes real effort. If you need the source code and data, add WeChat 372335839 with the note 「name-school-major」.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Bundle: 100 RMB, includes
   - Management Discussion &amp; Analysis (mda01-23.csv.gz), 上市公司基本信息2000-2023.xlsx
   - computed results (fin_constrain2001-2023.csv)


Individually
  - 100 RMB  Management Discussion &amp; Analysis (mda01-23.csv.gz), 上市公司基本信息2000-2023.xlsx
  - 50 RMB   computed results (fin_constrain2001-2023.csv)
</code></pre></div><br>
<p><img loading="lazy" src="img/screen.png" alt=""  />
</p>
<p><img loading="lazy" src="img/size.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="相关内容">Related Content</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001-2023年A股上市公司年报&amp;管理层讨论与分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/">数据集 | A股上市公司基本信息2000-2022</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | MD&amp;A信息含量指标构建代码实现</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>中国工业经济 | 使用Python测量MD&amp;A信息含量指标</title>
      <link>https://textdata.cn/blog/2023-01-06-mda_informative_content/</link>
      <pubDate>Sun, 21 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-01-06-mda_informative_content/</guid>
      <description>Each listed company&#39;s MD&amp;amp;A inevitably resembles, to some degree, the MD&amp;amp;A of other companies in the same industry and of companies in other industries; some companies may even copy other companies&#39; wording directly. Information that duplicates or resembles what other companies in the same industry or in other industries disclose can be defined as carrying no information content, while information that differs is the truly informative part, called information content for short. In this paper, we discuss the impact of the informative content of Management Discussion and Analysis (MD&amp;amp;A) on stock price crash risk using text vectorization. Using the MD&amp;amp;A in annual reports of China A-share listed firms from 2007 to 2015, we find that the informative content of MD&amp;amp;A can reduce future stock price crash risk; the informative content of the preview section has a significant effect on crash risk, while that of the review section does not. After controlling for endogeneity, the conclusions still stand. Further, we study the influence of the preview section&#39;s informative content on crash risk from the aspects of readability and information opaqueness: the higher the readability and the higher the information opaqueness, the greater the impact of informative content on crash risk. Finally, after changing the calculation of crash risk and controlling for the impact of stock price synchronicity, the informative content of the preview section still reduces crash risk. This paper enriches the study of the influencing factors of stock price crash risk and improves the study of the usefulness of MD&amp;amp;A from the perspective of incremental information, which has important theoretical and practical significance.</description>
      <content:encoded><![CDATA[<p>Because every actor is shaped by its <strong>surrounding environment</strong> and its <strong>own experience</strong> (cognition), whatever it publishes inevitably mixes common <strong>environmental information</strong> with <strong>idiosyncratic information</strong>. How can text represent both the common and the idiosyncratic information it carries, and how can we measure the information content of what an actor publishes? With these questions in mind, let's read the methodology section of this 2017 paper and implement it in Python.</p>
<p><br><br></p>
<h2 id="一信息含量">1. Information Content</h2>
<p>Each company's MD&amp;A contains not only historical, firm-specific information such as operating conditions, but also information shared with other companies, such as the external environment, market structure, and risk factors. Following Hanley and Hoberg (2010), the paper therefore examines and defines the information content of a company's MD&amp;A along two dimensions: industry and market.</p>
<ul>
<li><strong>Market factors</strong>: all listed companies operate under the same macroeconomic environment, risk factors, and political and policy backdrop;</li>
<li><strong>Industry factors</strong>: listed companies within the same industry face similar industrial policies, competitive environments, and market characteristics.</li>
</ul>
<p>It follows that each listed company's MD&amp;A is inevitably somewhat similar to those of other companies in the same industry and of companies in other industries, and some companies may even copy other companies' MD&amp;A wording directly. <strong>Information that duplicates or resembles what other companies in the same industry or in other industries disclose can be defined as carrying no information content, while information that differs is the truly informative part, called information content for short</strong>.</p>
<br>
<blockquote>
<p>孟庆斌, 杨俊华, and 鲁冰. &ldquo;管理层讨论与分析披露的信息含量与股价崩盘风险——基于文本向量化方法的研究.&rdquo; 中国工业经济 12 (2017): 132-150.</p>
</blockquote>
<br>
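<p>As a rough numerical illustration of this definition (my own sketch, not code from the paper): treat the part of a firm's word-frequency vector that overlaps with the peer average as non-informative, and use one minus the cosine similarity to the peer average as a crude information-content score.</p>

```python
import numpy as np

def info_content(firm_vec, peer_vecs):
    """1 - cosine similarity between a firm's normalized
    word-frequency vector and its peers' average vector."""
    firm = np.asarray(firm_vec, dtype=float)
    peer = np.asarray(peer_vecs, dtype=float).mean(axis=0)
    cos = firm @ peer / (np.linalg.norm(firm) * np.linalg.norm(peer))
    return 1.0 - cos

peers = [[0.5, 0.25, 0.25, 0.0], [0.5, 0.25, 0.25, 0.0]]
same = info_content([0.5, 0.25, 0.25, 0.0], peers)  # wording matches peers
diff = info_content([0.0, 0.0, 0.0, 1.0], peers)    # wording unlike peers
```

<p>A firm that repeats its peers' wording scores near 0; a firm with distinctive wording scores near 1.</p>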
<h3 id="11-摘要">1.1 Abstract</h3>
<p>Using text vectorization, this paper measures the information content disclosed in the Management Discussion and Analysis (MD&amp;A) of the annual reports of Chinese A-share listed companies from 2007 to 2015 and studies its effect on stock price crash risk. We find that the higher the information content of the MD&amp;A, the lower the future crash risk. Splitting the MD&amp;A into a review section and an outlook section, only the information content of the outlook section significantly reduces future crash risk. The conclusions hold after controlling for endogeneity. We further examine how text readability and information asymmetry moderate this relationship: the higher the readability and the greater the information asymmetry, the stronger the effect of the outlook section's information content in reducing crash risk. After redefining the window over which crash risk is computed and controlling for stock price synchronicity, the outlook section's information content still significantly reduces crash risk, indicating that the conclusions are robust. The paper enriches the research on the determinants of stock price crash risk from the angle of textual information and advances the study of MD&amp;A usefulness from the angle of incremental information, with important theoretical and practical implications.</p>
<br>
<h3 id="12-样本选择和处理">1.2 Sample Selection and Processing</h3>
<p>The sample is the MD&amp;A sections of Chinese listed companies' annual reports from 2007 to 2015. The sample starts in 2007 because MD&amp;A disclosure requirements in periodic reports were fairly mature by then, and 2007 is also the key point at which Chinese accounting standards converged with international standards: the newly issued Accounting Standards for Business Enterprises took effect that year, so starting in 2007 avoids effects from differences between the old and new standards.</p>
<p>The annual reports come from cninfo (巨潮资讯网). The data are processed as follows:</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(1) Exclude financial-industry firms, ST and *ST firms, and firms listed for less than one year.

(2) Extract the review and outlook sections from each MD&amp;A and save them as separate review and outlook files; reports that could not be parsed automatically were collected by hand.

(3) Text processing - vectorization. Following Hanley and Hoberg (2010), represent each MD&amp;A text as a vector whose elements are the frequencies of the words in the text; an MD&amp;A containing 10000 words thus corresponds to a 10000x1 vector. A toy example: one simplified MD&amp;A says "we produce potatoes and produce corn", another says "we produce furniture". Dropping the conjunction "and" and the pronoun "we" leaves four words: "produce", "potatoes", "corn", "furniture". In the first text, "produce", "potatoes" and "corn" appear 2, 1 and 1 times while "furniture" appears 0 times, so its vector is {2, 1, 1, 0}; likewise the second text's vector is {1, 0, 0, 1}.

(4) Vector normalization. Vectorized texts of different lengths are not directly comparable: a word tends to appear more often in a long text than in a short one, which does not mean the long text carries more information. Each vector is therefore divided by the total word count of its text. In the example above, the two normalized vectors become {0.50, 0.25, 0.25, 0} and {0.50, 0, 0, 0.50}.
</code></pre></div><br>
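<p>Steps (3) and (4) above map directly onto a few lines of Python. A minimal sketch of the count-then-normalize procedure on the two toy texts (English stand-ins for the example words):</p>

```python
from collections import Counter

def to_normalized_vector(tokens, vocab):
    """Count each vocabulary word, then divide by the text's total count."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]

vocab = ['produce', 'potato', 'corn', 'furniture']
doc1 = ['produce', 'potato', 'produce', 'corn']  # "we produce potatoes and produce corn"
doc2 = ['produce', 'furniture']                  # "we produce furniture"

v1 = to_normalized_vector(doc1, vocab)  # [0.5, 0.25, 0.25, 0.0]
v2 = to_normalized_vector(doc2, vocab)  # [0.5, 0.0, 0.0, 0.5]
```

<p>In the real pipeline, <code>CountVectorizer</code> builds the counts and the row-wise division handles the normalization.</p>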
<h3 id="13-文件目录">1.3 Directory Layout</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">管理层讨论信息含量/
├── 代码.ipynb
├── data/
│   ├── 上市公司基本信息2000-2023.xlsx
│   └── mda01-23.csv.gz
├── mda_infor2001-2023.csv
├── mda_infor_output/
│   └── 2023/
│       ├── A000002.csv
│       ├── A000004.csv
│       ├── A000005.csv
│       ├── A000006.csv
│       ├── ...
│   └── 2022/
│       ├── A000002.csv
│       ├── A000004.csv
│       ├── A000005.csv
│       ├── A000006.csv
│       ├── ...
│   └── 2021/
│       ├── A000002.csv
│       ├── A000004.csv
│       ├── A000005.csv
│       ├── A000006.csv
│       ├── ...
│   └── ...
</code></pre></div><p><br><br></p>
<h2 id="二导入数据">2. Loading the Data</h2>
<p>Here we have prepared the 2001-2023 A-share MD&amp;A texts together with industry-code data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># load the MD&amp;A data</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


<span class="c1">#上市公司行业信息</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.xlsx&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">])</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[</span><span class="n">ind_info_df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_info_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="n">date</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">:</span><span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">:</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">]]</span>

<span class="c1">#合并数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>

<span class="c1"># 剔除金融行业处理</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;J&#34;</span><span class="p">)]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票简称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;ST&#34;</span><span class="p">)]</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三以2023年为例">三、以2023年为例</h2>
<p>写代码宜先局部后整体：以2023年为例，若2023年可以成功计算出信息含量，再用for循环推广到所有股票、所有年份。本章节需要完成：</p>
<ol>
<li>选定某年份，以2023年为例</li>
<li>定义transform函数，用于处理「经营讨论与分析内容」字段内的内容。</li>
<li>文本向量化，向量标准化。</li>
</ol>
<br>
<h3 id="31-选定2023年">3.1 选定2023年</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df_per_year</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;2023&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>  <span class="c1">#copy避免后续赋值触发SettingWithCopyWarning</span>
<span class="n">df_per_year</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df_per_year</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<h3 id="32-定义transform函数">3.2 定义transform函数</h3>
<p>定义 <em><strong>transform</strong></em> 函数，该函数可以处理「<em><strong>经营讨论与分析内容</strong></em>」字段内容，使其:</p>
<ol>
<li>只保留中文内容</li>
<li>剔除停用词</li>
<li>整理为用空格间隔的字符串(类西方语言文本格式)</li>
</ol>
<p>之后通过 <strong>apply</strong> 方法，将 <em><strong>transform</strong></em> 函数应用于 <em><strong>df_per_year[&lsquo;经营讨论与分析内容&rsquo;]</strong></em>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
    
<span class="c1">#cntext1.9.2</span>
<span class="c1">#stopwords = ct.load_pkl_dict(&#39;STOPWORDS.pkl&#39;)[&#39;STOPWORDS&#39;][&#39;chinese&#39;]</span>

<span class="c1">#cntext2.1.7</span>
<span class="n">stopwords</span><span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;enzh_common_StopWords.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>

    


<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#只保留md&amp;a中的中文内容</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="c1">#剔除停用词</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="c1">#整理为用空格间隔的字符串(类西方语言文本格式)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>


<span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">transform</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    Building prefix dict from the default dictionary ...
    Loading model from cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache
    Loading model cost 0.556 seconds.
    Prefix dict has been built successfully.
</code></pre></div><br>
<h3 id="33-文本向量化">3.3 文本向量化</h3>
<p>本小节要做:</p>
<ol>
<li>文本向量化</li>
<li>向量标准化</li>
<li>合并多个字段为新的df</li>
</ol>
<p>先将df_per_year[&lsquo;clean_text&rsquo;] 向量化，代码如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>

<span class="n">cv</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">min_df</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> 
<span class="c1"># 生成稀疏bow矩阵</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">cv</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">])</span> 
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span><span class="o">=</span><span class="n">df_per_year</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
<span class="n">dtm_per_year</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 4.09 s, sys: 109 ms, total: 4.2 s
Wall time: 4.2 s
</code></pre></div><p><img loading="lazy" src="img/df4.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1">#向量标准化</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">div</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>  <span class="c1">#按行除以行和，向量化写法比逐行apply更快，结果相同</span>
<span class="n">dtm_per_year</span>
</code></pre></div><p><img loading="lazy" src="img/df5.png" alt=""  />
</p>
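<p>标准化之后，每一行词频之和应为1。可以用引文例子中的两个词频向量构造一个小矩阵快速验证（仅作示意）：</p>

```python
import numpy as np
import pandas as pd

# 引文例子中的两个词频向量
dtm = pd.DataFrame([[2, 1, 1, 0],
                    [1, 0, 0, 1]])
# 按行标准化，等价于逐行 row/np.sum(row)
dtm_norm = dtm.div(dtm.sum(axis=1), axis=0)

assert np.allclose(dtm_norm.sum(axis=1), 1.0)  # 每行之和为1
print(dtm_norm.values.tolist())  # [[0.5, 0.25, 0.25, 0.0], [0.5, 0.0, 0.0, 0.5]]
```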
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#合并多个字段为新的df</span>
<span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_per_year</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]],</span> <span class="n">dtm_per_year</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">dtm_per_year</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df6.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四计算2023年行业向量市场向量">四、计算2023年行业向量、市场向量</h2>
<p>计算2023年所有公司的市场向量、行业向量。这里行业向量取同行业其他公司标准化词频向量的均值，市场向量取其他行业公司标准化词频向量的均值，并将每家公司的结果按「年份/股票代码」逐一存为csv文件。</p>
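<p>行业向量本质上是一个分组均值。下面用4家公司、2个行业、词表只有2个词的玩具数据演示「同行业内不含本公司的均值（留一均值）」的算法。groupby写法仅为示意，正文沿用与论文步骤对应的逐行循环：</p>

```python
import pandas as pd

# 玩具数据：4家公司、2个行业，列0和列1为标准化词频
dtm = pd.DataFrame({'股票代码': ['A1', 'A2', 'B1', 'B2'],
                    '行业代码': ['C', 'C', 'D', 'D'],
                    0: [0.2, 0.4, 0.6, 0.8],
                    1: [0.8, 0.6, 0.4, 0.2]})

grp = dtm.groupby('行业代码')[[0, 1]]
# 留一均值：行业内词频之和减去本公司，再除以(行业内公司数-1)
ind_vec = (grp.transform('sum') - dtm[[0, 1]]) / (grp.transform('count') - 1)
print(ind_vec[0].round(6).tolist())  # [0.4, 0.2, 0.8, 0.6]
```

<p>即A1的行业向量只由同行业的A2决定，反之亦然。</p>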
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;mda_infor_output&#39;</span><span class="p">):</span>
    <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;mda_infor_output&#39;</span><span class="p">)</span>
    

<span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="p">)),</span> <span class="n">desc</span><span class="o">=</span><span class="s2">&#34;会计年度2023进度&#34;</span><span class="p">):</span>
    <span class="n">code</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
    <span class="n">ind</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]</span>
    <span class="n">year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">]</span>
    
    <span class="n">ind_freq</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">ind</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">3</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>  <span class="c1">#行业向量：同行业内除本公司外其他公司的均值</span>
    <span class="n">market_freq</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">ind</span><span class="p">]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">3</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1">#直接取本公司一行词频，避免每次循环melt整个矩阵</span>
    <span class="n">firm_freq</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">3</span><span class="p">:]</span>
    <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;word_id&#39;</span><span class="p">:</span> <span class="n">firm_freq</span><span class="o">.</span><span class="n">index</span><span class="p">,</span>
                                 <span class="s1">&#39;word_freq&#39;</span><span class="p">:</span> <span class="n">firm_freq</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                 <span class="s1">&#39;ind_freq&#39;</span><span class="p">:</span> <span class="n">ind_freq</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                 <span class="s1">&#39;market_freq&#39;</span><span class="p">:</span> <span class="n">market_freq</span><span class="o">.</span><span class="n">values</span><span class="p">})</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">code</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind</span>
    <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
    <span class="n">corporate_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">corporate_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;word_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;ind_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;market_freq&#39;</span><span class="p">]]</span>

    
    <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">)):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">))</span>
    <span class="n">corporate_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">/</span><span class="si">{code}</span><span class="s1">.csv&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">,</span> <span class="n">code</span><span class="o">=</span><span class="n">code</span><span class="p">),</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">会计年度2023进度: 100%|███████████████████| 2699/2699 [1:00:41&lt;00:00,  1.35s/it]
CPU times: user 55min 56s, sys: 4min 33s, total: 1h 29s
Wall time: 1h 41s
</code></pre></div><p>从运行的进度条可知，2023年符合规则的记录有2699条，运行时间约1小时（进度条显示1:00:41）。</p>
<p><br><br></p>
<h2 id="五计算2001-2023年所有公司行业向量市场向量">五、计算2001-2023年所有公司行业向量、市场向量</h2>
<p>信息含量的定义。每个公司的 MD&amp;A 中不仅包括公司经营状况等历史信息，也包括与其他公司相似的信息，如外部环境、市场格局、风险因素等内容。因此，本文参考 Hanley and Hoberg（2010），从行业和市场两个维度来考察和定义公司 MD&amp;A 中的信息含量。</p>
<ul>
<li><strong>市场因素</strong>：所有上市公司都处于相同的宏观经济环境、风险因素和政治、政策背景之下；</li>
<li><strong>行业因素</strong>：同一行业中的各上市公司又面临着相似的产业政策、竞争环境和市场特征。</li>
</ul>
<p>由此可见，每个上市公司 MD&amp;A 信息不可避免地在某种程度上与同行业其他上市公司以及市场其他行业上市公司存在一定的相似性，甚至某些公司可能直接参考其他公司 MD&amp;A 的表述。</p>
<p><img loading="lazy" src="img/norm_ind_market.png" alt=""  />
</p>
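<p>截图中信息含量的思想可以概括为：把公司标准化词频向量对行业向量、市场向量做线性回归，无法被二者解释的残差部分即公司特有的信息。下面给出一个极简的numpy示意，数据随机生成，变量命名与具体度量形式均为假设，以论文原文为准：</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_words = 100

# 随机构造标准化词频向量：行业向量、市场向量、公司向量(含公司特有成分)
ind_freq = rng.random(n_words);    ind_freq /= ind_freq.sum()
market_freq = rng.random(n_words); market_freq /= market_freq.sum()
firm_freq = 0.5*ind_freq + 0.3*market_freq + 0.2*rng.random(n_words)/n_words

# 最小二乘回归：firm = a + b1*ind + b2*market + 残差
X = np.column_stack([np.ones(n_words), ind_freq, market_freq])
coef, *_ = np.linalg.lstsq(X, firm_freq, rcond=None)
resid = firm_freq - X @ coef

# 残差代表无法被行业、市场解释的公司特有信息
info_content = np.abs(resid).sum()
print(coef.shape, float(info_content) >= 0)
```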
<p>参考文中截图给出的行业向量、市场向量计算方法，编写如下代码。<strong>该部分代码运行较慢，全部运行下来大约10小时。</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>


<span class="c1">#检查是否有文件夹mda_infor_output，如果没有就新建一个</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;mda_infor_output&#39;</span><span class="p">):</span>
    <span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s1">&#39;mda_infor_output&#39;</span><span class="p">)</span>
    
    
<span class="c1">#cntext1.9.2</span>
<span class="c1">#stopwords = ct.load_pkl_dict(&#39;STOPWORDS.pkl&#39;)[&#39;STOPWORDS&#39;][&#39;chinese&#39;]</span>

<span class="c1">#cntext2.1.7</span>
<span class="n">stopwords</span><span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;enzh_common_StopWords.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#只保留md&amp;a中的中文内容</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>
    <span class="c1">#剔除停用词</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="c1">#整理为用空格间隔的字符串(类西方语言文本格式)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>



<span class="c1">#读取md&amp;a</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>


<span class="c1">#上市公司行业信息</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/上市公司基本信息2000-2023.xlsx&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">])</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[</span><span class="n">ind_info_df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">!=</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
<span class="n">ind_info_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="o">.</span><span class="n">EndDate</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">date</span><span class="p">:</span> <span class="n">date</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">ind_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Symbol&#39;</span><span class="p">:</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCodeC&#39;</span><span class="p">:</span><span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">:</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ind_info_df</span> <span class="o">=</span> <span class="n">ind_info_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;股票简称&#39;</span><span class="p">]]</span>

<span class="c1">#合并数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">ind_info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>

<span class="c1"># 剔除金融行业处理</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;J&#34;</span><span class="p">)]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票简称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;ST&#34;</span><span class="p">)]</span>



 
<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">df_per_year</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">year</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">df_per_year</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;经营讨论与分析内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">transform</span><span class="p">)</span>
    

    <span class="n">cv</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">min_df</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> 
    <span class="c1"># Build the sparse bag-of-words (BOW) matrix</span>
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">cv</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df_per_year</span><span class="p">[</span><span class="s1">&#39;clean_text&#39;</span><span class="p">])</span> 
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="o">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span><span class="o">=</span><span class="n">df_per_year</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">row</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="n">dtm_per_year</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_per_year</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]],</span> <span class="n">dtm_per_year</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dtm_per_year</span><span class="p">)),</span> <span class="n">desc</span><span class="o">=</span><span class="sa">f</span><span class="s2">&#34;会计年度</span><span class="si">{</span><span class="n">year</span><span class="si">}</span><span class="s2">进度&#34;</span><span class="p">):</span>
        <span class="n">code</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">]</span>
        <span class="n">ind</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">]</span>
        <span class="n">year</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">]</span>



        <span class="n">ind_freq</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">ind</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">code</span><span class="p">)]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">3</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">market_freq</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="p">[</span><span class="n">dtm_per_year</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span><span class="o">!=</span><span class="n">ind</span><span class="p">]</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">3</span><span class="p">:]</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        
        <span class="n">dtm_per_year_melted</span> <span class="o">=</span> <span class="n">dtm_per_year</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">],</span>
                                                <span class="n">var_name</span><span class="o">=</span><span class="s1">&#39;word_id&#39;</span><span class="p">,</span> 
                                                <span class="n">value_name</span><span class="o">=</span><span class="s1">&#39;word_freq&#39;</span><span class="p">)</span>
        <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s1">&#39;word_id&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                       <span class="s1">&#39;word_freq&#39;</span><span class="p">:</span> <span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="n">dtm_per_year_melted</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">code</span><span class="p">][</span><span class="s1">&#39;word_freq&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
                                       <span class="s1">&#39;ind_freq&#39;</span><span class="p">:</span> <span class="n">ind_freq</span><span class="p">,</span>
                                       <span class="s1">&#39;market_freq&#39;</span><span class="p">:</span><span class="n">market_freq</span><span class="p">})</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">code</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;行业代码&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">ind</span>
        <span class="n">corporate_df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">year</span>
        <span class="n">corporate_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="n">corporate_df</span> <span class="o">=</span> <span class="n">corporate_df</span><span class="p">[[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;word_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;ind_freq&#39;</span><span class="p">,</span> <span class="s1">&#39;market_freq&#39;</span><span class="p">]]</span>
        
        
        <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">)):</span>
            <span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">))</span>
        <span class="n">corporate_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">/</span><span class="si">{code}</span><span class="s1">.csv&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year</span><span class="p">,</span> <span class="n">code</span><span class="o">=</span><span class="n">code</span><span class="p">),</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/y0/4gqxky0s2t94x1c1qhlwr6100000gn/T/jieba.cache
Loading model cost 0.281 seconds.
Prefix dict has been built successfully.
会计年度2001进度: 100%|█████████████████████| 1038/1038 [04:35&lt;00:00,  3.77it/s]
会计年度2002进度: 100%|█████████████████████| 1073/1073 [04:53&lt;00:00,  3.65it/s]
会计年度2003进度: 100%|█████████████████████| 1102/1102 [05:41&lt;00:00,  3.22it/s]
......
会计年度2021进度: 100%|███████████████████| 4412/4412 [2:51:33&lt;00:00,  2.33s/it]
会计年度2022进度: 100%|███████████████████| 4880/4880 [3:23:30&lt;00:00,  2.50s/it]
会计年度2023进度: 100%|███████████████████| 2699/2699 [4:10:30&lt;00:00,  2.45s/it]
</code></pre></div><p>DaDeng ran this on a machine with 96 GB of RAM, and the full run took about 12 hours. A typical machine has 16 GB of RAM and will likely be slower; expect roughly 12 to 20 hours.</p>
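<p>The core of the per-year loop above can be condensed into a small sketch on made-up data (the firm codes, industry codes, and counts below are invented): row-normalize the document-term counts, then take the industry mean excluding the focal firm and the market mean over the other industries. Combining the two conditions with <code>&amp;</code> also avoids pandas' chained boolean-indexing warning.</p>

```python
import numpy as np
import pandas as pd

# Toy document-term counts for four firms; integer columns 0..2 play the
# role of CountVectorizer word ids (this toy frame has no 会计年度 column,
# so the word columns start at position 2 rather than 3).
dtm = pd.DataFrame(
    {"股票代码": ["A1", "A2", "A3", "A4"],
     "行业代码": ["C39", "C39", "J66", "J66"],
     0: [2, 1, 0, 4], 1: [2, 1, 3, 0], 2: [0, 2, 3, 4]}
)

# Row-normalize counts into relative word frequencies, as in the loop.
freq = dtm.iloc[:, 2:].apply(lambda row: row / np.sum(row), axis=1)
dtm = pd.concat([dtm[["股票代码", "行业代码"]], freq], axis=1)

code, ind = "A1", "C39"
# Industry mean frequency, excluding the focal firm itself.
ind_freq = dtm[(dtm["行业代码"] == ind) & (dtm["股票代码"] != code)].iloc[:, 2:].mean(axis=0)
# Market mean frequency over all other industries.
market_freq = dtm[dtm["行业代码"] != ind].iloc[:, 2:].mean(axis=0)
```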
<p><br><br></p>
<h2 id="六标准信息信息含量">6. Standard Information and Information Content</h2>
<p>Taking stock 000002 in 2023 as an example, we first compute its standard information and information content. Once that works, we extend the computation to the MD&amp;A of every listed company in every year.</p>
<p><strong>Besides the full MD&amp;A, the original paper also splits each MD&amp;A into a backward-looking part and a forward-looking part, computing standard information and information content for each separately. Here we compute them for the full MD&amp;A only.</strong></p>
<p><img loading="lazy" src="img/infor_pre.png" alt=""  />
</p>
<p>Here we use OLS from Python's statsmodels library to compute standard information and information content.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">csv_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/2023/A000002.csv&#39;</span><span class="p">)</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df7.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Rename the columns</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm_Ind&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm_Market&#39;</span><span class="p">]</span>
<span class="n">csv_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df8.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">smf</span>

<span class="c1"># Dependent variable: Norm</span>
<span class="c1"># Explanatory variables: Norm_Ind, Norm_Market</span>
<span class="n">formula</span> <span class="o">=</span> <span class="s1">&#39;Norm ~ Norm_Ind + Norm_Market&#39;</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">csv_df</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">OLS Regression Results                            
==============================================================================
Dep. Variable:                   Norm   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 9.583e+27
Date:                Sat, 27 Jul 2024   Prob (F-statistic):               0.00
Time:                        17:07:19   Log-Likelihood:             1.5646e+05
No. Observations:                4662   AIC:                        -3.129e+05
Df Residuals:                    4659   BIC:                        -3.129e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P&gt;|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    6.524e-16   1.21e-17     53.966      0.000    6.29e-16    6.76e-16
Norm_Ind        1.0000   7.27e-15   1.37e+14      0.000       1.000       1.000
Norm_Market -3.345e-15   3.54e-14     -0.095      0.925   -7.27e-14     6.6e-14
==============================================================================
Omnibus:                    10415.000   Durbin-Watson:                   0.035
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         80430526.920
Skew:                          20.542   Prob(JB):                         0.00
Kurtosis:                     645.160   Cond. No.                     3.76e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.76e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Standard information</span>
<span class="n">standard_info</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">Norm_Ind</span> <span class="o">+</span> <span class="n">result</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">Norm_Market</span>


<span class="c1"># Information content</span>
<span class="n">informative_content</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">resid</span><span class="p">))</span>


<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A000002标准信息: </span><span class="si">{}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">standard_info</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;A000002信息含量: </span><span class="si">{}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">informative_content</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">A000002标准信息: 0.9999999999999309
A000002信息含量: 2.986269512206345e-12
</code></pre></div><br>
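<p>To see what the two measures capture, here is a toy example on simulated word frequencies (all numbers are invented): when a firm's frequencies are mostly a mix of the industry and market norms, the two slopes sum to about 1 (high standard information), while the residuals, and hence the information content, stay small.</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated frequencies: the firm's Norm tracks a 0.8/0.2 mix of the
# industry and market norms plus a little firm-specific noise.
rng = np.random.default_rng(0)
norm_ind = rng.uniform(0, 0.01, 200)
norm_market = rng.uniform(0, 0.01, 200)
toy = pd.DataFrame({
    "Norm_Ind": norm_ind,
    "Norm_Market": norm_market,
    "Norm": 0.8 * norm_ind + 0.2 * norm_market + rng.normal(0, 1e-4, 200),
})

result = smf.ols("Norm ~ Norm_Ind + Norm_Market", data=toy).fit()

# Standard information: how much wording is "boilerplate" explained
# by industry and market norms (sum of the two slopes, here near 1).
standard_info = result.params.Norm_Ind + result.params.Norm_Market
# Information content: firm-specific wording left in the residuals.
informative_content = sum(abs(result.resid))
```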
<p>Now that we can compute the standard information and information content for one company in one year, we extend the calculation to every company in every year and store the results in a single csv file.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">smf</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">re</span>


<span class="c1"># Write the results to mda_infor2001-2023.csv</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;mda_infor2001-2023.csv&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvf</span><span class="p">:</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;标准信息&#39;</span><span class="p">,</span> <span class="s1">&#39;信息含量&#39;</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    
    <span class="n">year_dirs</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;mda_infor_output&#39;</span><span class="p">)</span>
    <span class="n">year_dirs</span> <span class="o">=</span> <span class="p">[</span><span class="n">y</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">year_dirs</span> <span class="k">if</span> <span class="s1">&#39;DS&#39;</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">y</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">year_dir</span> <span class="ow">in</span> <span class="n">year_dirs</span><span class="p">:</span>
        <span class="n">code_csvfs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{year}</span><span class="s1">/</span><span class="si">{csvf}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">year_dir</span><span class="p">,</span> <span class="n">csvf</span><span class="o">=</span><span class="n">f</span><span class="p">)</span> 
                      <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;mda_infor_output/</span><span class="si">{}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">year_dir</span><span class="p">))]</span>
        <span class="n">code_csvfs</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">code_csvfs</span> <span class="k">if</span> <span class="s1">&#39;DS&#39;</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">f</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">csvf</span> <span class="ow">in</span> <span class="n">code_csvfs</span><span class="p">:</span> 
            <span class="k">try</span><span class="p">:</span>
                <span class="n">csv_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">csvf</span><span class="p">)</span>
                <span class="n">csv_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;行业代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;word_id&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm_Ind&#39;</span><span class="p">,</span> <span class="s1">&#39;Norm_Market&#39;</span><span class="p">]</span>
                <span class="n">formula</span> <span class="o">=</span> <span class="s1">&#39;Norm ~ Norm_Ind + Norm_Market&#39;</span>
                <span class="n">model</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">csv_df</span><span class="p">)</span>
                <span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>

                <span class="c1"># Standard information</span>
                <span class="n">standard_info</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">Norm_Ind</span> <span class="o">+</span> <span class="n">result</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">Norm_Market</span>
                <span class="c1"># Information content</span>
                <span class="n">informative_content</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">resid</span><span class="p">))</span>

                <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;股票代码&#39;</span><span class="p">:</span> <span class="s1">&#39;A&#39;</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">r&#39;\d{6}&#39;</span><span class="p">,</span> <span class="n">csvf</span><span class="p">)[</span><span class="mi">0</span><span class="p">]),</span> 
                        <span class="s1">&#39;会计年度&#39;</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">r&#39;\d{4}&#39;</span><span class="p">,</span> <span class="n">csvf</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span> 
                        <span class="s1">&#39;标准信息&#39;</span><span class="p">:</span> <span class="n">standard_info</span><span class="p">,</span> 
                        <span class="s1">&#39;信息含量&#39;</span><span class="p">:</span> <span class="n">informative_content</span><span class="p">}</span>
                <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
            <span class="k">except</span><span class="p">:</span>
                <span class="k">pass</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 7min 40s, sys: 33min 5s, total: 40min 45s
Wall time: 4min 36s
</code></pre></div><p><br><br></p>
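<p>The bare <code>except: pass</code> in the loop above silently drops any file that fails to parse. A slightly more defensive variant (a sketch, not the original code; the file names and the failing condition below are hypothetical) records the failures so they can be inspected afterwards:</p>

```python
# Collect failures instead of silently passing over them.
failed = []

def process(csvf):
    # Stand-in for the read_csv + OLS + writerow steps in the loop above.
    if csvf.endswith("bad.csv"):
        raise ValueError("unreadable file")
    return csvf

for csvf in ["2023/A000001.csv", "2023/bad.csv"]:
    try:
        process(csvf)
    except Exception as e:
        failed.append((csvf, repr(e)))

# After the run, `failed` lists exactly which files need re-checking.
```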
<p>Read the generated <em><strong>mda_infor2001-2023.csv</strong></em> file and inspect the <code>标准信息</code> (standard information) and <code>信息含量</code> (information content) columns.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda_infor2001-2023.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df9.png" alt=""  />
</p>
<br>
<p>Note that the original paper uses the MD&amp;A sections of Chinese listed companies' annual reports from 2007 to 2015 as its sample. It starts in 2007 because, by then, the disclosure requirements for MD&amp;A in periodic reports had become fairly complete, and 2007 also marks an important point in the international convergence of Chinese accounting standards: the newly enacted Accounting Standards for Business Enterprises took effect that year. Starting the sample in 2007 therefore avoids effects from the change in accounting standards.</p>
<p><strong>To replicate the original paper, remember to keep only data from 2007 onward.</strong></p>
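<p>Restricting the results to the original paper's sample window is a one-liner; the mini DataFrame below is hypothetical:</p>

```python
import pandas as pd

# Hypothetical mini-version of mda_infor2001-2023.csv.
df = pd.DataFrame({
    "股票代码": ["A000001", "A000001", "A000002"],
    "会计年度": [2005, 2008, 2010],
    "标准信息": [0.9, 0.8, 0.7],
    "信息含量": [0.1, 0.2, 0.3],
})

# Keep only observations from 2007 onward, matching the paper's sample.
df_2007 = df[df["会计年度"] >= 2007].reset_index(drop=True)
```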
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;mda_infor2001-2023.csv 记录数:&#39;</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mda_infor2001-2023.csv记录数: 53502
</code></pre></div><p><br><br></p>
<h2 id="七资料获取">7. Getting the Materials</h2>
<p>Producing the data and code takes real effort. If you need the source code and data, add WeChat 372335839 with a note in the form "name-school-major".</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Bundle price: 100 RMB, including
 - Management Discussion &amp; Analysis texts (mda01-23.csv.gz), 上市公司基本信息2000-2023.xlsx
 - computed results (mda_infor2001-2023.csv)
</code></pre></div><br>
<p>Screenshots of the materials; the full folder is about 12 GB.</p>
<p><img loading="lazy" src="img/screen.png" alt=""  />
</p>
<p><img loading="lazy" src="img/size.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="cntext使用声明">Citing cntext</h2>
<p>If you use cntext in your research or project, please describe it in the text and include a citation. For the recommended format, see the <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext recommended citation format</a>.</p>
<p><br><br></p>
<h2 id="相关内容">Related Posts</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/">中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/"><strong>数据集 | A股上市公司基本信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/"><strong>数据集 | 港股年报文本数据集(2007 ~ 2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/"><strong>数据集(付费) | 三板上市公司年报2002-2023.12</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/"><strong>数据集 | 美股年报10-K、20-F数据(2000-2023.12)</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/"><strong>词向量(付费) | 使用MD&amp;A2001-2022语料训练Word2Vec模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/"><strong>数据集 | 2001-2022年A股上市公司年报&amp;管理层讨论与分析</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>豆瓣影评 | 探索词向量妙处</title>
      <link>https://textdata.cn/blog/douban_w2v/</link>
      <pubDate>Sun, 21 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/douban_w2v/</guid>
      <description>Train and use word vectors with cntext.</description>
      <content:encoded><![CDATA[<p>Highlights of this post:</p>
<ul>
<li>Read the <em><strong>csv</strong></em> file</li>
<li>Prepare the corpus</li>
<li>Train a word-vector model with <em><strong>cntext</strong></em></li>
<li>Apply the word-vector model</li>
</ul>
<br>
<br>
<h2 id="一读取数据">1. Reading the Data</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;douban.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;电影  : </span><span class="si">{}</span><span class="s2"> 部&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">Movie_Name_CN</span><span class="o">.</span><span class="n">nunique</span><span class="p">()))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;评论  : </span><span class="si">{}</span><span class="s2"> 条&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    电影  : 28 部
    评论  : 2125056 条
</code></pre></div><p><br><br></p>
<h2 id="二准备语料">二、准备语料</h2>
<p>提取评论文本，保存为 txt 文件。注意：下方代码只做了拼接，并未去除非中文字符。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;douban.csv&#39;</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;douban.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">raw_text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Comment&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">))</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
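<p>若想在写入前去除非中文字符，可以加一步正则清洗。下面给出一个示意（<em><strong>keep_chinese</strong></em> 为演示用的假想命名，非 cntext 的接口）:</p>

```python
import re

def keep_chinese(text: str) -> str:
    # 将中文以外的字符替换为空格，保留词与词之间的间隔
    return re.sub(r'[^\u4e00-\u9fa5]', ' ', text)

# 用法（沿用上文的 df 与 Comment 列）：
# clean = df['Comment'].fillna('').map(keep_chinese)
# 再按上文方式用 '\n'.join(clean) 写入 douban.txt
```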
<h2 id="三训练模型">三、训练模型</h2>
<h3 id="31-安装cntext">3.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="32-训练模型">3.2 训练模型</h3>
<p>使用 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><em><strong>cntext</strong></em></a> 库(版本号2.1.6) 训练词向量word2vec模型, 这里我把 csv 数据整理为 txt</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 训练</span>
<span class="n">w2v_model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span> <span class="o">=</span> <span class="s1">&#39;douban.txt&#39;</span><span class="p">,</span>  
                        <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> 
                        <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span> 
                        <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> 
                        <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                        <span class="n">only_binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>  <span class="c1"># 只保存二进制模型文件</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/douban_cache.txt Not Found or Empty, Preprocessing Corpus
Processing Corpus: 11150it [00:07, 5759.05it/s]
Reading Preprocessed Corpus from output/douban_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2001 s. 
Output Saved To: output/douban-Word2Vec.200.15.bin
</code></pre></div><br>
<p>在代码所在文件夹内可以找到训练产物</p>
<ul>
<li>output/douban-Word2Vec.200.15.bin</li>
</ul>
<br>
<h3 id="24-评估模型">3.3 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v_model</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   459    |     78     |            0.43            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   615    |     62     |   39.02    |   2.98   |
|   CityInProvince   |   175    |     0      |   28.57    |   4.74   |
| FamilyRelationship |   272    |     0      |   92.65    |   1.48   |
|   SocialScience    |    8     |     62     |   25.00    |   6.00   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型表现越好。</p>
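<p>近义测试中的 Spearman 等级相关，本质是对「模型相似度」与「人工打分」两列数据的名次求相关。无并列名次时有一个简化公式，下面是纯 Python 的示意（<em><strong>spearman_no_ties</strong></em> 为演示用命名，非 cntext 的接口）:</p>

```python
def spearman_no_ties(xs, ys):
    # 无并列名次时的简化公式: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    n = len(xs)
    def ranks(vals):
        # 将数值映射为名次（假定无并列取值）
        return {v: i for i, v in enumerate(sorted(vals), start=1)}
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n * n - 1))

# 模型相似度与人工打分名次完全一致时，rho = 1
```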
<br>
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries  豆瓣影评语料在此项表现尚可，可能目前电影库中有一定比例的外国素材。</li>
<li>CityInProvince      豆瓣影评语料在此项表现较差。笔者猜测，原因并非中国素材太少，而是影视剧中的省市多以「汉东省」这类虚构名称出现。作为对照，<a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">人民网留言板语料Word2Vec</a>中该项准确率为100%。</li>
<li>FamilyRelationship  豆瓣影评体现的是电影相关内容，而电影永远的主题是人性， 内容少不了家长里短，七大姑八大姨，所以此项准确率高达92.65%。 以<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">年报MD&amp;A</a>为例，此处准确率只有10%。</li>
<li>SocialScience       豆瓣影评语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。</li>
</ul>
<p>整体而言，语料训练的效果很不错，抓住了数据场景的独特性语义。</p>
<p><br><br></p>
<h2 id="四使用word2vec">四、使用Word2Vec</h2>
<h3 id="41-导入word2vec模型文件">4.1 导入Word2Vec模型文件</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
 
<span class="c1"># 导入模型，请注意路径。</span>
<span class="c1"># 「当前代码」 与 「output」 同处于一个文件夹内</span>

<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/douban-Word2Vec.200.15.bin&#39;</span><span class="p">)</span>
<span class="c1"># dm_w2v = ct.load_w2v(&#39;output/douban-Word2Vec.200.15.txt&#39;)</span>

<span class="n">dm_w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/douban-Word2Vec.200.15.bin...
&lt;gensim.models.keyedvectors.KeyedVectors at 0x314193830&gt;
</code></pre></div><br>
<h3 id="42-keyedvectors的操作方法或属性">4.2 KeyedVectors的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><em><strong>KeyedVectors.index_to_key</strong></em></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.key_to_index</strong></em></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.vector_size</strong></em></td>
<td>获取模型中词向量的维度。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.get_vector(word)</strong></em></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_word(word, topn=10)</strong></em></td>
<td>获取某词语最相似的10个近义词。</td>
</tr>
<tr>
<td><em><strong>KeyedVectors.similar_by_vector(vector, topn=10)</strong></em></td>
<td>获取词向量最相似的10个近义词。</td>
</tr>
</tbody>
</table>
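<p>表中 <em><strong>similar_by_word</strong></em> / <em><strong>similar_by_vector</strong></em> 返回的分数就是余弦相似度，可以用 numpy 手工验证（示意）:</p>

```python
import numpy as np

def cosine(a, b):
    # 余弦相似度 = 点积 / 两向量范数之积
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 用法（沿用下文加载的 dm_w2v）：
# cosine(dm_w2v.get_vector('给力'), dm_w2v.get_vector('过瘾'))
# 结果应与 similar_by_word('给力') 中「过瘾」的分数一致
```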
<br>
<h3 id="44-查看词表">4.3 查看词表</h3>
<p>查看词表所有单词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">index_to_key</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;电影&#39;,
 &#39;一个&#39;,
 &#39;没有&#39;,
 &#39;喜欢&#39;,
 ...
 &#39;跟着&#39;,
 &#39;意识&#39;,
 &#39;态度&#39;,
 ...]
</code></pre></div><p>为了方便查看， 这里只展示部分数据。</p>
<br>
<h3 id="45-词表映射">4.4 词表映射</h3>
<p>查看单词到索引的映射</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">key_to_index</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;电影&#39;: 0,
 &#39;一个&#39;: 1,
 &#39;没有&#39;: 2,
...
&#39;跟着&#39;: 997,
 &#39;意识&#39;: 998,
 &#39;态度&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="46-向量维度数">4.5 向量维度数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;词表有 </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dm_w2v</span><span class="o">.</span><span class="n">key_to_index</span><span class="p">)</span><span class="si">}</span><span class="s1"> 个词&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;向量是 </span><span class="si">{</span><span class="n">dm_w2v</span><span class="o">.</span><span class="n">vector_size</span><span class="si">}</span><span class="s1"> 维&#39;</span><span class="p">)</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">词表有 426646 个词
向量是 200 维
</code></pre></div><br>
<h3 id="47-获取词向量">4.6 获取词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-1.24090052e+00, -6.79377019e-01,  1.42518425e+00, -1.46615291e+00,
       -9.53197628e-02,  6.50456071e-01, -2.97696137e+00,  2.20916629e+00,
        6.12876177e-01,  1.63172066e+00,  4.91760701e-01, -9
        ......
        ......
         -1.42494082e+00,  2.49131727e+00, -6.27597034e-01, -7.91438043e-01,
       -4.54898655e-01,  1.37747681e+00, -4.20672953e-01, -1.53694853e-01,
        1.04936564e+00,  2.18786263e+00, -8.07472587e-01, -8.32003877e-02],
      dtype=float32)
</code></pre></div><br>
<h3 id="48-近义词">4.7 近义词</h3>
<p>根据词语查看近义词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 近义词</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;相当给力&#39;, 0.6180022358894348),
 (&#39;太给力&#39;, 0.6019443273544312),
 (&#39;带劲&#39;, 0.5840415954589844),
 (&#39;不给力&#39;, 0.5774183869361877),
 (&#39;过瘾&#39;, 0.5616626739501953),
 (&#39;牛叉&#39;, 0.553788959980011),
 (&#39;出彩&#39;, 0.5414286851882935),
 (&#39;精彩&#39;, 0.5332293510437012),
 (&#39;看得过瘾&#39;, 0.5250197649002075),
 (&#39;大赞&#39;, 0.5205727219581604)]
</code></pre></div><br>
<p>根据向量查找最相似的近义词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">word_vector</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">)</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">word_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;给力&#39;, 1.0),
 (&#39;相当给力&#39;, 0.6180021166801453),
 (&#39;太给力&#39;, 0.6019443273544312),
 (&#39;带劲&#39;, 0.5840415954589844),
 (&#39;不给力&#39;, 0.5774183869361877),
 (&#39;过瘾&#39;, 0.5616626739501953),
 (&#39;牛叉&#39;, 0.5537890195846558),
 (&#39;出彩&#39;, 0.5414287447929382),
 (&#39;精彩&#39;, 0.5332292914390564),
 (&#39;看得过瘾&#39;, 0.5250197649002075)]
</code></pre></div><br>
<h3 id="49-计算多个词的中心向量">4.8 计算多个词的中心向量</h3>
<p>我们可以计算「宇宙」、「飞船」、「战争」的宇宙语义向量（中心向量）， 并试图寻找与中心向量 <em><strong>universe_vector</strong></em> 最相似的20个词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 几个词语构建的宇宙语义向量</span>
<span class="n">universe_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">,</span> 
                                       <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;宇宙&#39;</span><span class="p">,</span> <span class="s1">&#39;飞船&#39;</span><span class="p">,</span> <span class="s1">&#39;战争&#39;</span><span class="p">])</span>


<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">universe_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;宇宙&#39;, 0.7568532228469849),
 (&#39;星系&#39;, 0.7090039253234863),
 (&#39;飞船&#39;, 0.7080673575401306),
 (&#39;人类文明&#39;, 0.6973789930343628),
 (&#39;战舰&#39;, 0.6890057325363159),
 (&#39;母舰&#39;, 0.6864359974861145),
 (&#39;星球&#39;, 0.6799622774124146),
 (&#39;卫星&#39;, 0.6799139976501465),
 (&#39;星际&#39;, 0.6789332032203674),
 (&#39;空间站&#39;, 0.6780815124511719),
 (&#39;地球&#39;, 0.6769616603851318),
 (&#39;外太空&#39;, 0.6683873534202576),
 (&#39;核战&#39;, 0.6669113039970398),
 (&#39;外星飞船&#39;, 0.6592534780502319),
 (&#39;木星&#39;, 0.6586896777153015),
 (&#39;能源&#39;, 0.6562989950180054),
 (&#39;战争&#39;, 0.6556441187858582),
 (&#39;巨兽&#39;, 0.6544537544250488),
 (&#39;月球&#39;, 0.6525537967681885),
 (&#39;一艘&#39;, 0.6521110534667969)]
</code></pre></div><p>语义捕捉得很准哦。</p>
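<p><em><strong>ct.semantic_centroid</strong></em> 的内部实现以 cntext 源码为准；常见思路是对各词向量先做 L2 归一化再取平均。下面用 numpy 给出一个示意（<em><strong>semantic_centroid_sketch</strong></em> 为演示用命名）:</p>

```python
import numpy as np

def semantic_centroid_sketch(wv, words):
    # wv 假定支持 wv[word] 取向量（dict 或 gensim KeyedVectors 均可）
    vecs = []
    for w in words:
        v = np.asarray(wv[w], dtype=np.float64)
        # 先归一化，避免高频词向量范数过大而主导中心方向
        vecs.append(v / np.linalg.norm(v))
    return np.mean(vecs, axis=0)
```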
<h3 id="410-类比-king-man--woman--queen">4.9 类比 king-man + woman ~ queen</h3>
<p>每个词是高维向量空间中的一个点，两点之差构成有方向的向量，而向量之间可以比较方向。下面是类比推理过程：受限于语料，公式不一定严格成立，但思路可以类比。</p>
<p><img loading="lazy" src="img/kingqueenformular.png" alt=""  />
</p>
<h4 id="4101-传统类比">4.9.1 传统类比</h4>
<p>下面两组相减（国王-男人、王后-女人），直观上得到的都是剥离性别之后的「王权」语义方向。</p>
<p>$$
Vector1 \approx vector(国王)-vector(男人)
$$</p>
<p>$$
Vector2 \approx vector(王后)-vector(女人)
$$</p>
<p>那两个向量方向应该近似，即 <em><strong>Vector1</strong></em>  约等于 <em><strong>Vector2</strong></em> ，将其看做等式就得到如下公式：</p>
<p>$$
vector(国王)-vector(男人) \approx vector(王后) - vector(女人)
$$</p>
<p>现在我们检查由这三个词向量计算出的新向量，是否带有与 queen（王后）相关的语义信息。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">men_vector</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;男人&#39;</span><span class="p">)</span>
<span class="n">women_vector</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;女人&#39;</span><span class="p">)</span> 
<span class="n">king_vector</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;国王&#39;</span><span class="p">)</span> 

<span class="c1"># 假设 king - man 近似等于 queen - woman</span>
<span class="c1"># result 近似等于 king - man + woman，语义应接近 queen</span>
<span class="n">result_vector</span> <span class="o">=</span> <span class="n">king_vector</span> <span class="o">-</span> <span class="n">men_vector</span> <span class="o">+</span> <span class="n">women_vector</span>
<span class="c1"># 现在检查 result_vector 的语义应该与queen相关</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">result_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;国王&#39;, 0.8276543617248535),
 (&#39;王后&#39;, 0.754295289516449),
 (&#39;皇后&#39;, 0.6877321004867554),
 (&#39;公主&#39;, 0.6311503052711487),
 (&#39;王位&#39;, 0.6292931437492371),
 (&#39;皇帝&#39;, 0.6280742287635803),
 (&#39;王妃&#39;, 0.6235458850860596),
 (&#39;伊丽莎白一世&#39;, 0.6158717274665833),
 (&#39;君主&#39;, 0.6151927709579468),
 (&#39;公爵&#39;, 0.6111372113227844),
 (&#39;女王&#39;, 0.6068686246871948),
 (&#39;登基&#39;, 0.606802225112915),
 (&#39;皇子&#39;, 0.5979987382888794),
 (&#39;侍卫&#39;, 0.594831109046936),
 (&#39;夫人&#39;, 0.5942187309265137),
 (&#39;王室&#39;, 0.5891965627670288),
 (&#39;女皇&#39;, 0.5889874696731567),
 (&#39;继位&#39;, 0.5818601846694946),
 (&#39;皇室&#39;, 0.5812580585479736),
 (&#39;王冠&#39;, 0.5733407139778137)]
</code></pre></div><h4 id="4102-新算法">4.9.2 新算法</h4>
<p><em><strong>most_similar_cosmul</strong></em> 使用了一种基于 <strong>乘法组合</strong> 的相似度计算方法，而不是简单的向量加减法。其核心公式如下：
$$
\text{Similarity}(w, \text{positive}, \text{negative}) = \frac{\prod_{p \in \text{positive}} \cos(w, p)}{\prod_{n \in \text{negative}} \cos(w, n)}
$$
对于给定的正样本词集合 P 和负样本词集合 N，目标是找到一个词 w，使得得分最大化（gensim 实现中对余弦做了平移使其非负，并在分母加上一个很小的 ε 防止除零）。</p>
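<p>按上式可以用 numpy 写一个打分函数的示意（<em><strong>cos01</strong></em>、<em><strong>cosmul_score</strong></em> 均为演示用命名，细节以 gensim 源码为准）:</p>

```python
import numpy as np

def cos01(a, b):
    # 平移后的余弦 (cos + 1) / 2，取值落在 [0, 1]，保证连乘有意义
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return (c + 1) / 2

def cosmul_score(w, positive, negative, eps=1e-6):
    # 3CosMul：正样本相似度连乘 / (负样本相似度连乘 + eps)
    num = np.prod([cos01(w, p) for p in positive])
    den = np.prod([cos01(w, n) for n in negative]) + eps
    return num / den
```

得分越高，候选词 w 越符合「与正样本都相似、与负样本都不相似」的要求。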
<p>参照如下的例子
$$
vector(王后) \approx vector(国王) + vector(女人) - vector(男人)
$$</p>
<p>其中正向目标词有 <em><strong>国王</strong></em> 和 <em><strong>女人</strong></em>， 负向词有 <em><strong>男人</strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 类比函数</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">most_similar_cosmul</span><span class="p">(</span><span class="n">positive</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;女人&#39;</span><span class="p">],</span>   <span class="c1">#</span>
                           <span class="n">negative</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;男人&#39;</span><span class="p">],</span> 
                           <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;王后&#39;, 0.9907146692276001),
 (&#39;皇后&#39;, 0.9572808146476746),
 (&#39;公主&#39;, 0.9137295484542847),
 (&#39;王妃&#39;, 0.9079920649528503),
 (&#39;皇帝&#39;, 0.905644953250885),
 (&#39;伊丽莎白一世&#39;, 0.9031068682670593),
 (&#39;女王&#39;, 0.8956636190414429),
 (&#39;王位&#39;, 0.8942943215370178),
 (&#39;登基&#39;, 0.8899738192558289),
 (&#39;君主&#39;, 0.8883361220359802),
 (&#39;公爵&#39;, 0.8862053751945496),
 (&#39;王室&#39;, 0.8842172622680664),
 (&#39;夫人&#39;, 0.8840034604072571),
 (&#39;女皇&#39;, 0.8824913501739502),
 (&#39;侍卫&#39;, 0.8815361857414246),
 (&#39;皇子&#39;, 0.8785887360572815),
 (&#39;皇室&#39;, 0.8755369186401367),
 (&#39;继位&#39;, 0.8736834526062012),
 (&#39;驾崩&#39;, 0.8675689101219177),
 (&#39;波旁王朝&#39;, 0.8671858906745911)]
</code></pre></div><p>可以看到返回的前两个词就是「王后」「皇后」，与公式推算结果一致。</p>
<p><br><br></p>
<h2 id="五获取资料">五、获取资料</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费词向量      https://cntext.readthedocs.io/zh-cn/latest/embeddings.html

- 1000w-douban-movies.zip  链接: https://pan.baidu.com/s/1V8FUA9_qwHBW-utoOcV11w?pwd=t3sa 提取码: t3sa 

- 442w-douban-movies.zip
链接: https://pan.baidu.com/s/1bhJls4P33a6EwZ6guhiw_A?pwd=qi28 提取码: qi28

- 212w-douban-movie.zip
链接: https://pan.baidu.com/s/1vaOKOJPA3F4ipBrdZygtLA?pwd=gfvd 提取码: gfvd 
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://cntext.readthedocs.io/">文本分析库cntext2.x使用手册 https://cntext.readthedocs.io/</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用Stanford Glove代码训练中文语料的Glove模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量 | 使用人民网领导留言板语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-word2vec-by-year-by-province/">使用 5000w 专利申请数据集按年份(按省份)训练词向量</a></li>
<li><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">使用 1000w 条豆瓣影评训练 Word2Vec</a></li>
<li><a href="https://textdata.cn/blog/2023-03-15-39faq-about-word-embeddings-for-social-science/">词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总</a></li>
<li><a href="https://textdata.cn/blog/2022-04-07-word-embeddings-in-social-science/">转载|大数据时代下社会科学研究方法的拓展——基于词嵌入技术的文本分析的应用</a></li>
<li><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>管理世界 | 使用 经营讨论与分析 测量 企业数字化</title>
      <link>https://textdata.cn/blog/2022-11-03-mda-measure-digitalization/</link>
      <pubDate>Sat, 20 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2022-11-03-mda-measure-digitalization/</guid>
      <description>使用 经营讨论与分析 文本数据，测量企业的数字化指标，代码实现过程参考2021管理世界的一篇论文。This paper conducted a research on the impact of corporate digital transformation on stock liquidity, by means of data from 2007~2018 of A-share listed companies in China. This paper empirically tested the impact, mechanisms and external basic conditions of corporate digital transformation on stock liquidity. The main conclusions are as follows. Firstly, corporate digital transformation has significantly improved the level of stock liquidity. In particular, there are significant asymmetric effects under different corporate attributes and characteristics, which is the digital transformation of non-state-owned enterprises and high-tech enterprises can raise the level of stock liquidity in the market. Secondly, corporate digital transformation can improve the problem of information asymmetry, increase market investors&amp;#39; expectations, and optimize the input and output of enterprise innovation, and therefore improve the quality and efficiency of corporate operations. All of these will contribute to the improvement of corporate stock liquidity. Thirdly, effective external conditions are the important foundation for corporate digital transformation to work effectively. Moreover, a good foundation for the development of financial technology plays a positive moderating effect in the &amp;#34;corporate digital transformation—stock liquidity&amp;#34; relationship.</description>
      <content:encoded><![CDATA[<p>使用 经营讨论与分析 数据，计算企业数字化指标, 相关论文:</p>
<ul>
<li>吴非, 胡慧芷, 林慧妍, and 任晓怡. &ldquo;企业数字化转型与资本市场表现——来自股票流动性的经验证据.&rdquo; 管理世界 (2021).</li>
<li>宋德勇, 朱文博, and 丁海. &ldquo;企业数字化能否促进绿色技术创新?.&rdquo; 财经研究 48, no. 4 (2022).</li>
<li>方明月,聂辉华,阮睿,沈昕毅.企业数字化转型与经济政策不确定性感知[J].金融研究,2023,(02):21-39.</li>
</ul>
<p>数字化指标分析结果以 xlsx 存储。</p>
<p><br><br></p>
<h2 id="一读取数据">一、读取数据</h2>
<p><img loading="lazy" src="img/mda_screen.png" alt=""  />
</p>
<p>完整md&amp;a数据集 841 M，覆盖 55856 条md&amp;a记录。 查看数据集详情可点击</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/"><strong>数据集 | 2001-2023年A股上市公司年报&amp;管理层讨论与分析</strong></a></li>
</ul>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import pandas as pd

df = pd.read_csv(&#39;mda01-22.csv.gz&#39;, compression=&#39;gzip&#39;)
print(len(df))
df.head()
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">55856
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<h2 id="二构建词典">二、构建词典</h2>
<p>下图是吴非等(2021)数字化指标的截图</p>
<p><img loading="lazy" src="img/%e7%ae%a1%e7%90%86%e4%b8%96%e7%95%8c2021%e5%90%b4%e9%9d%9e-%e4%bc%81%e4%b8%9a%e6%95%b0%e5%ad%97%e5%8c%96-%e5%85%b3%e9%94%ae%e8%af%8d.png" alt=""  />
</p>
<blockquote>
<p>后期，如果想自己扩展词典，可以初步筛选种子词(该篇论文的词表), 使用md&amp;a语料文件(txt格式)， 结合cntext库的so-pmi或词向量方法，对数字化词典进行扩充。</p>
</blockquote>
<p>这里我已将吴非等(2021)的词表内置到 cntext库（2.1.1版本）的 zh_common_Digitalization.yaml 中。</p>
<br>
<h3 id="21-安装cntext">2.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install distinctiveness
pip3 install cntext==2.1.1
</code></pre></div><br>
<h3 id="22-导入词典">2.2 导入词典</h3>
<p>查看内置词典</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import cntext as ct

ct.get_dict_list()
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;zh_common_NTUSD.yaml&#39;,
 &#39;zh_common_DUTIR.yaml&#39;,
 &#39;enzh_common_StopWords.yaml&#39;,
 &#39;en_valence_Concreteness.yaml&#39;,
 &#39;en_common_LoughranMcDonald.yaml&#39;,
 &#39;zh_common_FinanceSenti.yaml&#39;,
 &#39;zh_common_TsinghuaPraiseDegrade.yaml&#39;,
 &#39;en_common_ANEW.yaml&#39;,
 &#39;en_common_NRC.yaml&#39;,
 &#39;zh_valence_ChineseEmoBank.yaml&#39;,
 &#39;zh_valence_SixSemanticDimensionDatabase.yaml&#39;,
 &#39;zh_common_FinacialFormalUnformal.yaml&#39;,
 &#39;zh_common_LoughranMcDonald.yaml&#39;,
 &#39;enzh_common_AdvConj.yaml&#39;,
 &#39;en_common_SentiWS.yaml&#39;,
 &#39;zh_common_Digitalization.yaml&#39;,
 &#39;en_common_LSD2015.yaml&#39;,
 &#39;zh_common_HowNet.yaml&#39;]
</code></pre></div><br>
<p>导入数字化词典</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">Digitalization_Infos</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_Digitalization.yaml&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">Digitalization_Infos</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;Name&#39;: &#39;中文数字化词典&#39;, 

&#39;Desc&#39;: &#39;基于这篇论文，构建了中文数字化词典，含人工智能技术、大数据技术、云计算技术、区块链技术、数字技术应用等关键词列表。 &#39;, 

&#39;Refer&#39;: &#39;吴非,胡慧芷,林慧妍,任晓怡. 企业数字化转型与资本市场表现——来自股票流动性的经验证据[J]. 管理世界,2021,37(07):130-144+10.&#39;, 

&#39;Category&#39;: [&#39;Artificial_Intelligence&#39;, &#39;Big_Data&#39;, &#39;Cloud_Computing&#39;, &#39;Block_Chains&#39;, &#39;Usage_of_Digitalization&#39;], 

&#39;Dictionary&#39;: {
    &#39;Artificial_Intelligence&#39;: [&#39;人工智能&#39;, &#39;商业智能&#39;, &#39;图像理解&#39;, &#39;投资决策辅助系统&#39;, &#39;智能数据分析&#39;, &#39;智能机器人&#39;, &#39;机器学习&#39;, &#39;深度学习&#39;, &#39;语义搜索&#39;, &#39;生物识别技术&#39;, &#39;人脸识别&#39;, &#39;语音识别&#39;, &#39;身份验证&#39;, &#39;自动驾驶&#39;, &#39;自然语言处理&#39;], 
    
    &#39;Big_Data&#39;: [&#39;大数据&#39;, &#39;数据挖掘&#39;, &#39;文本挖掘&#39;, &#39;数据可视化&#39;, &#39;异构数据&#39;, &#39;征信&#39;, &#39;增强现实&#39;, &#39;混合现实&#39;, &#39;虚拟现实&#39;], 
    
    &#39;Cloud_Computing&#39;: [&#39;云计算&#39;, &#39;流计算&#39;, &#39;图计算&#39;, &#39;内存计算&#39;, &#39;多方安全计算&#39;, &#39;类脑计算&#39;, &#39;绿色计算&#39;, &#39;认知计算&#39;, &#39;融合架构&#39;, &#39;亿级并发&#39;, &#39;EB级存储&#39;, &#39;物联网&#39;, &#39;信息物理系统&#39;], 
    
    &#39;Block_Chains&#39;: [&#39;区块链&#39;, &#39;数字货币&#39;, &#39;分布式计算&#39;, &#39;差分隐私技术&#39;, &#39;智能金融合约&#39;], 
    
    &#39;Usage_of_Digitalization&#39;: [&#39;移动互联网&#39;, &#39;工业互联网&#39;, &#39;移动互联&#39;, &#39;互联网医疗&#39;, &#39;电子商务&#39;, &#39;移动支付&#39;, &#39;第三方支付&#39;, &#39;NFC支付&#39;, &#39;智能能源&#39;, &#39;B2B&#39;, &#39;B2C&#39;, &#39;C2B&#39;, &#39;C2C&#39;, &#39;O2O&#39;, &#39;网联&#39;, &#39;智能穿戴&#39;, &#39;智慧农业&#39;, &#39;智能交通&#39;, &#39;智能医疗&#39;, &#39;智能客服&#39;, &#39;智能家居&#39;, &#39;智能投顾&#39;, &#39;智能文旅&#39;, &#39;智能环保&#39;, &#39;智能电网&#39;, &#39;智能营销&#39;, &#39;数字营销&#39;, &#39;无人零售&#39;, &#39;互联网金融&#39;, &#39;数字金融&#39;, &#39;Fintech&#39;, &#39;金融科技&#39;, &#39;量化金融&#39;, &#39;开放银行&#39;]
    }
}
</code></pre></div><br>
<br>
<h2 id="三定义数字化函数">三、定义数字化函数</h2>
<p>目前，对于企业数字化水平的度量是相关研究的难点，现有文献主要有三种度量方法。</p>
<ul>
<li>第一，祁怀锦等（2020）使用企业年末无形资产明细项中与数字经济相关部分的金额占无形资产总额的比例度量企业数字化程度。</li>
<li>第二，大量研究运用数字化相关关键词在年报中的词频数量或占比度量企业的数字化转型或数字化水平（赵宸宇，2021；袁淳等，2021）。</li>
<li>第三，相关研究采取问卷调查的方式获取企业的数字化水平数据（刘政等，2020）。</li>
</ul>
<p>使用第二种方法，通过Python定义数字化函数，统计文本中数字化词语个数得到相应指标。</p>
<blockquote>
<p>吴非等(2021管理世界)数字化指标的计算更复杂一些：在词频统计基础上，剔除关键词前存在“没”“无”“不”等否定词语的表述，同时也剔除非本公司（包括公司的股东、客户、供应商、公司高管简介在内）的“数字化转型”关键词。</p>
</blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import pandas as pd

#函数内导入cntext是为了适配并行运算pandarallel
def digtal_function(text):
    import cntext as ct
    #统计text中每类词的个数
    digtal_diction = ct.read_yaml_dict(&#39;zh_common_Digitalization.yaml&#39;)[&#39;Dictionary&#39;]
    res = ct.sentiment(text=text,  diction=digtal_diction)
    return pd.Series(res)


test_text = &#39;经过技术人员不懈努力， 该企业在人工智能、大数据、云计算、工业互联网等领域有了一定的市场地位....&#39;


digtal_function(text=test_text)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Artificial_Intelligence_num     1
Big_Data_num                    1
Cloud_Computing_num             1
Block_Chains_num                0
Usage_of_Digitalization_num     1
stopword_num                   11
word_num                       24
sentence_num                    1
dtype: int64
</code></pre></div><p><br><br></p>
<h2 id="四批量计算">4. Batch Computation</h2>
<p>Apply <em><strong>digtal_function</strong></em> to the <em><strong>text</strong></em> column with pandarallel's <em><strong>parallel_apply</strong></em> to obtain <em><strong>res_df</strong></em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">from pandarallel import pandarallel
pandarallel.initialize()

# returns a DataFrame; each number is how often that word category appears
res_df = df[&#39;text&#39;].parallel_apply(digtal_function)
res_df.head()
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
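<p>If pandarallel is unavailable, the same computation runs single-process with pandas' own <em><strong>apply</strong></em>. A self-contained sketch, with a toy lexicon counter standing in for <em><strong>digtal_function</strong></em> and a toy frame standing in for <em><strong>df</strong></em>:</p>

```python
import pandas as pd

# toy frame standing in for the MD&A DataFrame df
df = pd.DataFrame({'text': ['人工智能与大数据', '传统制造业务']})

# toy counter standing in for digtal_function: counts hits from a tiny lexicon
def count_digital(text):
    words = ['人工智能', '大数据', '云计算']
    return pd.Series({'Digital_word_num': sum(text.count(w) for w in words)})

# plain, single-process equivalent of parallel_apply
res_df = df['text'].apply(count_digital)
print(res_df)
```

Swapping `apply` for `parallel_apply` (after `pandarallel.initialize()`) changes only the execution, not the result.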
<p>Field descriptions</p>
<ul>
<li><em><strong>Artificial_Intelligence_num</strong></em>	 count of artificial-intelligence terms in the MD&amp;A</li>
<li><em><strong>Big_Data_num</strong></em>	 count of big-data terms in the MD&amp;A</li>
<li><em><strong>Cloud_Computing_num</strong></em>	count of cloud-computing terms in the MD&amp;A</li>
<li><em><strong>Block_Chains_num</strong></em>	count of blockchain terms in the MD&amp;A</li>
<li><em><strong>Usage_of_Digitalization_num</strong></em>	count of digitalization-application terms in the MD&amp;A</li>
<li><em><strong>stopword_num</strong></em>	count of stopwords in the MD&amp;A</li>
<li><em><strong>word_num</strong></em>	total number of words in the MD&amp;A (its length)</li>
<li><em><strong>sentence_num</strong></em>   number of sentences in the MD&amp;A</li>
</ul>
<p><br><br></p>
<h2 id="五结果整理">5. Assembling the Results</h2>
<p>Next, sum the per-category counts into the total number of digitalization words per firm, then convert that count into the digitalization index.</p>
<blockquote>
<p>Because these counts are strongly right-skewed, they should be log-transformed (here, or later in your statistical software) to obtain the overall firm-digitalization index.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import numpy as np

res_df[&#39;Digital_word_num&#39;] = res_df[[&#39;Artificial_Intelligence_num&#39;, 
                                     &#39;Big_Data_num&#39;, 
                                     &#39;Cloud_Computing_num&#39;, 
                                     &#39;Block_Chains_num&#39;, 
                                     &#39;Usage_of_Digitalization_num&#39;]].sum(axis=1)

# digitalization index: log(1 + total digitalization word count)
res_df[&#39;Digital_Index&#39;] = np.log(res_df[&#39;Digital_word_num&#39;]+1)
res_df.head()
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
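<p>Some of the studies cited above use the share of digitalization words in total words rather than the log count. A self-contained sketch of both variants, with toy counts standing in for <em><strong>res_df</strong></em>:</p>

```python
import numpy as np
import pandas as pd

# toy counts mimicking the structure of res_df
res_df = pd.DataFrame({'Digital_word_num': [4, 0, 19],
                       'word_num': [200, 150, 400]})

# variant 1: frequency share of digitalization words in total words
res_df['Digital_Ratio'] = res_df['Digital_word_num'] / res_df['word_num']

# variant 2: log count, as used in the main text
res_df['Digital_Index'] = np.log(res_df['Digital_word_num'] + 1)
print(res_df)
```

Which variant to report depends on the study you follow; both are common in the keyword-frequency literature.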
<p><br><br></p>
<h2 id="六保存结果">6. Saving the Results</h2>
<p>Concatenate <em><strong>df</strong></em> and <em><strong>res_df</strong></em>, then inspect the minimum, mean, and maximum of <em><strong>Digital_Index</strong></em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df2 = pd.concat([df, res_df], axis=1)

print(&#39;Digital_Index最小值: &#39;, df2.Digital_Index.min())
print(&#39;Digital_Index平均值: &#39;, df2.Digital_Index.mean())
print(&#39;Digital_Index最大值: &#39;, df2.Digital_Index.max())
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Digital_Index最小值: 0.0
Digital_Index平均值: 0.836223935643458
Digital_Index最大值: 5.963579343618446
</code></pre></div><br>
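<p>A quick way to see why the log transform from Section 5 matters: with raw counts, the mean sits well above the median (right skew), and the transform compresses the long tail. A self-contained sketch with made-up counts:</p>

```python
import numpy as np
import pandas as pd

# made-up digitalization word counts for a handful of firms
counts = pd.Series([0, 0, 2, 5, 40, 388])

# right skew shows up as mean >> median
print('raw mean > median:', counts.mean() > counts.median())

logged = np.log1p(counts)  # same as np.log(counts + 1)
print('log-transformed max:', logged.max())
```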
<p>Select the needed columns and save them to <em><strong>corporate_digitalization.xlsx</strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="p">[[</span><span class="s1">&#39;code&#39;</span><span class="p">,</span> 
     <span class="s1">&#39;year&#39;</span><span class="p">,</span> 
     <span class="s1">&#39;Digital_word_num&#39;</span><span class="p">,</span> 
     <span class="s1">&#39;word_num&#39;</span><span class="p">,</span> 
     <span class="s1">&#39;Digital_Index&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;corporate_digitalization.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<p>Inspect the saved file <em><strong>corporate_digitalization.xlsx</strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># inspect the result</span>
<span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;corporate_digitalization.xlsx&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/digital_index.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="七获取资料">7. Obtaining the Materials</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 100 CNY
   - MD&amp;A data: mda01-23.csv.gz
   - notebook: 数字化代码.ipynb
   - corporate_digitalization.xlsx
</code></pre></div><p>Add WeChat <strong>372335839</strong> with the note 「Name-University-Major」.</p>
<p><br><br></p>
<h2 id="cntext使用声明">cntext Usage Statement</h2>
<p>If you use cntext in research or a project, please acknowledge it and include a citation; see the <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">recommended cntext citation format</a>.</p>
<p><br><br></p>
<h2 id="精选内容">Selected Content</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>Tutorial | Turning Text Data into Structured Data with Large-Model APIs</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>Tutorial | Downloading &amp; Using Local Large Language Models with Ollama</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>Experiment | Getting JSON-Structured Output from Ollama</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>Recommended | cntext Text-Analysis Library Manual</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>Experiment | Extracting Structured Information from Raw Text with a Local Large Model</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Dataset | Personal Characteristics, Education Background, and Appointments of Listed-Company Directors, Supervisors, and Senior Executives</title>
      <link>https://textdata.cn/blog/2024-04-18-china-a-listed-company-figure-characteristic-dataset/</link>
      <pubDate>Thu, 18 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-18-china-a-listed-company-figure-characteristic-dataset/</guid>
      <description>&lt;h2 id=&#34;一上市公司董监高&#34;&gt;1. Directors, Supervisors, and Senior Executives of Listed Companies&lt;/h2&gt;
&lt;h3 id=&#34;11-数据集概况&#34;&gt;1.1 Dataset Overview&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Dataset: research database of listed-company executive characteristics (中国上市公司人物特征研究数据库)
   
Number of executives covered: 375105

Source: Sina Finance executive pages (public information)

Record counts:
   - personal characteristics: 1548448
   - education background details: 639615
   - appointment records: 1448841

Coverage: 1990 to 2024-04-08

Note: for questions, add WeChat 372335839 with the note 「Name-University-Major」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Screenshots of the executives page, using New Hope (新希望, 000876) as an example.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpManager/stockid/000876.phtml&#34;&gt;https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpManager/stockid/000876.phtml&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-声明&#34;&gt;1.2 Disclaimer&lt;/h3&gt;
&lt;p&gt;For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;2. Inspecting the Data&lt;/h2&gt;
&lt;h3 id=&#34;21-董监高教育背景明细表&#34;&gt;2.1 Education Background Detail Table&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import pandas as pd

df1 = pd.read_csv(&amp;#39;董监高教育背景明细表.csv&amp;#39;)
df1.head()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;View the fields&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;field_max_len&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;desc_max_len&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;field&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;desc&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;- &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;field&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;field_max_len&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;desc&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;desc_max_len&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- Symbol         股票代码  
- EndDate        截止日期  
- PersonID       人员ID  
- FullName       人员姓名  
- Degree         学历    
- UniversityID   毕业院校ID
- University     毕业院校  
- Major          专业    
- AdmissionTime  入校时间  
- GraduationTime 毕业时间  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;22-董监高个人特征&#34;&gt;2.2 Personal Characteristics Table&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;董监高个人特征.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;View the fields&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;field_max_len = max([len(x) for x in df2.iloc[0, :].index])
desc_max_len = max([len(x) for x in df2.iloc[0, :].values])

for field, desc in zip(df2.iloc[0, :].index, df2.iloc[0, :].values):
    print(f&amp;#39;- {field:&amp;lt;{field_max_len}} {desc:&amp;lt;{desc_max_len}}&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- Stkcd             证券代码
- Reptdt            统计截止日期
- PersonID          人员ID
- Name              姓名
- Nationality       国籍
- NativePlace       籍贯
- NatAreaCode       籍贯所在地区代码
- BirthPlace        出生地
- BirAreaCode       出生地所在地区代码
- Gender            性别
- Age               年龄
- University        毕业院校
- Degree            学历
- Major             专业
- Profession        职称
- Resume            个人简历
- PaidSign          是否领取薪酬
- TotalSalary       报告期报酬总额
- Allowance         其中：津贴
- SharEnd           年末持股数
- IsMTMT            是否高管团队成员
- TMTP              高管职务类别
- IsMTB             是否董事会成员
- CTB               董事会职务类别
- IsIdirecotr       是否独立董事
- IsDuality         是否兼任董事长和CEO
- IsSupervisor      是否监事
- Position          具体职务
- PositionID        具体职务ID
- ServicePosition   在职职务
- ServicePositionID 在职职务ID
- Funback           职业背景
- OveseaBack        海外背景
- Academic          学术背景
- FinBack           金融背景
- IsCocurP          是否在股东单位兼任
- OtherCo           兼任职务
- OtherCoType       兼任职务类别
- Director_TotCO    兼任职务为董事的公司总数
- Director_ListCO   兼任职务为董事的上市公司总数
- Stkcd_director    兼任职务为董事的上市公司代码
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-董监高任职情况表&#34;&gt;2.3 Appointment Records Table&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;董监高任职情况表.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;field_max_len&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;desc_max_len&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;field&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;desc&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;zip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;:]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;- &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;field&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;field_max_len&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;desc&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;desc_max_len&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- Stkcd         证券代码    
- Reptdt        统计截止日期  
- PersonID      人员ID    
- Name          姓名      
- Position      具体职务    
- PositionID    具体职务ID  
- StartDate     任职开始日期  
- EndDate       任职结束日期  
- ServiceStatus 是否在职    
- Tenure        任期      
- ToLeavPost    距离离任剩余日期
- ResignReason  离职原因    
- GTAPosition   职务名称    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三相关数据&#34;&gt;3. Related Datasets&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-11-25-senior-manager-resume-dataset/&#34;&gt;Dataset (paid) | 900k Records on Chinese Listed-Company Executives&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-05-17-top-manager-violation/&#34;&gt;Dataset | Listed-Company Executive Violation Data (2008-2022)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;Dataset | A-Share Annual Reports &amp;amp; MD&amp;amp;A, 2001-2022&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/&#34;&gt;Dataset (paid) | NEEQ (New Third Board) Annual Reports, 2002-2023.12&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-03-listed-company-arbitration-dataset/&#34;&gt;Dataset | 36330 Listed-Company Arbitration Records (2000-2021)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-07-patent-application-dataset-of-listed-company-in-china-a-market/&#34;&gt;Dataset | 2.08 Million Listed-Company Patent Records (1991-2022)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/&#34;&gt;Dataset | 840k Earnings-Call Q&amp;amp;A Records (2005-2023)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-08-11-china-a-market-corporate-social-responsibility-dataste/&#34;&gt;Dataset | Corporate Social Responsibility Reports, 2006-2022&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-17-china-a-market-inquiry-letter-datasets/&#34;&gt;Dataset (paid) | Regulatory Inquiry Letters, 2014-2022&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-04-26-entrusted-loan-dataset/&#34;&gt;Dataset | Listed-Company Entrusted-Loan Announcements, 2007-2021&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/coporate_social_responsibility_datasets/&#34;&gt;Dataset | Corporate Social Responsibility Report Collection&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一上市公司董监高">1. Directors, Supervisors, and Senior Executives of Listed Companies</h2>
<h3 id="11-数据集概况">1.1 Dataset Overview</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset: research database of listed-company executive characteristics (中国上市公司人物特征研究数据库)
   
Number of executives covered: 375105

Source: Sina Finance executive pages (public information)

Record counts:
   - personal characteristics: 1548448
   - education background details: 639615
   - appointment records: 1448841

Coverage: 1990 to 2024-04-08

Note: for questions, add WeChat 372335839 with the note 「Name-University-Major」
</code></pre></div><p>Screenshots of the executives page, using New Hope (新希望, 000876) as an example.</p>
<blockquote>
<p><a href="https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpManager/stockid/000876.phtml">https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpManager/stockid/000876.phtml</a></p>
</blockquote>
<p><img loading="lazy" src="img/01-cover.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-cover.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-cover.png" alt=""  />
</p>
<br>
<h3 id="12-声明">1.2 Disclaimer</h3>
<p>For research use only. For questions, add WeChat 372335839 with the note 「Name-University-Major」.</p>
<p><br><br></p>
<h2 id="二查看数据">2. Inspecting the Data</h2>
<h3 id="21-董监高教育背景明细表">2.1 Education Background Detail Table</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import pandas as pd

df1 = pd.read_csv(&#39;董监高教育背景明细表.csv&#39;)
df1.head()
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<p>View the fields</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">field_max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df1</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">index</span><span class="p">])</span>
<span class="n">desc_max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df1</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">values</span><span class="p">])</span>

<span class="k">for</span> <span class="n">field</span><span class="p">,</span> <span class="n">desc</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df1</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">df1</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">values</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;- </span><span class="si">{</span><span class="n">field</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">field_max_len</span><span class="si">}}</span><span class="s1"> </span><span class="si">{</span><span class="n">desc</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">desc_max_len</span><span class="si">}}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Symbol         股票代码  
- EndDate        截止日期  
- PersonID       人员ID  
- FullName       人员姓名  
- Degree         学历    
- UniversityID   毕业院校ID
- University     毕业院校  
- Major          专业    
- AdmissionTime  入校时间  
- GraduationTime 毕业时间  
</code></pre></div><p><br><br></p>
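As a quick illustration of how this table is typically used, here is a minimal sketch that tallies the degree distribution with `value_counts`. The rows below are hypothetical samples standing in for 董监高教育背景明细表.csv, not real data.

```python
import pandas as pd

# Hypothetical sample rows mimicking the schema above (not real data)
df1 = pd.DataFrame({
    'Symbol': ['000876', '000876', '000001'],
    'PersonID': ['P001', 'P002', 'P003'],
    'Degree': ['硕士', '博士', '硕士'],
    'University': ['四川大学', '清华大学', '北京大学'],
})

# Tally how many executives hold each degree
degree_counts = df1['Degree'].value_counts()
print(degree_counts)
```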
<h3 id="22-董监高个人特征">2.2 Personal Characteristics of Directors, Supervisors and Senior Executives</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;董监高个人特征.csv&#39;</span><span class="p">)</span>
<span class="n">df2</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<p>Inspect the fields:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">field_max_len = max([len(x) for x in df2.iloc[0, :].index])
desc_max_len = max([len(x) for x in df2.iloc[0, :].values])

for field, desc in zip(df2.iloc[0, :].index, df2.iloc[0, :].values):
    print(f&#39;- {field:&lt;{field_max_len}} {desc:&lt;{desc_max_len}}&#39;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Stkcd             证券代码
- Reptdt            统计截止日期
- PersonID          人员ID
- Name              姓名
- Nationality       国籍
- NativePlace       籍贯
- NatAreaCode       籍贯所在地区代码
- BirthPlace        出生地
- BirAreaCode       出生地所在地区代码
- Gender            性别
- Age               年龄
- University        毕业院校
- Degree            学历
- Major             专业
- Profession        职称
- Resume            个人简历
- PaidSign          是否领取薪酬
- TotalSalary       报告期报酬总额
- Allowance         其中：津贴
- SharEnd           年末持股数
- IsMTMT            是否高管团队成员
- TMTP              高管职务类别
- IsMTB             是否董事会成员
- CTB               董事会职务类别
- IsIdirecotr       是否独立董事
- IsDuality         是否兼任董事长和CEO
- IsSupervisor      是否监事
- Position          具体职务
- PositionID        具体职务ID
- ServicePosition   在职职务
- ServicePositionID 在职职务ID
- Funback           职业背景
- OveseaBack        海外背景
- Academic          学术背景
- FinBack           金融背景
- IsCocurP          是否在股东单位兼任
- OtherCo           兼任职务
- OtherCoType       兼任职务类别
- Director_TotCO    兼任职务为董事的公司总数
- Director_ListCO   兼任职务为董事的上市公司总数
- Stkcd_director    兼任职务为董事的上市公司代码
</code></pre></div><br>
<br>
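The background indicator fields (OveseaBack, FinBack, Academic, ...) lend themselves to grouped ratios. A minimal sketch, computed on hypothetical sample rows rather than the real 董监高个人特征.csv:

```python
import pandas as pd

# Hypothetical sample mimicking a few columns of the table (not real data)
df2 = pd.DataFrame({
    'PersonID': ['P001', 'P002', 'P003', 'P004'],
    'Gender': ['男', '女', '男', '男'],
    'Age': [52, 47, 61, 39],
    'OveseaBack': [1, 0, 0, 1],   # overseas background (1 = yes)
})

# Share of executives with an overseas background, by gender
oversea_ratio = df2.groupby('Gender')['OveseaBack'].mean()
print(oversea_ratio)
```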
<h3 id="23-董监高任职情况表">2.3 Position and Tenure Records of Directors, Supervisors and Senior Executives</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df3</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;董监高任职情况表.csv&#39;</span><span class="p">)</span>
<span class="n">df3</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">field_max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">index</span><span class="p">])</span>
<span class="n">desc_max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">values</span><span class="p">])</span>

<span class="k">for</span> <span class="n">field</span><span class="p">,</span> <span class="n">desc</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">df3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">values</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;- </span><span class="si">{</span><span class="n">field</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">field_max_len</span><span class="si">}}</span><span class="s1"> </span><span class="si">{</span><span class="n">desc</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">desc_max_len</span><span class="si">}}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Stkcd         证券代码    
- Reptdt        统计截止日期  
- PersonID      人员ID    
- Name          姓名      
- Position      具体职务    
- PositionID    具体职务ID  
- StartDate     任职开始日期  
- EndDate       任职结束日期  
- ServiceStatus 是否在职    
- Tenure        任期      
- ToLeavPost    距离离任剩余日期
- ResignReason  离职原因    
- GTAPosition   职务名称    
</code></pre></div><p><br><br></p>
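Since StartDate and EndDate are plain date strings, a tenure measure can be derived directly. A minimal sketch on hypothetical sample rows (the real 董监高任职情况表.csv already carries a Tenure field, so this is only for illustration):

```python
import pandas as pd

# Hypothetical sample mimicking the date columns of the table (not real data)
df3 = pd.DataFrame({
    'PersonID': ['P001', 'P002'],
    'StartDate': ['2015-06-01', '2018-01-01'],
    'EndDate': ['2018-06-01', '2021-01-01'],
})

df3['StartDate'] = pd.to_datetime(df3['StartDate'])
df3['EndDate'] = pd.to_datetime(df3['EndDate'])

# Tenure in years (approximate, using 365.25 days per year)
df3['TenureYears'] = (df3['EndDate'] - df3['StartDate']).dt.days / 365.25
print(df3[['PersonID', 'TenureYears']])
```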
<h2 id="三相关数据">3. Related Datasets</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2022-11-25-senior-manager-resume-dataset/">数据集(付费) | 90w条中国上市公司高管数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-05-17-top-manager-violation/">数据集 | 上市公司高管违规数据(2008-2022)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001-2022年A股上市公司年报&amp;管理层讨论与分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/">数据集(付费) | 三板上市公司年报2002-2023.12</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-03-listed-company-arbitration-dataset/">数据集 | 36330条上市公司仲裁数据(2000-2021)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-07-patent-application-dataset-of-listed-company-in-china-a-market/">数据集 | 上市公司 208 万条专利数据集 (1991-2022)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/">数据集 | 84w条业绩说明会问答数据(2005-2023)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-08-11-china-a-market-corporate-social-responsibility-dataste/">数据集 | 2006年-2022年企业社会责任报告</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-04-17-china-a-market-inquiry-letter-datasets/">数据集(付费) | 2014年-2022年监管问询函</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-04-26-entrusted-loan-dataset/">数据集| 07-21年上市公司「委托贷款公告」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/coporate_social_responsibility_datasets/">数据集 | 企业社会责任报告数据集</a></p>
</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
<title>数据集 | 3394w条豆瓣书评数据集</title>
      <link>https://textdata.cn/blog/2024-04-17-douban-book-3394w-ratings-comments-dataset/</link>
      <pubDate>Wed, 17 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-17-douban-book-3394w-ratings-comments-dataset/</guid>
<description>&lt;h2 id=&#34;一豆瓣读书介绍&#34;&gt;1. About Douban Reading&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Dataset: douban-book

Source: 豆瓣读书 (Douban Reading)
   
Records:
   - 120 tags
   - 17,967 books
   - 33,941,454 reviews
   
Review dates: 2005-06-12 ~ 2018-10-13
   
Size: 2.11 GB compressed (5.52 GB uncompressed)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The data has been preliminarily cleaned and can be used for recommender systems, sentiment analysis, knowledge graphs, sociological studies of cultural change, and other topics.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;2. Exploring the Data&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 Loading the Data&lt;/h3&gt;
&lt;p&gt;Download &lt;em&gt;&lt;strong&gt;douban_book.csv.gz&lt;/strong&gt;&lt;/em&gt;; decompressing it yields a single &lt;em&gt;&lt;strong&gt;douban_book.csv&lt;/strong&gt;&lt;/em&gt; file. Note that pandas can also read the compressed file directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;douban_book.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;33941454
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-所含字段&#34;&gt;2.2 Fields&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;col&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39; - &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt; - tag          标签
 - book_name    书名
 - user_name    书评人
 - date         书评发布日期
 - comment      书评内容
 - star         评分(1-5)
 - vote_count   书评获赞数
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
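Because the full file holds 33,941,454 rows, loading it in one call can exhaust memory on smaller machines. `pd.read_csv` with `chunksize` processes the file piece by piece instead; a minimal sketch, using a small in-memory CSV with hypothetical rows in place of douban_book.csv.gz:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for douban_book.csv.gz (hypothetical rows)
csv_text = "tag,book_name,star\n小说,活着,5\n小说,围城,4\n科幻,三体,5\n"

# Aggregate per chunk instead of holding all rows in memory at once
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += len(chunk)
print(total)
```

For the real file, replace `io.StringIO(csv_text)` with the path to the `.gz` and keep `chunksize` large (e.g. one million rows).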
&lt;h3 id=&#34;23--覆盖日期&#34;&gt;2.3 Date Coverage&lt;/h3&gt;
&lt;p&gt;Earliest and latest review dates:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2005-06-12 00:00:00
2018-10-13 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-标签&#34;&gt;2.4 Tags&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tag&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tag&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;120

[&amp;#39;思想&amp;#39; &amp;#39;科技&amp;#39; &amp;#39;金融&amp;#39; &amp;#39;政治学&amp;#39; &amp;#39;随笔&amp;#39; &amp;#39;爱情&amp;#39; &amp;#39;名著&amp;#39; &amp;#39;幾米&amp;#39; &amp;#39;人文&amp;#39; &amp;#39;交互&amp;#39; &amp;#39;悬疑&amp;#39; &amp;#39;算法&amp;#39; &amp;#39;哲学&amp;#39; &amp;#39;艺术史&amp;#39;
 &amp;#39;历史&amp;#39; &amp;#39;用户体验&amp;#39; &amp;#39;绘画&amp;#39; &amp;#39;诗词&amp;#39; &amp;#39;考古&amp;#39; &amp;#39;心理学&amp;#39; &amp;#39;互联网&amp;#39; &amp;#39;戏剧&amp;#39; &amp;#39;安妮宝贝&amp;#39; &amp;#39;艺术&amp;#39; &amp;#39;东野圭吾&amp;#39; &amp;#39;散文&amp;#39; &amp;#39;魔幻&amp;#39;
 &amp;#39;童话&amp;#39; &amp;#39;商业&amp;#39; &amp;#39;UCD&amp;#39; &amp;#39;日本文学&amp;#39; &amp;#39;武侠&amp;#39; &amp;#39;音乐&amp;#39; &amp;#39;通信&amp;#39; &amp;#39;科幻小说&amp;#39; &amp;#39;科普&amp;#39; &amp;#39;程序&amp;#39; &amp;#39;生活&amp;#39; &amp;#39;张悦然&amp;#39; &amp;#39;经济&amp;#39;
 &amp;#39;小说&amp;#39; &amp;#39;科幻&amp;#39; &amp;#39;军事&amp;#39; &amp;#39;心理&amp;#39; &amp;#39;文学&amp;#39; &amp;#39;电影&amp;#39; &amp;#39;社会学&amp;#39; &amp;#39;广告&amp;#39; &amp;#39;管理&amp;#39; &amp;#39;励志&amp;#39; &amp;#39;耽美&amp;#39; &amp;#39;郭敬明&amp;#39; &amp;#39;穿越&amp;#39;
 &amp;#39;阿加莎·克里斯蒂&amp;#39; &amp;#39;杂文&amp;#39; &amp;#39;传记&amp;#39; &amp;#39;韩寒&amp;#39; &amp;#39;设计&amp;#39; &amp;#39;落落&amp;#39; &amp;#39;言情&amp;#39; &amp;#39;职场&amp;#39; &amp;#39;成长&amp;#39; &amp;#39;佛教&amp;#39; &amp;#39;女性&amp;#39; &amp;#39;政治&amp;#39; &amp;#39;近代史&amp;#39;
 &amp;#39;营销&amp;#39; &amp;#39;推理小说&amp;#39; &amp;#39;建筑&amp;#39; &amp;#39;经典&amp;#39; &amp;#39;外国名著&amp;#39; &amp;#39;二战&amp;#39; &amp;#39;鲁迅&amp;#39; &amp;#39;J.K.罗琳&amp;#39; &amp;#39;奇幻&amp;#39; &amp;#39;外国文学&amp;#39; &amp;#39;校园&amp;#39; &amp;#39;人物传记&amp;#39;
 &amp;#39;西方哲学&amp;#39; &amp;#39;自由主义&amp;#39; &amp;#39;文化&amp;#39; &amp;#39;旅行&amp;#39; &amp;#39;张小娴&amp;#39; &amp;#39;企业史&amp;#39; &amp;#39;国学&amp;#39; &amp;#39;摄影&amp;#39; &amp;#39;亦舒&amp;#39; &amp;#39;青春&amp;#39; &amp;#39;科学&amp;#39; &amp;#39;策划&amp;#39; &amp;#39;web&amp;#39;
 &amp;#39;创业&amp;#39; &amp;#39;美术&amp;#39; &amp;#39;宗教&amp;#39; &amp;#39;古龙&amp;#39; &amp;#39;沧月&amp;#39; &amp;#39;村上春树&amp;#39; &amp;#39;社会&amp;#39; &amp;#39;股票&amp;#39; &amp;#39;理财&amp;#39; &amp;#39;日本漫画&amp;#39; &amp;#39;轻小说&amp;#39; &amp;#39;数学&amp;#39; &amp;#39;神经网络&amp;#39;
 &amp;#39;网络小说&amp;#39; &amp;#39;当代文学&amp;#39; &amp;#39;中国历史&amp;#39; &amp;#39;三毛&amp;#39; &amp;#39;回忆录&amp;#39; &amp;#39;古典文学&amp;#39; &amp;#39;交互设计&amp;#39; &amp;#39;推理&amp;#39; &amp;#39;高木直子&amp;#39; &amp;#39;中国文学&amp;#39; &amp;#39;青春文学&amp;#39;
 &amp;#39;金庸&amp;#39; &amp;#39;UE&amp;#39; &amp;#39;投资&amp;#39; &amp;#39;编程&amp;#39; &amp;#39;几米&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
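Combining `tag` with `star` gives, for example, the average rating per tag. A minimal sketch with hypothetical rows standing in for the real data:

```python
import pandas as pd

# Hypothetical sample with the same columns as the dataset (not real data)
df = pd.DataFrame({
    'tag': ['小说', '小说', '科幻', '科幻'],
    'star': [5, 3, 4, 5],
})

# Average rating per tag, highest first
tag_mean = df.groupby('tag')['star'].mean().sort_values(ascending=False)
print(tag_mean)
```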
&lt;h3 id=&#34;25--可视化&#34;&gt;2.5 Visualization&lt;/h3&gt;
&lt;p&gt;How the number of reviews posted changes over time:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# initialize matplotlib styling and Chinese-font configuration&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# detect the operating system&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the global font&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# build the monthly review-count series&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;date_series&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;volume_series&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;month_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# date is the month-end Timestamp; month_df holds that month of rows&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;date_series&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;volume_series&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;month_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;date_series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;volume_series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;



&lt;span class=&#34;c1&#34;&gt;# draw the plot&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
         &lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;volume&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
         &lt;span class=&#34;n&#34;&gt;linestyle&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scatter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
            &lt;span class=&#34;n&#34;&gt;volume_by_time_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;volume&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
            &lt;span class=&#34;n&#34;&gt;s&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;豆瓣读书随年份书评数量变化(2005.6.12 ~ 2018.10.13)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;书评数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;plot.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三相关内容&#34;&gt;III. Related Content&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/&#34;&gt;Dataset | Training Word2Vec on 10 Million Douban Movie Reviews&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四获取数据&#34;&gt;IV. Get the Data&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;douban-book&lt;/strong&gt;&lt;/em&gt; link: &lt;a href=&#34;https://pan.baidu.com/s/1qySKU_0dsoi1NAF9lQ971w?pwd=n5qe&#34;&gt;https://pan.baidu.com/s/1qySKU_0dsoi1NAF9lQ971w?pwd=n5qe&lt;/a&gt; extraction code: n5qe&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
<content:encoded><![CDATA[<h2 id="一豆瓣读书介绍">I. About the Douban Books Dataset</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset: douban-book

Source: Douban Books (豆瓣读书)

Records:
   - 120 tags
   - 17,967 books
   - 33,941,454 book reviews

Review date range: 2005-06-12 ~ 2018-10-13

Size: 2.11 GB (5.52 GB uncompressed)
</code></pre></div><p>The data has been pre-cleaned and can be used in many fields (or for many topics): recommender systems, sentiment analysis, knowledge graphs, sociological studies of cultural change, and more.</p>
<p><br><br></p>
<h2 id="二查看数据">II. Exploring the Data</h2>
<h3 id="21-读取数据">2.1 Reading the Data</h3>
<p>After downloading and extracting <em><strong>douban_book.csv.gz</strong></em>, you will find a <em><strong>douban_book.csv</strong></em> file in the dataset.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;douban_book.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">33941454
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
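Since the full file holds roughly 34 million rows, reading it all at once may exhaust memory. A minimal chunked-reading sketch — the tiny gzipped sample built here is a hypothetical stand-in for <em><strong>douban_book.csv.gz</strong></em>, with the same columns assumed from the field list below:

```python
import pandas as pd

# Hypothetical stand-in for douban_book.csv.gz: a small gzipped CSV
# with the same columns as the real dataset.
sample = pd.DataFrame({
    'tag': ['小说', '历史', '小说'],
    'book_name': ['A', 'B', 'C'],
    'user_name': ['u1', 'u2', 'u3'],
    'date': ['2005-06-12', '2010-01-01', '2018-10-13'],
    'comment': ['好书', '一般', '推荐'],
    'star': [5, 3, 4],
    'vote_count': [10, 0, 2],
})
sample.to_csv('douban_book_sample.csv.gz', index=False, compression='gzip')

# Read in chunks so the full 33.9M-row file never has to fit in memory at once.
total = 0
for chunk in pd.read_csv('douban_book_sample.csv.gz', chunksize=2):
    total += len(chunk)
print(total)
```

Each `chunk` is an ordinary DataFrame, so any of the aggregations below can be accumulated chunk by chunk.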
<h3 id="22-所含字段">2.2 Fields</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - tag          tag/category
 - book_name    book title
 - user_name    reviewer
 - date         date the review was posted
 - comment      review text
 - star         rating (1-5)
 - vote_count   number of upvotes on the review
</code></pre></div><br>
<h3 id="23--覆盖日期">2.3 Date Coverage</h3>
<p>Earliest and latest review dates:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2005-06-12 00:00:00
2018-10-13 00:00:00
</code></pre></div><br>
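The raw <code>date</code> column arrives as text; `pd.to_datetime` parses it, and `errors='coerce'` (an optional safeguard, not used in the original) turns any malformed entries into `NaT` instead of raising. A sketch on toy data:

```python
import pandas as pd

# Toy 'date' column; one deliberately malformed value shows errors='coerce'.
df = pd.DataFrame({'date': ['2005-06-12', '2018-10-13', 'not-a-date']})
df['date'] = pd.to_datetime(df['date'], errors='coerce')

print(df['date'].min())         # earliest parseable date
print(df['date'].max())         # latest parseable date
print(df['date'].isna().sum())  # rows that became NaT
```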
<h3 id="24-标签">2.4 Tags</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">tag</span><span class="o">.</span><span class="n">nunique</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">tag</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">120

[&#39;思想&#39; &#39;科技&#39; &#39;金融&#39; &#39;政治学&#39; &#39;随笔&#39; &#39;爱情&#39; &#39;名著&#39; &#39;幾米&#39; &#39;人文&#39; &#39;交互&#39; &#39;悬疑&#39; &#39;算法&#39; &#39;哲学&#39; &#39;艺术史&#39;
 &#39;历史&#39; &#39;用户体验&#39; &#39;绘画&#39; &#39;诗词&#39; &#39;考古&#39; &#39;心理学&#39; &#39;互联网&#39; &#39;戏剧&#39; &#39;安妮宝贝&#39; &#39;艺术&#39; &#39;东野圭吾&#39; &#39;散文&#39; &#39;魔幻&#39;
 &#39;童话&#39; &#39;商业&#39; &#39;UCD&#39; &#39;日本文学&#39; &#39;武侠&#39; &#39;音乐&#39; &#39;通信&#39; &#39;科幻小说&#39; &#39;科普&#39; &#39;程序&#39; &#39;生活&#39; &#39;张悦然&#39; &#39;经济&#39;
 &#39;小说&#39; &#39;科幻&#39; &#39;军事&#39; &#39;心理&#39; &#39;文学&#39; &#39;电影&#39; &#39;社会学&#39; &#39;广告&#39; &#39;管理&#39; &#39;励志&#39; &#39;耽美&#39; &#39;郭敬明&#39; &#39;穿越&#39;
 &#39;阿加莎·克里斯蒂&#39; &#39;杂文&#39; &#39;传记&#39; &#39;韩寒&#39; &#39;设计&#39; &#39;落落&#39; &#39;言情&#39; &#39;职场&#39; &#39;成长&#39; &#39;佛教&#39; &#39;女性&#39; &#39;政治&#39; &#39;近代史&#39;
 &#39;营销&#39; &#39;推理小说&#39; &#39;建筑&#39; &#39;经典&#39; &#39;外国名著&#39; &#39;二战&#39; &#39;鲁迅&#39; &#39;J.K.罗琳&#39; &#39;奇幻&#39; &#39;外国文学&#39; &#39;校园&#39; &#39;人物传记&#39;
 &#39;西方哲学&#39; &#39;自由主义&#39; &#39;文化&#39; &#39;旅行&#39; &#39;张小娴&#39; &#39;企业史&#39; &#39;国学&#39; &#39;摄影&#39; &#39;亦舒&#39; &#39;青春&#39; &#39;科学&#39; &#39;策划&#39; &#39;web&#39;
 &#39;创业&#39; &#39;美术&#39; &#39;宗教&#39; &#39;古龙&#39; &#39;沧月&#39; &#39;村上春树&#39; &#39;社会&#39; &#39;股票&#39; &#39;理财&#39; &#39;日本漫画&#39; &#39;轻小说&#39; &#39;数学&#39; &#39;神经网络&#39;
 &#39;网络小说&#39; &#39;当代文学&#39; &#39;中国历史&#39; &#39;三毛&#39; &#39;回忆录&#39; &#39;古典文学&#39; &#39;交互设计&#39; &#39;推理&#39; &#39;高木直子&#39; &#39;中国文学&#39; &#39;青春文学&#39;
 &#39;金庸&#39; &#39;UE&#39; &#39;投资&#39; &#39;编程&#39; &#39;几米&#39;]
</code></pre></div><br>
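Beyond listing the distinct tags, `value_counts` gives the review volume per tag; a sketch on a toy table with the same `tag` column:

```python
import pandas as pd

# Toy review table: review volume per tag via value_counts.
df = pd.DataFrame({'tag': ['小说', '历史', '小说', '哲学', '小说']})
counts = df['tag'].value_counts()  # sorted most-frequent first
print(counts.index[0])             # busiest tag
print(int(counts.iloc[0]))         # its review count
```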
<h3 id="25--可视化">2.5 Visualization</h3>
<p>Review volume over time:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>

<span class="c1"># Initialize matplotlib styling and Chinese font support</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># detect the operating system</span>
<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># set the global font</span>


<span class="c1"># Build the plotting data</span>
<span class="n">date_series</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">volume_series</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="c1"># date is the month-end Timestamp; year_df is that month&#39;s DataFrame</span>
    <span class="n">date_series</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">date</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
    <span class="n">volume_series</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">year_df</span><span class="p">))</span>
<span class="n">volume_by_time_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;date&#39;</span><span class="p">:</span> <span class="n">date_series</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="n">volume_series</span><span class="p">})</span>
<span class="n">volume_by_time_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">volume_by_time_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>



<span class="c1"># Start plotting</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>

<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">volume_by_time_df</span><span class="o">.</span><span class="n">date</span><span class="p">,</span> 
         <span class="n">volume_by_time_df</span><span class="o">.</span><span class="n">volume</span><span class="p">,</span>
         <span class="n">linestyle</span> <span class="o">=</span> <span class="s1">&#39;--&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">volume_by_time_df</span><span class="o">.</span><span class="n">date</span><span class="p">,</span> 
            <span class="n">volume_by_time_df</span><span class="o">.</span><span class="n">volume</span><span class="p">,</span> 
            <span class="n">s</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;豆瓣读书随年份书评数量变化(2005.6.12 ~ 2018.10.13)&#39;</span><span class="p">,</span> 
          <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;日期&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;书评数量&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;plot.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
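The `groupby(pd.Grouper(...))` loop above can be condensed with `resample` (same `'M'` month-end alias as the article; newer pandas versions prefer `'ME'`). A sketch on toy data:

```python
import pandas as pd

# Toy reviews: one row per review.
df = pd.DataFrame({
    'date': pd.to_datetime(['2005-06-12', '2005-06-20', '2005-07-01']),
    'comment': ['a', 'b', 'c'],
})

# Equivalent of the groupby(pd.Grouper(key='date', freq='M')) loop:
# monthly review counts in one expression.
volume_by_time_df = (
    df.resample('M', on='date').size().reset_index(name='volume')
)
print(volume_by_time_df['volume'].tolist())
```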
<p><br><br></p>
<h2 id="三相关内容">III. Related Content</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/">Dataset | Training Word2Vec on 10 Million Douban Movie Reviews</a></li>
</ul>
<p><br><br></p>
<h2 id="四获取数据">IV. Get the Data</h2>
<p><em><strong>douban-book</strong></em> link: <a href="https://pan.baidu.com/s/1qySKU_0dsoi1NAF9lQ971w?pwd=n5qe">https://pan.baidu.com/s/1qySKU_0dsoi1NAF9lQ971w?pwd=n5qe</a> extraction code: n5qe</p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>Dataset | Training Word2Vec on 10 Million Douban Movie Reviews</title>
      <link>https://textdata.cn/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/</link>
      <pubDate>Tue, 16 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-16-douban-movie-1000w-ratings-comments-dataset/</guid>
      <description>&lt;p&gt;In this post we:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Introduce the Douban movie-review dataset&lt;/li&gt;
&lt;li&gt;Build a corpus and train a &lt;strong&gt;&lt;em&gt;Word2Vec&lt;/em&gt;&lt;/strong&gt; model&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一豆瓣影评数据集&#34;&gt;I. The Douban Movie-Review Dataset&lt;/h2&gt;
&lt;h3 id=&#34;11-数据集介绍&#34;&gt;1.1 About the Dataset&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Dataset: douban-movie-1000w

Source: Douban Movies (豆瓣电影)

Records:
   - 10,269 movies
   - 10,310,989 reviews

Size: 1.35 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This dataset helps fill the gap in publicly available Chinese movie datasets. The data has been pre-cleaned and can be used in many fields (or for many topics): recommender systems, sentiment analysis, knowledge graphs, communication studies, sociological studies of cultural change, and more.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-读取数据&#34;&gt;1.2 Reading the Data&lt;/h3&gt;
&lt;p&gt;After downloading and extracting &lt;strong&gt;&lt;em&gt;douban-movie-1000w.zip&lt;/em&gt;&lt;/strong&gt;, you will find an &lt;strong&gt;&lt;em&gt;all_movies_with_id.csv&lt;/em&gt;&lt;/strong&gt; file in the dataset.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;all_movies_with_id.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-所含字段&#34;&gt;1.3 Fields&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;col&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39; - &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt; - ID
 - Movie_Name            movie title
 - Score                 Douban movie rating (1-10)
 - Review_People         number of reviewers
 - Star_Distribution     rating distribution (1-5; multiple values separated by %)
 - Craw_Date             date the crawler ran
 - Username              Douban reviewer username
 - Date                  review date
 - Star                  review rating (1-5)
 - Comment               review text
 - Comment_Distribution  review rating distribution
 - Like                  number of likes the review received
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二-构造语料训练-word2vec&#34;&gt;II. Building the Corpus &amp;amp; Training Word2Vec&lt;/h2&gt;
&lt;h3 id=&#34;21-构造语料&#34;&gt;2.1 Building the Corpus&lt;/h3&gt;
&lt;p&gt;Collect all text from the &lt;strong&gt;&lt;em&gt;Comment&lt;/em&gt;&lt;/strong&gt; field into &lt;strong&gt;&lt;em&gt;douban-movie-1000w.txt&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Read the data, keeping only the Comment column&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;all_movies_with_id.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;usecols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Comment&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Write all text from the Comment column to douban-movie-1000w.txt&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;douban-movie-1000w.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Comment&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Delete df and text to free memory&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;del&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;del&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
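Joining ten million comments into a single in-memory string is memory-hungry; an alternative sketch that streams comments to the corpus file chunk by chunk (the tiny CSV built here is a hypothetical stand-in for all_movies_with_id.csv, with only the Comment column):

```python
import pandas as pd

# Hypothetical stand-in for all_movies_with_id.csv: only Comment matters here.
pd.DataFrame({'Comment': ['好电影', None, '一般般', '推荐']}).to_csv(
    'movies_sample.csv', index=False)

# Stream comments to the corpus file chunk by chunk instead of
# building one giant joined string in memory.
n_lines = 0
with open('corpus_sample.txt', 'w', encoding='utf-8') as f:
    for chunk in pd.read_csv('movies_sample.csv', usecols=['Comment'],
                             chunksize=2):
        for comment in chunk['Comment'].dropna():
            f.write(str(comment) + '\n')
            n_lines += 1
print(n_lines)
```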
&lt;h3 id=&#34;22-配置-cntext&#34;&gt;2.2 Installing cntext&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install cntext==2.1.6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-训练-word2vec&#34;&gt;2.3 Training Word2Vec&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# requires cntext 2.1.6&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Train the Word2Vec model&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Word2Vec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;corpus_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;douban-movie-1000w.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                 &lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                 &lt;span class=&#34;n&#34;&gt;window_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                 &lt;span class=&#34;n&#34;&gt;min_count&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Mac(Linux) System, Enable Parallel Processing
Cache output/douban-movie-1000w_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/douban-movie-1000w_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2965 s.
Output Saved To: output/douban-movie-1000w.200.15.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After roughly 50 minutes of training (2,965 s, per the log above), this produces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model file &lt;strong&gt;&lt;em&gt;output/douban-movie-1000w-Word2Vec.200.15.txt&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Model file &lt;strong&gt;&lt;em&gt;output/douban-movie-1000w-Word2Vec.200.15.bin&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Cache file &lt;strong&gt;&lt;em&gt;output/douban-movie-1000w_cache.txt&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The model is saved in both &lt;strong&gt;&lt;em&gt;txt&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;bin&lt;/em&gt;&lt;/strong&gt; formats, which carry exactly the same information. The &lt;strong&gt;&lt;em&gt;txt&lt;/em&gt;&lt;/strong&gt; file can be opened in any text editor, while the &lt;strong&gt;&lt;em&gt;bin&lt;/em&gt;&lt;/strong&gt; file is binary and smaller, which makes a trained model easier to share.&lt;/p&gt;
&lt;br&gt;
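For reference, the txt model format is plain text: a header line holding the vocabulary size and vector dimensionality, then one word per line followed by its vector. A minimal parser sketch over made-up toy lines (not the real model file):

```python
# Minimal illustration of the word2vec .txt format: a header line
# 'vocab_size vector_size', then one 'word v1 v2 ...' line per word.
lines = [
    '2 3',               # toy header: 2 words, 3-dimensional vectors
    '电影 0.1 0.2 0.3',  # made-up numbers
    '导演 0.4 0.5 0.6',
]
header = lines[0].split()
vocab_size, dim = int(header[0]), int(header[1])

vectors = {}
for line in lines[1:]:
    parts = line.split()
    vectors[parts[0]] = [float(x) for x in parts[1:]]

assert len(vectors) == vocab_size
print(vectors['电影'])
```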
&lt;h3 id=&#34;24-评估模型&#34;&gt;2.4 Evaluating the Model&lt;/h3&gt;
&lt;p&gt;Judge the model with near-synonym and analogy tests; see the &lt;a href=&#34;https://cntext.readthedocs.io/zh-cn/latest/model.html&#34;&gt;documentation&lt;/a&gt; for details.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_similarity&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate_analogy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&amp;#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   459    |     78     |            0.43            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&amp;lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   615    |     62     |   39.02    |   2.98   |
|   CityInProvince   |   175    |     0      |   28.57    |   4.74   |
| FamilyRelationship |   272    |     0      |   92.65    |   1.48   |
|   SocialScience    |    8     |     62     |   25.00    |   6.00   |
+--------------------+----------+------------+------------+----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Near-synonym test&lt;/strong&gt;: Spearman&amp;rsquo;s rank coefficient takes values in [-1, 1]; the higher the value, the better the model performs.&lt;/p&gt;
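The near-synonym test correlates model similarities with human similarity ratings via this coefficient; a from-scratch sketch on toy scores (no tie handling, which is enough for distinct values):

```python
def rank(values):
    # Ranks 1..n, ascending; ties are not handled (toy values are distinct).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks

def spearman(xs, ys):
    # Spearman = Pearson correlation of the two rank sequences.
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data: human similarity scores vs. model cosine similarities.
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.9, 0.2, 0.1]
print(round(spearman(human, model), 2))
```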
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Analogy test&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CapitalOfCountries: the movie-review corpus does reasonably well here, probably because a fair share of the film catalog is foreign material.&lt;/li&gt;
&lt;li&gt;CityInProvince: the corpus does poorly here. It is unlikely that Chinese material is scarce; my guess is that most provinces and cities appear under fictional names such as 汉东省 (Handong Province). In the &lt;a href=&#34;https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/&#34;&gt;People&amp;#39;s Daily Online message-board corpus Word2Vec&lt;/a&gt;, this category reaches 100% accuracy.&lt;/li&gt;
&lt;li&gt;FamilyRelationship: movie reviews reflect film content, and human nature is the eternal theme of cinema, so family life and relatives come up constantly; accuracy here reaches 92.65%. By comparison, on the &lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;annual report MD&amp;amp;A&lt;/a&gt; corpus this figure is only about 10%.&lt;/li&gt;
&lt;li&gt;SocialScience: mediocre performance, presumably because common social-science terms are rarely mentioned in the corpus.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall, the model trained on this corpus performs well and captures the semantics specific to this data domain.&lt;/p&gt;
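The analogy test answers questions of the form a is to b as c is to ? by vector arithmetic (b - a + c, then a cosine nearest-neighbor search). A toy sketch with made-up vectors:

```python
# Toy embedding illustrating the analogy test: solve
# '爸爸 : 妈妈 = 爷爷 : ?' with b - a + c and a cosine nearest-neighbor.
vectors = {
    '爸爸': [1.0, 1.0, 0.0],  # made-up vectors
    '妈妈': [1.0, 0.0, 1.0],
    '爷爷': [2.0, 1.0, 0.0],
    '奶奶': [2.0, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def analogy(a, b, c):
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    # Nearest neighbor by cosine, excluding the three query words.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy('爸爸', '妈妈', '爷爷'))
```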
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四使用-word2vec&#34;&gt;III. Using Word2Vec&lt;/h2&gt;
&lt;h3 id=&#34;41-导入-word2vec-模型文件&#34;&gt;3.1 Loading the Word2Vec Model File&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Load the model; mind the path.&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# This script sits in the same folder as the output directory&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/douban-movie-1000w-Word2Vec.200.15.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# dm_w2v = ct.load_w2v(&amp;#39;output/douban-movie-1000w-Word2Vec.200.15.bin&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Loading output/douban-movie-1000w-Word2Vec.200.15.txt...
&amp;lt;gensim.models.keyedvectors.KeyedVectors at 0x314193830&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;42-keyedvectors-的操作方法或属性&#34;&gt;4.2 KeyedVectors 的操作方法(或属性)&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;方法&lt;/th&gt;
&lt;th&gt;描述&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.index_to_key&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词汇表中的所有单词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.key_to_index&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取单词到索引的映射。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.vector_size&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取模型中词向量的维度。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.get_vector(word)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取给定单词的词向量。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_word(word, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取某词语最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;KeyedVectors.similar_by_vector(vector, topn=10)&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;获取词向量最相似的 10 个近义词。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-查看词表&#34;&gt;4.3 查看词表&lt;/h3&gt;
&lt;p&gt;查看词表所有单词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index_to_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;电影&amp;#39;,
 &amp;#39;一个&amp;#39;,
 &amp;#39;没有&amp;#39;,
 &amp;#39;喜欢&amp;#39;,
 ...
 &amp;#39;跟着&amp;#39;,
 &amp;#39;意识&amp;#39;,
 &amp;#39;态度&amp;#39;,
 ...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;为了方便查看， 这里只展示部分数据。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;44-词表映射&#34;&gt;4.4 词表映射&lt;/h3&gt;
&lt;p&gt;查看单词到索引的映射&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;电影&amp;#39;: 0,
 &amp;#39;一个&amp;#39;: 1,
 &amp;#39;没有&amp;#39;: 2,
...
&amp;#39;跟着&amp;#39;: 997,
 &amp;#39;意识&amp;#39;: 998,
 &amp;#39;态度&amp;#39;: 999,
 ...}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;45-向量维度数&#34;&gt;4.5 向量维度数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;词表有 &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;key_to_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; 个词&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;向量是 &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vector_size&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt; 维&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;词表有 426646 个词
向量是 200 维
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;46-获取词向量&#34;&gt;4.6 获取词向量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;给力&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([-1.24090052e+00, -6.79377019e-01,  1.42518425e+00, -1.46615291e+00,
       -9.53197628e-02,  6.50456071e-01, -2.97696137e+00,  2.20916629e+00,
        6.12876177e-01,  1.63172066e+00,  4.91760701e-01,
        ......
         -1.42494082e+00,  2.49131727e+00, -6.27597034e-01, -7.91438043e-01,
       -4.54898655e-01,  1.37747681e+00, -4.20672953e-01, -1.53694853e-01,
        1.04936564e+00,  2.18786263e+00, -8.07472587e-01, -8.32003877e-02],
      dtype=float32)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;47-近义词&#34;&gt;4.7 近义词&lt;/h3&gt;
&lt;p&gt;根据词语查看近义词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 近义词&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;给力&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;相当给力&amp;#39;, 0.6180022358894348),
 (&amp;#39;太给力&amp;#39;, 0.6019443273544312),
 (&amp;#39;带劲&amp;#39;, 0.5840415954589844),
 (&amp;#39;不给力&amp;#39;, 0.5774183869361877),
 (&amp;#39;过瘾&amp;#39;, 0.5616626739501953),
 (&amp;#39;牛叉&amp;#39;, 0.553788959980011),
 (&amp;#39;出彩&amp;#39;, 0.5414286851882935),
 (&amp;#39;精彩&amp;#39;, 0.5332293510437012),
 (&amp;#39;看得过瘾&amp;#39;, 0.5250197649002075),
 (&amp;#39;大赞&amp;#39;, 0.5205727219581604)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;根据向量查找最相似的近义词&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;word_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;给力&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;给力&amp;#39;, 1.0),
 (&amp;#39;相当给力&amp;#39;, 0.6180021166801453),
 (&amp;#39;太给力&amp;#39;, 0.6019443273544312),
 (&amp;#39;带劲&amp;#39;, 0.5840415954589844),
 (&amp;#39;不给力&amp;#39;, 0.5774183869361877),
 (&amp;#39;过瘾&amp;#39;, 0.5616626739501953),
 (&amp;#39;牛叉&amp;#39;, 0.5537890195846558),
 (&amp;#39;出彩&amp;#39;, 0.5414287447929382),
 (&amp;#39;精彩&amp;#39;, 0.5332292914390564),
 (&amp;#39;看得过瘾&amp;#39;, 0.5250197649002075)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;48-计算多个词的中心向量&#34;&gt;4.8 计算多个词的中心向量&lt;/h3&gt;
&lt;p&gt;我们可以计算「宇宙」、「飞船」、「战争」三个词的中心向量（可视作宇宙主题的语义向量），并查找与中心向量 &lt;strong&gt;&lt;em&gt;universe_vector&lt;/em&gt;&lt;/strong&gt; 最相似的 20 个词。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 几个词语构建的宇宙语义向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;universe_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;semantic_centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                       &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;宇宙&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;飞船&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;战争&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;universe_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;宇宙&amp;#39;, 0.7568532228469849),
 (&amp;#39;星系&amp;#39;, 0.7090039253234863),
 (&amp;#39;飞船&amp;#39;, 0.7080673575401306),
 (&amp;#39;人类文明&amp;#39;, 0.6973789930343628),
 (&amp;#39;战舰&amp;#39;, 0.6890057325363159),
 (&amp;#39;母舰&amp;#39;, 0.6864359974861145),
 (&amp;#39;星球&amp;#39;, 0.6799622774124146),
 (&amp;#39;卫星&amp;#39;, 0.6799139976501465),
 (&amp;#39;星际&amp;#39;, 0.6789332032203674),
 (&amp;#39;空间站&amp;#39;, 0.6780815124511719),
 (&amp;#39;地球&amp;#39;, 0.6769616603851318),
 (&amp;#39;外太空&amp;#39;, 0.6683873534202576),
 (&amp;#39;核战&amp;#39;, 0.6669113039970398),
 (&amp;#39;外星飞船&amp;#39;, 0.6592534780502319),
 (&amp;#39;木星&amp;#39;, 0.6586896777153015),
 (&amp;#39;能源&amp;#39;, 0.6562989950180054),
 (&amp;#39;战争&amp;#39;, 0.6556441187858582),
 (&amp;#39;巨兽&amp;#39;, 0.6544537544250488),
 (&amp;#39;月球&amp;#39;, 0.6525537967681885),
 (&amp;#39;一艘&amp;#39;, 0.6521110534667969)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;语义捕捉得很准。&lt;/p&gt;
&lt;h3 id=&#34;49-概念轴&#34;&gt;4.9 概念轴&lt;/h3&gt;
&lt;p&gt;男性概念向量由多个男性词的向量加总求均值得到，女性概念向量算法类似。当性质或方向明显相反的两个概念向量相减，得到的新向量，我们可以称之为&lt;strong&gt;&lt;em&gt;概念轴向量 Concept Axis&lt;/em&gt;&lt;/strong&gt;。常见的概念轴如下:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 尺寸(大, 小)
- 湿度(干燥,潮湿)
- 性别(男, 女)
- 财富(富裕, 贫穷)
- 等
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;其实任意概念的向量也可看做概念轴，即该概念向量与 0 向量相减。只不过由两组性质方向相反的词语构造出的概念轴，在语义上更稳定。&lt;/p&gt;
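&lt;p&gt;概念轴投影的数学本质，可以用纯 numpy 勾勒如下（示意代码，向量为随意取值，并非 cntext 的实际实现）：&lt;/p&gt;

```python
import numpy as np

def centroid(vectors):
    # 多个词向量求均值，得到概念向量
    return np.mean(vectors, axis=0)

# 随意取值的 4 维示例词向量
rich = centroid([np.array([1.0, 0.2, 0.0, 0.1]),
                 np.array([0.8, 0.1, 0.1, 0.0])])
poor = centroid([np.array([-0.9, 0.1, 0.0, 0.2]),
                 np.array([-1.0, 0.3, 0.1, 0.1])])

# 概念轴向量: 两个性质相反的概念向量相减，再单位化
axis = rich - poor
axis = axis / np.linalg.norm(axis)

# 任意词向量在概念轴上的投影(标量)，数值越大越靠近 rich 一端
word = np.array([0.5, 0.0, 0.2, 0.1])
projection = float(np.dot(word, axis))
print(round(projection, 3))
```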
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 定义词语列表&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;phy_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;游泳&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;跑步&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;篮球&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;羽毛球&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;马拉松&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;马术&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;徒步&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;rich_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;富裕&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;财富&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;金钱&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;豪宅&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;豪车&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;奢侈品&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;投资&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;股票&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;基金&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;黄金&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;钻石&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;游艇&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;私人飞机&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;企业家&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;富豪&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;成功&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;繁荣&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;奢华&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;贵族&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;高收入&amp;#39;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;poor_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;贫穷&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;贫困&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;饥饿&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;失业&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;低收入&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;简陋&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;破旧&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;乞丐&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;流浪&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;欠债&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;破产&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;困境&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;艰难&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;挣扎&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;匮乏&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;s1&#34;&gt;&amp;#39;落后&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;无助&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;绝望&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;赤贫&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;温饱&amp;#39;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;phy_project_on_fortune&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sematic_projection&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                               &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;phy_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                               &lt;span class=&#34;n&#34;&gt;poswords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rich_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                               &lt;span class=&#34;n&#34;&gt;negwords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;poor_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                               &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;phy_project_on_fortune&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;跑步&amp;#39;, -1.82),
 (&amp;#39;徒步&amp;#39;, -0.82),
 (&amp;#39;游泳&amp;#39;, -0.19),
 (&amp;#39;羽毛球&amp;#39;, 0.57),
 (&amp;#39;马拉松&amp;#39;, 0.62),
 (&amp;#39;马术&amp;#39;, 1.15),
 (&amp;#39;篮球&amp;#39;, 4.0)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;计算结果中，数值越大越接近 poswords（富裕）一端，越小越接近 negwords（贫穷）一端。可以看到在财富概念轴向量上的投影，篮球一项不太准，但其他几项基本反映出各项运动的贫富属性。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;410-类比-king-man--woman--queen&#34;&gt;4.10 类比 king-man + woman ~ queen&lt;/h3&gt;
&lt;p&gt;每个词是高维向量空间中的一个点，两个点可以构成有方向的向量，而向量之间可以比较方向。下面是推理过程；受限于数据，公式不一定完全成立，但思路可以类比。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/king-queen-formular.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;这两个词向量相减，直觉上应得到性别方向（从女性指向男性）。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;gender_direction_1 = vector(man)-vector(woman)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;gender_direction_2 = vector(king)-vector(queen)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;那两个性别方向应该近似，即 gender_direction_1 约等于 gender_direction_2 ，将其看做等式就得到如下公式：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;vector(理应近似 queen) = vector(king)-vector(man)+vector(woman)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;现在我们检查三个语义向量计算出的新的向量是否有与 queen 相关的语义信息。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;semantic_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;semantic_centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                  &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vector&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;men_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;semantic_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;男&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;男孩&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;男人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;他&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;父亲&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;爸爸&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;爷爷&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;women_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;semantic_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;女&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;女孩&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;女人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;她&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;母亲&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;妈妈&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;奶奶&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;king_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;semantic_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;皇帝&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;帝王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;大帝&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 假设 king - queen 约等于 man - woman&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 则 result 近似等于 king - man + woman&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;king_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;men_vector&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;women_vector&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 现在检查 result_vector 的语义应该与queen相关&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;dm_w2v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;similar_by_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result_vector&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topn&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[(&amp;#39;皇帝&amp;#39;, 0.8448051810264587),
 (&amp;#39;王后&amp;#39;, 0.8056979179382324),
 (&amp;#39;国王&amp;#39;, 0.8004385232925415),
 (&amp;#39;帝王&amp;#39;, 0.7693961262702942),
 (&amp;#39;君主&amp;#39;, 0.7663125991821289),
 (&amp;#39;皇后&amp;#39;, 0.7614380717277527),
 (&amp;#39;太后&amp;#39;, 0.7463700175285339),
 (&amp;#39;妃子&amp;#39;, 0.7433678507804871),
 (&amp;#39;君王&amp;#39;, 0.7407413125038147),
 (&amp;#39;皇子&amp;#39;, 0.7380139231681824),
 (&amp;#39;王位&amp;#39;, 0.7319545745849609),
 (&amp;#39;皇上&amp;#39;, 0.7215542197227478),
 (&amp;#39;登基&amp;#39;, 0.7210745215415955),
 (&amp;#39;大臣&amp;#39;, 0.714862048625946),
 (&amp;#39;伊丽莎白一世&amp;#39;, 0.702217698097229),
 (&amp;#39;王朝&amp;#39;, 0.7000151872634888),
 (&amp;#39;宫女&amp;#39;, 0.6997070908546448),
 (&amp;#39;驾崩&amp;#39;, 0.6992778182029724),
 (&amp;#39;王妃&amp;#39;, 0.6981185078620911),
 (&amp;#39;昏君&amp;#39;, 0.6974363923072815)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;可以看到由三个语义向量加减运算得到的 result_vector，与 queen（王后、皇后）仍具有较高的相关性。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五获取资料&#34;&gt;五、获取资料&lt;/h2&gt;
&lt;p&gt;除了本文介绍的这个 1000w 条影评数据集，大邓还有 2 个类似的豆瓣影评数据集，影评记录量分别为 212w 条和 442w 条。两个数据集的下载链接都已公开，感兴趣的可以一并下载。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 免费  douba-movie-1000w 链接: https://pan.baidu.com/s/15C0fn7oyYEFvuQtPO8tw8Q?pwd=1g7m 提取码: 1g7m
- 免费 douban-movie-1000w-Word2Vec.200.15.bin 链接: https://pan.baidu.com/s/1fK8LhLmK4_xq-eHzNn42lg?pwd=2hwr 提取码: 2hwr
- 免费 douban-movie-442w 链接: https://pan.baidu.com/s/1T_LPuxEZ_W8xfYcxV7rW5Q?pwd=a683 提取码: a683
- 免费 douban-movie-212w 链接: https://pan.baidu.com/s/1VBwnOqfMPu_Y48bMlQ4oiw?pwd=t8id 提取码: t8id

- 免费词向量      https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-04-17-douban-book-3394w-ratings-comments-dataset/&#34;&gt;数据集 | 3394w 条豆瓣书评数据集&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/&#34;&gt;实验 | 使用 Stanford Glove 代码训练中文语料的 GloVe 模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/&#34;&gt;可视化 | 人民日报语料反映七十年文化演变&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;词向量 | 使用 MD&amp;amp;A2001-2023 语料训练 Word2Vec 模型&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>本文内容</p>
<ol>
<li>介绍豆瓣影评数据集</li>
<li>构造语料训练 <strong><em>Word2Vec</em></strong> 模型</li>
</ol>
<p><br><br></p>
<h2 id="一豆瓣影评数据集">一、豆瓣影评数据集</h2>
<h3 id="11-数据集介绍">1.1 数据集介绍</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集: douba-movie-1000w

数据源: 豆瓣电影

记录数:
   - 电影 10269 部
   - 影评 10310989 条

体积: 1.35G
</code></pre></div><p>该数据集正好弥补了国内公开电影数据集的空缺。数据已经过初步清洗，可用于推荐系统、情感分析、知识图谱、新闻传播学、社会学文化变迁等多个领域(或主题)。</p>
<br>
<h3 id="12-读取数据">1.2 读取数据</h3>
<p>下载 <strong><em>douba-movie-1000w.zip</em></strong> 解压后，可以看到数据集中有一个 <strong><em>all_movies_with_id.csv</em></strong> 文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;all_movies_with_id.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="13-所含字段">1.3 所含字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39; - </span><span class="si">{</span><span class="n">col</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - ID
 - Movie_Name  电影名
 - Score  豆瓣电影评分(1-10)
 - Review_People  评论者人数
 - Star_Distribution  评论评分分布(1-5, 含多个数值，数值以%间隔)
 - Craw_Date 爬虫运行日期
 - Username 豆瓣评论者用户名
 - Date 影评日期
 - Star  影评评分(1-5)
 - Comment 影评内容
 - Comment_Distribution 影评评分分布
 - Like 影评获得的喜欢数
</code></pre></div><p><br><br></p>
<h2 id="二-构造语料训练-word2vec">二、 构造语料&amp;训练 Word2Vec</h2>
<h3 id="21-构造语料">2.1 构造语料</h3>
<p>将字段 <strong><em>Comment</em></strong> 中的所有文本汇总到 <strong><em>douban-movie-1000w.txt</em></strong>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># 读取数据，只读取 Comment 字段</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;all_movies_with_id.csv&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Comment&#39;</span><span class="p">])</span>

<span class="c1"># 将 Comment 列中的所有文本汇总到 douban-movie-1000w.txt</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;douban-movie-1000w.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Comment&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">))</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

<span class="c1"># 删除df和text变量，释放内存</span>
<span class="k">del</span> <span class="n">df</span>
<span class="k">del</span> <span class="n">text</span>

</code></pre></div><br>
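<p>上面的写法会把整列文本一次性载入内存。若机器内存紧张，也可以用 pandas 的 chunksize 参数分块读写。下面用一个临时构造的几行示例 csv 演示这一思路（demo.csv、示例文本均为假设，实际使用时替换为 all_movies_with_id.csv）：</p>

```python
import pandas as pd

# 先构造一个几行的示例 csv(实际使用时替换为 all_movies_with_id.csv)
pd.DataFrame({'Comment': ['好看', None, '一般']}).to_csv('demo.csv', index=False)

# 分块读取, 边读边写, 避免一次性把 1000 万条记录载入内存
with open('demo.txt', 'w', encoding='utf-8') as f:
    for chunk in pd.read_csv('demo.csv', usecols=['Comment'], chunksize=2):
        f.write('\n'.join(chunk['Comment'].fillna('').astype(str)))
        f.write('\n')

print(open('demo.txt', encoding='utf-8').read())
```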
<h3 id="22-配置-cntext">2.2 配置 cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext==2.1.6
</code></pre></div><br>
<h3 id="23-训练-word2vec">2.3 训练 Word2Vec</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># cntext为2.1.6</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 训练Word2Vec模型</span>
<span class="n">w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">Word2Vec</span><span class="p">(</span><span class="n">corpus_file</span><span class="o">=</span><span class="s1">&#39;douban-movie-1000w.txt&#39;</span><span class="p">,</span>
                 <span class="n">vector_size</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                 <span class="n">window_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
                 <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Mac(Linux) System, Enable Parallel Processing
Cache output/douban-movie-1000w_cache.txt Not Found or Empty, Preprocessing Corpus
Reading Preprocessed Corpus from output/douban-movie-1000w_cache.txt
Start Training Word2Vec
Word2Vec Training Cost 2965 s.
Output Saved To: output/douban-movie-1000w.200.15.txt
</code></pre></div><p>经过约 50 分钟（2965 秒）的训练，得到</p>
<ul>
<li>模型文件 <strong><em>output/douban-movie-1000w-Word2Vec.200.15.txt</em></strong></li>
<li>模型文件 <strong><em>output/douban-movie-1000w-Word2Vec.200.15.bin</em></strong></li>
<li>缓存文件 <strong><em>output/douban-movie-1000w_cache.txt</em></strong></li>
</ul>
<p>模型文件有 <strong><em>txt</em></strong> 和 <strong><em>bin</em></strong> 两种格式，信息量完全等同。<strong><em>txt</em></strong> 是纯文本，可以用记事本打开查看；<strong><em>bin</em></strong> 是二进制文件，体积更小。分享已训练好的模型时，推荐使用体积更小的 bin 格式。</p>
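<p>txt 格式遵循 word2vec 的纯文本约定：首行是「词表大小 向量维度」，其后每行是「词语 数值1 数值2 ...」。下面用几行虚构的小数据演示该格式的解析过程（词语与数值均为示例，并非真实模型文件）：</p>

```python
# 演示 word2vec txt 格式的解析(数据为虚构示例)
lines = [
    '3 4',                      # 首行: 词表大小 向量维度
    '电影 0.1 0.2 0.3 0.4',
    '喜欢 0.5 0.6 0.7 0.8',
    '给力 0.9 1.0 1.1 1.2',
]

vocab_size, dim = map(int, lines[0].split())

vectors = {}
for line in lines[1:]:
    word, *nums = line.split()
    vectors[word] = [float(x) for x in nums]

print(vocab_size, dim)     # 3 4
print(vectors['给力'])      # [0.9, 1.0, 1.1, 1.2]
```

<p>本地的模型 txt 文件也可用同样思路逐行读取，只是词表有几十万行。</p>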
<br>
<h3 id="24-评估模型">2.4 评估模型</h3>
<p>使用近义法和类比法， 判断模型的表现。详情可查看<a href="https://cntext.readthedocs.io/zh-cn/latest/model.html">文档</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">ct</span><span class="o">.</span><span class="n">evaluate_similarity</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>

<span class="n">ct</span><span class="o">.</span><span class="n">evaluate_analogy</span><span class="p">(</span><span class="n">w2v</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">近义测试: similarity.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/similarity.txt

评估结果：
+----------+------------+----------------------------+
| 发现词语 | 未发现词语 | Spearman&#39;s Rank Coeficient |
+----------+------------+----------------------------+
|   459    |     78     |            0.43            |
+----------+------------+----------------------------+


类比测试: analogy.txt
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/cntext/model/evaluate_data/analogy.txt
Processing Analogy Test: 100%|██████████████| 1198/1198 [00:11&lt;00:00, 99.91it/s]

评估结果：
+--------------------+----------+------------+------------+----------+
|      Category      | 发现词语 | 未发现词语 | 准确率 (%) | 平均排名 |
+--------------------+----------+------------+------------+----------+
| CapitalOfCountries |   615    |     62     |   39.02    |   2.98   |
|   CityInProvince   |   175    |     0      |   28.57    |   4.74   |
| FamilyRelationship |   272    |     0      |   92.65    |   1.48   |
|   SocialScience    |    8     |     62     |   25.00    |   6.00   |
+--------------------+----------+------------+------------+----------+
</code></pre></div><p><strong>近义测试</strong>: Spearman&rsquo;s Rank Coefficient 取值范围为 [-1, 1]，取值越大，说明模型相似度排序与人工评分排序越一致，模型表现越好。</p>
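<p>近义测试的原理是比较「模型算出的相似度排序」与「人工评分的排序」的一致程度。无并列名次时，Spearman 等级相关系数为 rho = 1 - 6*Σd²/(n(n²-1))，其中 d 是两种排序的名次之差。下面用一组虚构的评分演示计算过程：</p>

```python
# Spearman 等级相关系数的最小演示(人工评分与模型打分均为虚构)
def rank(values):
    # 数值越大名次越靠前(1 为最高), 假设没有并列
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

human = [9.0, 7.5, 6.0, 3.0, 1.0]        # 人工近义度评分
model = [0.82, 0.75, 0.40, 0.45, 0.10]   # 模型余弦相似度

r1, r2 = rank(human), rank(model)
n = len(human)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 2))   # 0.9
```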
<br>
<p><strong>类比测试</strong>:</p>
<ul>
<li>CapitalOfCountries 豆瓣影评语料在此项表现尚可，可能是因为电影库中有一定比例的外国素材。</li>
<li>CityInProvince 豆瓣影评语料在此项表现较差。应该不是中国素材太少，我猜测更可能是剧中省市多以「汉东省」这类虚构名称出现。作为对照，<a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">人民网留言板语料 Word2Vec</a>中，该项准确率为 100%。</li>
<li>FamilyRelationship 豆瓣影评体现的是电影相关内容，而电影永远的主题是人性， 内容少不了家长里短，七大姑八大姨，所以此项准确率高达 92.65%。 以<a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">年报 MD&amp;A</a>为例，此处准确率只有 10%。</li>
<li>SocialScience 豆瓣影评语料在此项表现一般， 应该是语料中常见的社会科学词语提及较少。</li>
</ul>
<p>整体而言，该语料的训练效果很不错，抓住了影评数据场景独有的语义。</p>
<p><br><br></p>
<h2 id="四使用-word2vec">四、使用 Word2Vec</h2>
<h3 id="41-导入-word2vec-模型文件">4.1 导入 Word2Vec 模型文件</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="c1"># 导入模型，请注意路径。</span>
<span class="c1"># 「当前代码」 与 「output」 同处于一个文件夹内</span>

<span class="n">dm_w2v</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;output/douban-movie-1000w-Word2Vec.200.15.txt&#39;</span><span class="p">)</span>
<span class="c1"># dm_w2v = ct.load_w2v(&#39;output/douban-movie-1000w-Word2Vec.200.15.bin&#39;)</span>

<span class="n">dm_w2v</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Loading output/douban-movie-1000w-Word2Vec.200.15.txt...
&lt;gensim.models.keyedvectors.KeyedVectors at 0x314193830&gt;
</code></pre></div><br>
<h3 id="42-keyedvectors-的操作方法或属性">4.2 KeyedVectors 的操作方法(或属性)</h3>
<table>
<thead>
<tr>
<th>方法</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><em>KeyedVectors.index_to_key</em></strong></td>
<td>获取词汇表中的所有单词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.key_to_index</em></strong></td>
<td>获取单词到索引的映射。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.vector_size</em></strong></td>
<td>获取模型中词向量的维度。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.get_vector(word)</em></strong></td>
<td>获取给定单词的词向量。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_word(word, topn=10)</em></strong></td>
<td>获取某词语最相似的 10 个近义词。</td>
</tr>
<tr>
<td><strong><em>KeyedVectors.similar_by_vector(vector, topn=10)</em></strong></td>
<td>获取词向量最相似的 10 个近义词。</td>
</tr>
</tbody>
</table>
<br>
<h3 id="43-查看词表">4.3 查看词表</h3>
<p>查看词表所有单词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">index_to_key</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;电影&#39;,
 &#39;一个&#39;,
 &#39;没有&#39;,
 &#39;喜欢&#39;,
 ...
 &#39;跟着&#39;,
 &#39;意识&#39;,
 &#39;态度&#39;,
 ...]
</code></pre></div><p>为了方便查看， 这里只展示部分数据。</p>
<br>
<h3 id="44-词表映射">4.4 词表映射</h3>
<p>查看单词到索引的映射</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">key_to_index</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;电影&#39;: 0,
 &#39;一个&#39;: 1,
 &#39;没有&#39;: 2,
...
&#39;跟着&#39;: 997,
 &#39;意识&#39;: 998,
 &#39;态度&#39;: 999,
 ...}
</code></pre></div><br>
<h3 id="45-向量维度数">4.5 向量维度数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;词表有 </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dm_w2v</span><span class="o">.</span><span class="n">key_to_index</span><span class="p">)</span><span class="si">}</span><span class="s1"> 个词&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;向量是 </span><span class="si">{</span><span class="n">dm_w2v</span><span class="o">.</span><span class="n">vector_size</span><span class="si">}</span><span class="s1"> 维&#39;</span><span class="p">)</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">词表有 426646 个词
向量是 200 维
</code></pre></div><br>
<h3 id="46-获取词向量">4.6 获取词向量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([-1.24090052e+00, -6.79377019e-01,  1.42518425e+00, -1.46615291e+00,
       -9.53197628e-02,  6.50456071e-01, -2.97696137e+00,  2.20916629e+00,
        6.12876177e-01,  1.63172066e+00,  4.91760701e-01, -9
        ......
        ......
         -1.42494082e+00,  2.49131727e+00, -6.27597034e-01, -7.91438043e-01,
       -4.54898655e-01,  1.37747681e+00, -4.20672953e-01, -1.53694853e-01,
        1.04936564e+00,  2.18786263e+00, -8.07472587e-01, -8.32003877e-02],
      dtype=float32)
</code></pre></div><br>
<h3 id="47-近义词">4.7 近义词</h3>
<p>根据词语查看近义词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 近义词</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;相当给力&#39;, 0.6180022358894348),
 (&#39;太给力&#39;, 0.6019443273544312),
 (&#39;带劲&#39;, 0.5840415954589844),
 (&#39;不给力&#39;, 0.5774183869361877),
 (&#39;过瘾&#39;, 0.5616626739501953),
 (&#39;牛叉&#39;, 0.553788959980011),
 (&#39;出彩&#39;, 0.5414286851882935),
 (&#39;精彩&#39;, 0.5332293510437012),
 (&#39;看得过瘾&#39;, 0.5250197649002075),
 (&#39;大赞&#39;, 0.5205727219581604)]
</code></pre></div><br>
<p>根据向量查找最相似的近义词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">word_vector</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;给力&#39;</span><span class="p">)</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">word_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;给力&#39;, 1.0),
 (&#39;相当给力&#39;, 0.6180021166801453),
 (&#39;太给力&#39;, 0.6019443273544312),
 (&#39;带劲&#39;, 0.5840415954589844),
 (&#39;不给力&#39;, 0.5774183869361877),
 (&#39;过瘾&#39;, 0.5616626739501953),
 (&#39;牛叉&#39;, 0.5537890195846558),
 (&#39;出彩&#39;, 0.5414287447929382),
 (&#39;精彩&#39;, 0.5332292914390564),
 (&#39;看得过瘾&#39;, 0.5250197649002075)]
</code></pre></div><br>
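<p>similar_by_word 与 similar_by_vector 返回的相似度即余弦相似度：两向量夹角的余弦，取值范围 [-1, 1]，数值越大语义越接近。用几行 numpy 代码即可演示其计算（向量为虚构示例）：</p>

```python
import numpy as np

# 余弦相似度: 点积除以两向量模长的乘积
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(round(cosine(a, a), 2))    # 1.0  与自身完全相似
print(round(cosine(a, -a), 2))   # -1.0 方向完全相反
```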
<h3 id="48-计算多个词的中心向量">4.8 计算多个词的中心向量</h3>
<p>我们可以计算「宇宙」、「飞船」、「战争」三个词的语义中心向量 <strong><em>universe_vector</em></strong>，并查找与它最相似的 20 个词。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 几个词语构建的宇宙语义向量</span>
<span class="n">universe_vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">dm_w2v</span><span class="p">,</span>
                                       <span class="n">words</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;宇宙&#39;</span><span class="p">,</span> <span class="s1">&#39;飞船&#39;</span><span class="p">,</span> <span class="s1">&#39;战争&#39;</span><span class="p">])</span>


<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">universe_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;宇宙&#39;, 0.7568532228469849),
 (&#39;星系&#39;, 0.7090039253234863),
 (&#39;飞船&#39;, 0.7080673575401306),
 (&#39;人类文明&#39;, 0.6973789930343628),
 (&#39;战舰&#39;, 0.6890057325363159),
 (&#39;母舰&#39;, 0.6864359974861145),
 (&#39;星球&#39;, 0.6799622774124146),
 (&#39;卫星&#39;, 0.6799139976501465),
 (&#39;星际&#39;, 0.6789332032203674),
 (&#39;空间站&#39;, 0.6780815124511719),
 (&#39;地球&#39;, 0.6769616603851318),
 (&#39;外太空&#39;, 0.6683873534202576),
 (&#39;核战&#39;, 0.6669113039970398),
 (&#39;外星飞船&#39;, 0.6592534780502319),
 (&#39;木星&#39;, 0.6586896777153015),
 (&#39;能源&#39;, 0.6562989950180054),
 (&#39;战争&#39;, 0.6556441187858582),
 (&#39;巨兽&#39;, 0.6544537544250488),
 (&#39;月球&#39;, 0.6525537967681885),
 (&#39;一艘&#39;, 0.6521110534667969)]
</code></pre></div><p>语义捕捉得很准哦。</p>
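<p>ct.semantic_centroid 的一种常见实现思路是：先把每个词向量归一化为单位向量，再逐维求均值（以下为示意代码，并非 cntext 源码，向量为虚构的小例子）：</p>

```python
import numpy as np

def centroid(vectors):
    # 先归一化为单位向量再取均值, 避免模长大的向量主导结果
    unit = [v / np.linalg.norm(v) for v in vectors]
    return np.mean(unit, axis=0)

# 三个虚构的 4 维"词向量"
vecs = [np.array([1.0, 0.0, 0.0, 0.0]),
        np.array([0.0, 2.0, 0.0, 0.0]),
        np.array([1.0, 1.0, 0.0, 0.0])]

c = centroid(vecs)
print(np.round(c, 3))   # 前两维相等, 约 0.569
```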
<h3 id="49-概念轴">4.9 概念轴</h3>
<p>男性概念向量由多个男性词的向量求均值得到，女性概念向量算法类似。当性质或方向明显相反的两个概念向量相减，得到的新向量可以称之为<strong><em>概念轴向量 Concept Axis</em></strong>。常见的概念轴有：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 尺寸(大, 小)
- 湿度(干燥,潮湿)
- 性别(男, 女)
- 财富(富裕, 贫穷)
- 等
</code></pre></div><p>其实任意概念的向量也可以看做概念轴，相当于该概念向量与零向量相减。只不过由性质相反的两组词语构造出的概念轴，在语义上更稳定。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1"># 定义词语列表</span>
<span class="n">phy_words</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;游泳&#39;</span><span class="p">,</span> <span class="s1">&#39;跑步&#39;</span><span class="p">,</span> <span class="s1">&#39;篮球&#39;</span><span class="p">,</span> <span class="s1">&#39;羽毛球&#39;</span><span class="p">,</span> <span class="s1">&#39;马拉松&#39;</span><span class="p">,</span> <span class="s1">&#39;马术&#39;</span><span class="p">,</span> <span class="s1">&#39;徒步&#39;</span><span class="p">]</span>

<span class="n">rich_words</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s1">&#39;富裕&#39;</span><span class="p">,</span> <span class="s1">&#39;财富&#39;</span><span class="p">,</span> <span class="s1">&#39;金钱&#39;</span><span class="p">,</span> <span class="s1">&#39;豪宅&#39;</span><span class="p">,</span> <span class="s1">&#39;豪车&#39;</span><span class="p">,</span>
    <span class="s1">&#39;奢侈品&#39;</span><span class="p">,</span> <span class="s1">&#39;投资&#39;</span><span class="p">,</span> <span class="s1">&#39;股票&#39;</span><span class="p">,</span> <span class="s1">&#39;基金&#39;</span><span class="p">,</span> <span class="s1">&#39;黄金&#39;</span><span class="p">,</span>
    <span class="s1">&#39;钻石&#39;</span><span class="p">,</span> <span class="s1">&#39;游艇&#39;</span><span class="p">,</span> <span class="s1">&#39;私人飞机&#39;</span><span class="p">,</span> <span class="s1">&#39;企业家&#39;</span><span class="p">,</span> <span class="s1">&#39;富豪&#39;</span><span class="p">,</span>
    <span class="s1">&#39;成功&#39;</span><span class="p">,</span> <span class="s1">&#39;繁荣&#39;</span><span class="p">,</span> <span class="s1">&#39;奢华&#39;</span><span class="p">,</span> <span class="s1">&#39;贵族&#39;</span><span class="p">,</span> <span class="s1">&#39;高收入&#39;</span>
<span class="p">]</span>

<span class="n">poor_words</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s1">&#39;贫穷&#39;</span><span class="p">,</span> <span class="s1">&#39;贫困&#39;</span><span class="p">,</span> <span class="s1">&#39;饥饿&#39;</span><span class="p">,</span> <span class="s1">&#39;失业&#39;</span><span class="p">,</span> <span class="s1">&#39;低收入&#39;</span><span class="p">,</span>
    <span class="s1">&#39;简陋&#39;</span><span class="p">,</span> <span class="s1">&#39;破旧&#39;</span><span class="p">,</span> <span class="s1">&#39;乞丐&#39;</span><span class="p">,</span> <span class="s1">&#39;流浪&#39;</span><span class="p">,</span> <span class="s1">&#39;欠债&#39;</span><span class="p">,</span>
    <span class="s1">&#39;破产&#39;</span><span class="p">,</span> <span class="s1">&#39;困境&#39;</span><span class="p">,</span> <span class="s1">&#39;艰难&#39;</span><span class="p">,</span> <span class="s1">&#39;挣扎&#39;</span><span class="p">,</span> <span class="s1">&#39;匮乏&#39;</span><span class="p">,</span>
    <span class="s1">&#39;落后&#39;</span><span class="p">,</span> <span class="s1">&#39;无助&#39;</span><span class="p">,</span> <span class="s1">&#39;绝望&#39;</span><span class="p">,</span> <span class="s1">&#39;赤贫&#39;</span><span class="p">,</span> <span class="s1">&#39;温饱&#39;</span>
<span class="p">]</span>

<span class="n">phy_project_on_fortune</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">sematic_projection</span><span class="p">(</span><span class="n">wv</span> <span class="o">=</span> <span class="n">dm_w2v</span><span class="p">,</span>
                                               <span class="n">words</span> <span class="o">=</span> <span class="n">phy_words</span><span class="p">,</span>
                                               <span class="n">poswords</span> <span class="o">=</span><span class="n">rich_words</span><span class="p">,</span>
                                               <span class="n">negwords</span> <span class="o">=</span><span class="n">poor_words</span><span class="p">,</span>
                                               <span class="p">)</span>

<span class="n">phy_project_on_fortune</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;跑步&#39;, -1.82),
 (&#39;徒步&#39;, -0.82),
 (&#39;游泳&#39;, -0.19),
 (&#39;羽毛球&#39;, 0.57),
 (&#39;马拉松&#39;, 0.62),
 (&#39;马术&#39;, 1.15),
 (&#39;篮球&#39;, 4.0)]
</code></pre></div><p>计算结果中，数值越大越接近 poswords（富裕一端），数值越小越接近 negwords（贫穷一端）。可以看到，在财富概念轴上的投影中，篮球的结果不太符合直觉，但其他几项基本能反映出各项运动的贫富属性。</p>
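<p>概念轴投影的核心思想可以用几行 numpy 代码示意（以下为假设性实现，并非 cntext 源码）：轴向量 = 正向词均值向量 - 负向词均值向量，投影值即词向量与单位轴向量的点积。向量均为虚构的二维小例子：</p>

```python
import numpy as np

# 概念轴投影的最小示意(非 cntext 源码, 向量为虚构)
def project(word_vec, pos_vecs, neg_vecs):
    # 轴向量: 正向词中心 - 负向词中心, 再归一化
    axis = np.mean(pos_vecs, axis=0) - np.mean(neg_vecs, axis=0)
    axis = axis / np.linalg.norm(axis)
    # 投影值: 词向量在单位轴向量上的点积
    return float(np.dot(word_vec, axis))

rich = [np.array([1.0, 0.0]), np.array([0.8, 0.2])]
poor = [np.array([-1.0, 0.0]), np.array([-0.9, 0.1])]

print(project(np.array([0.5, 0.5]), rich, poor))    # 正值: 偏向富裕一端
print(project(np.array([-0.5, 0.5]), rich, poor))   # 负值: 偏向贫穷一端
```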
<br>
<h3 id="410-类比-king-man--woman--queen">4.10 类比 king-man + woman ~ queen</h3>
<p>每个词是高维向量空间中的一个点，两点可以构成有方向的向量，而向量之间可以比较方向。下面是推理过程；受限于数据，公式不一定严格成立，但思路可以类比。</p>
<p><img loading="lazy" src="img/king-queen-formular.png" alt=""  />
</p>
<p>这两个词相减，得到的应该是性别方向：雌性-&gt;雄性。</p>
<p><strong><em>gender_direction_1 = vector(man)-vector(woman)</em></strong></p>
<p><strong><em>gender_direction_2 = vector(king)-vector(queen)</em></strong></p>
<p>那两个性别方向应该近似，即 gender_direction_1 约等于 gender_direction_2 ，将其看做等式就得到如下公式：</p>
<p><strong><em>vector(理应近似 queen) = vector(king)-vector(man)+vector(woman)</em></strong></p>
<p>现在我们检查三个语义向量计算出的新的向量是否有与 queen 相关的语义信息。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">semantic_vector</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">words</span><span class="p">):</span>
    <span class="n">vector</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_centroid</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span>
                                  <span class="n">words</span><span class="o">=</span><span class="n">words</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">vector</span>


<span class="n">men_vector</span> <span class="o">=</span> <span class="n">semantic_vector</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;男&#39;</span><span class="p">,</span> <span class="s1">&#39;男孩&#39;</span><span class="p">,</span> <span class="s1">&#39;男人&#39;</span><span class="p">,</span> <span class="s1">&#39;他&#39;</span><span class="p">,</span> <span class="s1">&#39;父亲&#39;</span><span class="p">,</span> <span class="s1">&#39;爸爸&#39;</span><span class="p">,</span> <span class="s1">&#39;爷爷&#39;</span><span class="p">])</span>
<span class="n">women_vector</span> <span class="o">=</span> <span class="n">semantic_vector</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;女&#39;</span><span class="p">,</span> <span class="s1">&#39;女孩&#39;</span><span class="p">,</span> <span class="s1">&#39;女人&#39;</span><span class="p">,</span> <span class="s1">&#39;她&#39;</span><span class="p">,</span> <span class="s1">&#39;母亲&#39;</span><span class="p">,</span> <span class="s1">&#39;妈妈&#39;</span><span class="p">,</span> <span class="s1">&#39;奶奶&#39;</span><span class="p">])</span>
<span class="n">king_vector</span> <span class="o">=</span> <span class="n">semantic_vector</span><span class="p">(</span><span class="n">dm_w2v</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;皇帝&#39;</span><span class="p">,</span> <span class="s1">&#39;帝王&#39;</span><span class="p">,</span> <span class="s1">&#39;大帝&#39;</span><span class="p">])</span>
<span class="c1"># 假设 king - queen 约等于 man - woman</span>
<span class="c1"># result_vector 近似等于 king - man + woman</span>
<span class="n">result_vector</span> <span class="o">=</span> <span class="n">king_vector</span> <span class="o">-</span> <span class="n">men_vector</span> <span class="o">+</span> <span class="n">women_vector</span>
<span class="c1"># 现在检查 result_vector 的语义应该与queen相关</span>
<span class="n">dm_w2v</span><span class="o">.</span><span class="n">similar_by_vector</span><span class="p">(</span><span class="n">result_vector</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[(&#39;皇帝&#39;, 0.8448051810264587),
 (&#39;王后&#39;, 0.8056979179382324),
 (&#39;国王&#39;, 0.8004385232925415),
 (&#39;帝王&#39;, 0.7693961262702942),
 (&#39;君主&#39;, 0.7663125991821289),
 (&#39;皇后&#39;, 0.7614380717277527),
 (&#39;太后&#39;, 0.7463700175285339),
 (&#39;妃子&#39;, 0.7433678507804871),
 (&#39;君王&#39;, 0.7407413125038147),
 (&#39;皇子&#39;, 0.7380139231681824),
 (&#39;王位&#39;, 0.7319545745849609),
 (&#39;皇上&#39;, 0.7215542197227478),
 (&#39;登基&#39;, 0.7210745215415955),
 (&#39;大臣&#39;, 0.714862048625946),
 (&#39;伊丽莎白一世&#39;, 0.702217698097229),
 (&#39;王朝&#39;, 0.7000151872634888),
 (&#39;宫女&#39;, 0.6997070908546448),
 (&#39;驾崩&#39;, 0.6992778182029724),
 (&#39;王妃&#39;, 0.6981185078620911),
 (&#39;昏君&#39;, 0.6974363923072815)]
</code></pre></div><p>可以看到，由三个语义向量加减运算得到的 result_vector 与 queen（王后、皇后）语义高度相关。</p>
<p><br><br></p>
<h2 id="五获取资料">五、获取资料</h2>
<p>除了本文介绍的这个 1000w 条影评数据集，大邓还有 2 个类似的豆瓣影评数据集，影评记录量分别为 212w 条和 442w 条。下载链接全部公开，感兴趣的可以都下载下来。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 免费  douba-movie-1000w 链接: https://pan.baidu.com/s/15C0fn7oyYEFvuQtPO8tw8Q?pwd=1g7m 提取码: 1g7m
- 免费 douban-movie-1000w-Word2Vec.200.15.bin
链接: https://pan.baidu.com/s/1fK8LhLmK4_xq-eHzNn42lg?pwd=2hwr 提取码: 2hwr
- 免费 douban-movie-442w 链接: https://pan.baidu.com/s/1T_LPuxEZ_W8xfYcxV7rW5Q?pwd=a683 提取码: a683
- 免费 douban-movie-212w 链接: :https://pan.baidu.com/s/1VBwnOqfMPu_Y48bMlQ4oiw?pwd=t8id
 提取码: t8id

- 免费词向量      https://cntext.readthedocs.io/zh-cn/latest/embeddings.html
</code></pre></div><p><br><br></p>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-04-17-douban-book-3394w-ratings-comments-dataset/">数据集 | 3394w 条豆瓣书评数据集</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用 Stanford Glove 代码训练中文语料的 GloVe 模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/">可视化 | 人民日报语料反映七十年文化演变</a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用 MD&amp;A2001-2023 语料训练 Word2Vec 模型</a></li>
</ul>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集| A股上市公司基本信息2000-2023</title>
      <link>https://textdata.cn/blog/2024-04-16-china-listed-company-information-dataset/</link>
      <pubDate>Tue, 16 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-16-china-listed-company-information-dataset/</guid>
      <description>A股上市公司基本信息</description>
      <content:encoded><![CDATA[<h2 id="一数据概况">一、数据概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集: A股上市公司基本信息
年份: 2000-2023
公司数: 5504
记录数: 60901
用途: 可与年报、MD&amp;A 等数据集进行并表(合并)
</code></pre></div><p><br><br></p>
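<p>所谓并表，通常是以「股票代码 + 年份」为键，把本数据集与年报、MD&amp;A 等数据集横向合并。下面用两个虚构的小表演示合并思路（字段名与取值均为假设，实际字段以各数据集为准）：</p>

```python
import pandas as pd

# 虚构的上市公司基本信息表(字段名为假设)
info = pd.DataFrame({
    'Symbol':   ['000001', '000001', '600000'],
    'Year':     [2022, 2023, 2023],
    'Industry': ['银行', '银行', '银行'],
})

# 虚构的 MD&A 文本表
mda = pd.DataFrame({
    'Symbol':   ['000001', '600000'],
    'Year':     [2023, 2023],
    'MDA_Text': ['示例文本1', '示例文本2'],
})

# 左连接: 保留 info 的全部记录, 匹配不上的 MDA_Text 为 NaN
merged = pd.merge(info, mda, on=['Symbol', 'Year'], how='left')
print(len(merged))                        # 3
print(merged['MDA_Text'].notna().sum())   # 2
```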
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-导入数据">2.1 导入数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;上市公司基本信息2000-2023.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<p><br><br></p>
<p>如果股票代码中带的字母 A 用着别扭，可以将其剔除。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">Symbol</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;A&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="22-查看字段">2.2 查看字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 查看字段/含义</span>
<span class="n">max_col_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">col</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">])</span>
<span class="n">max_desc_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">desc</span><span class="p">))</span> <span class="k">for</span> <span class="n">desc</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;- 字段                   含义         缺失率&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">desc</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">isna</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;- </span><span class="si">{</span><span class="n">col</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">max_col_len</span><span class="si">}}</span><span class="s1">   </span><span class="si">{</span><span class="n">desc</span><span class="si">:</span><span class="s1">&lt;</span><span class="si">{</span><span class="n">max_desc_len</span><span class="si">}}</span><span class="s1">     </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">*</span><span class="mi">100</span><span class="si">}</span><span class="s1">%&#39;</span><span class="p">)</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 字段                   含义         缺失率
- Symbol                股票代码         0.0%
- ShortName             股票简称         0.0%
- EndDate               统计截止日期       0.0%
- ListedCoID            上市公司ID       0.0%
- SecurityID            证券ID         0.0%
- IndustryName          行业名称         0.0%
- IndustryCode          行业代码         0.0%
- IndustryNameC         行业名称C        0.0%
- IndustryCodeC         行业代码C        0.0%
- RegisterAddress       注册具体地址       0.0%
- OfficeAddress         公司办公地址       0.0%
- Zipcode               办公地址邮政编码     0.0%
- Secretary             董事会秘书        0.1%
- SecretaryTel          董秘联系电话       0.1%
- SecretaryFax          董秘传真         0.7%
- SecretaryEmail        董秘电子邮箱       0.7%
- SecurityConsultant    证券事务代表       17.7%
- SocialCreditCode      统一社会信用代码     23.4%
- Sigchange             重大变更         5.3%
- Lng                   办公地经度        4.6%
- Lat                   办公地纬度        4.6%
- ISIN                  ISIN编码       0.6%
- FullName              中文全称         0.0%
- LegalRepresentative   法人代表         0.0%
- EstablishDate         公司成立日期       0.0%
- Crcd                  ABH股交叉码      93.8%
- RegisterCapital       注册资本         0.0%
- Website               公司网址         4.5%
- BusinessScope         经营范围         0.0%
- RegisterLongitude     注册地经度        4.7%
- RegisterLatitude      注册地纬度        4.7%
- EMAIL                 电子邮箱         0.7%
- LISTINGDATE           首次上市日期       0.0%
- PROVINCECODE          所属省份代码       0.0%
- PROVINCE              所属省份         0.0%
- CITYCODE              所属城市代码       0.2%
- CITY                  所属城市         0.0%
- MAINBUSSINESS         主营业务         0.0%
- LISTINGSTATE          上市状态         0.0%
</code></pre></div><br>
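<p>上面逐列计算缺失率，也可以用 <code>isna().mean()</code> 一行得到所有字段的缺失比例（以下为使用虚构小数据集的示意，变量名 demo_df 仅作演示，非上文的 df）：</p>

```python
import pandas as pd

# 虚构的小数据集，模拟含缺失值的字段
demo_df = pd.DataFrame({'Symbol': ['000001', '000002', '000003'],
                        'Secretary': ['张三', None, '李四'],
                        'Crcd': [None, None, 'H12345']})

# isna().mean() 给出每列缺失比例，乘 100 转为百分比
missing = (demo_df.isna().mean() * 100).round(1)
print(missing)
```

<p>结果是一个以字段名为索引的 Series，可直接排序或筛选缺失率过高的字段。</p>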
<h3 id="23-公司数">2.3 公司数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">Symbol</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">5504
</code></pre></div><br>
<br>
<h2 id="三增加其他数据集字段数量">三、增加其他数据集字段数量</h2>
<p><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/"><strong>数据集 | 2001-2023A股上市公司年报&amp;管理层讨论与分析</strong></a> 只有 <em><strong>year</strong></em>、<em><strong>code</strong></em>、<em><strong>text</strong></em> 三个字段， 通过与本数据集合并操作(pd.merge) ，现在希望增加 <em><strong>EndDate</strong></em>、<em><strong>ShortName</strong></em>、<em><strong>IndustryCode</strong></em>、 <em><strong>RegisterAddress</strong></em> 四个字段。<br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">mda_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mda01-23.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">mda_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="n">mda_df</span>
</code></pre></div><p><img loading="lazy" src="img/mda.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#选择需要的字段进行读取</span>
<span class="n">info_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;Symbol&#39;</span><span class="p">,</span> <span class="s1">&#39;ShortName&#39;</span><span class="p">,</span> <span class="s1">&#39;EndDate&#39;</span><span class="p">,</span> <span class="s1">&#39;IndustryCode&#39;</span><span class="p">,</span> <span class="s1">&#39;RegisterAddress&#39;</span><span class="p">]]</span>

<span class="c1">#更改字段名Symbol为code</span>
<span class="n">info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Symbol&#34;</span><span class="p">:</span> <span class="s2">&#34;code&#34;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1">#根据EndDate计算会计年度year</span>
<span class="n">info_df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">info_df</span><span class="p">[</span><span class="s1">&#39;EndDate&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
<span class="n">info_df</span>
</code></pre></div><p><img loading="lazy" src="img/info_df.png" alt=""  />
</p>
<p><br><br>根据字段 <em><strong>year</strong></em>、<em><strong>code</strong></em> 进行合并，合并方式为内连接 <em><strong>inner</strong></em> ， 即两数据集的交集。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df_merge</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">mda_df</span><span class="p">,</span> <span class="n">info_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;code&#39;</span><span class="p">],</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>

<span class="c1">#保存</span>
<span class="c1">#df_merge.to_csv(&#39;合并后的数据.csv&#39;, index=False)</span>
<span class="c1">#df_merge.to_excel(&#39;合并后的数据.xlsx&#39;, index=False)</span>
<span class="n">df_merge</span>
</code></pre></div><p><img loading="lazy" src="img/merge.png" alt=""  />
</p>
<p><br><br></p>
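<p>内连接会悄悄丢弃未匹配的行。如果想检查有多少行没有匹配上，可以使用 pd.merge 的 indicator 参数（以下为使用虚构小数据的示意，字段结构仿照上文）：</p>

```python
import pandas as pd

# 虚构数据：模拟 mda_df 与 info_df 的字段结构
mda_demo = pd.DataFrame({'year': ['2020', '2020', '2021'],
                         'code': ['000001', '600000', '000001'],
                         'text': ['mda1', 'mda2', 'mda3']})
info_demo = pd.DataFrame({'year': ['2020', '2021'],
                          'code': ['000001', '000001'],
                          'ShortName': ['平安银行', '平安银行']})

# indicator=True 会新增 _merge 列，标记每行来自哪一侧数据集
checked = pd.merge(mda_demo, info_demo, on=['year', 'code'],
                   how='outer', indicator=True)
print(checked['_merge'].value_counts())
```

<p>其中 both 表示两侧都匹配到，left_only / right_only 表示只出现在某一侧的行，据此可判断 inner 连接会损失多少数据。</p>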
<h2 id="三相关内容">三、相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001-2023年A股上市公司年报&amp;管理层讨论与分析</a></li>
<li><a href="https://textdata.cn/blog/2023-01-06-mda_informative_content/">中国工业经济 | MD&amp;A信息含量指标构建代码实现</a></li>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></li>
</ul>
<br>
<br>
<h2 id="四获取数据">四、获取数据</h2>
<p>整理不易，数据集定价 50 元。需要者请加微信 372335839，备注「姓名-学校-专业」。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>LIST| 文本分析代码资料汇总</title>
      <link>https://textdata.cn/blog/text_analysis_code_list_about_ms/</link>
      <pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/text_analysis_code_list_about_ms/</guid>
      <description>如何使用Python从网络中爬取数据，如何从文本数据中抽取信息。本文汇总了常见的python代码案例，方便大家快速学习</description>
      <content:encoded><![CDATA[<p>个人感觉博客 <strong><a href="https://textdata.cn/">textdata.cn</a></strong> 文本分析代码案例都集中在这里了，我将内容按大类分成</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Python语法
- 数据采集
- 数据处理&amp;Pandas
  - 正则表达式
  - pandas常用方法
  - pandas性能优化
  - 其他操作
- 文本分析
  - 概览
  - 词典法
  - 词向量
  - 大语言模型
- 数据标注&amp;机器学习
  - 数据标注
  - 监督机器学习
  - 非监督机器学习
- 可视化
- R语言
- 其他
</code></pre></div><p><br><br></p>
<h2 id="一python语法">一、Python语法</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/30_days_of_python/">30天Python编程学习挑战</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/dadeng_python_basic_tutorial/">Python语法入门 | 含视频代码</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-07-19-advanced-python-mastery/"><strong>免费下载 | 进阶Python学习资料</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-18-how-to-use-if-elif-else-in-one-line/">如何在一行代码中实现if-elif-else三分支语句</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-01-tricks-for-better-python-code-with-examples/">12个优雅的python代码使用案例</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/course_recommendation_about_social_science/">免费社科类Python编程课程列表</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-10-bidirectional-mapping-library/">bidict库 | Python双向映射功能，让字典更好用</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-03-how-to-design-lambda-function/">如何设计好 lambda 函数 ？</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="二数据采集">二、数据采集</h2>
<ul>
<li><a href="https://textdata.cn/blog//2025-03-24-setting-chromedriver-environment-for-selenium/">资源 | 不同版本Chrome适配的chromedriver下载链接</a></li>
<li><a href="https://textdata.cn/blog/2024-06-16-scrapegraph-ai/">网络爬虫 | 使用scrapegraph-ai(大模型方案)自动采集网页数据</a></li>
<li><a href="https://textdata.cn/blog/2023-10-13-crawler-for-qyer/">网络爬虫 |  采集穷游网某城市旅游景点</a></li>
<li><a href="https://textdata.cn/blog/2023-05-07-bilibili-video-info-list/">网络爬虫 | 使用Python披露采集 Up 主视频列表详情信息</a></li>
<li><a href="https://textdata.cn/blog/2023-05-12-welcome-to-zibo-barbecue/"><strong>网络爬虫 | 批量采集话题「如何评价淄博烧烤？」的回答</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-04-23-data-collector-for-douban-group-parent-child-relationship/">网络爬虫 | 使用Python采集豆瓣「全职儿女」小组组员信息</a></li>
<li><a href="https://textdata.cn/blog/2023-04-23-data-collector-for-bilibili-danmu/">网络爬虫 | 使用Python采集B站弹幕和评论数据</a></li>
<li><a href="https://textdata.cn/blog/qdata_collect_baidu_index/">百度指数 | 使用qdata采集百度指数</a></li>
<li><a href="https://textdata.cn/blog/2022-10-08-find-sns-account-information-with-maigret/"> Maigret库 | 查询某用户名在各平台网站的使用情况</a></li>
</ul>
<p><br><br></p>
<h2 id="三数据处理pandas">三、数据处理&amp;Pandas</h2>
<h3 id="31-文本处理">3.1 文本处理</h3>
<p>使用正则表达式可以筛选文本数据，做数据预处理(数据清洗)</p>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-02-18-regex-expression-examples/">正则表达式 | 词频统计、情感分析、融资约束</a></p>
</li>
<li>
<p><a href="https://textdata.cn/2023-10-30-raw-mbti-users/">文本分析 | 使用正则表达式判别微博用户mbti类型</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-12-regex-expression-generated-by-chatgpt/">数据清洗 | 借助 chatGPT 设计正则表达式</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/"><strong>代码 | 使用地方gov工作报告生成某类概念词频「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-27-measure-gov-digitalization/">代码 | 使用gov工作报告生成数字化词频「面板数据」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/"><strong>数据代码| 使用cctv新闻联播文稿构造「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-19-word-in-context/">word_in_context | 查看某类词的上下文，更好的理解文本数据</a></p>
</li>
</ul>
<br>
<h3 id="32-常用方法">3.2 常用方法</h3>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-12-18-how-to-extract-data-from-patent-application-dataset/"><strong>代码 | 使用5112w专利申请数据集构造「面板数据」</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/">代码 | 使用「新闻数据」测量 「<em><strong>经济政策不确定性EPU指标</strong></em>」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">代码 | 使用 「MD&amp;A文本」测量「<em><strong>企业不确定性感知FEPU指标</strong></em>」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-04-26-matching-listed-corporate-with-patent-dataset/">从3571w条专利数据集「匹配」上市公司的专利信息</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-05-31-resample-groupby-in-pandas/">可视化 | 使用groupby或resample按月份分组绘制高管违规量趋势图</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-08-31-data-visualization-how-to-plot-a-map-with-geopandas/">可视化 | 使用geopandas可视化地图数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-27-cheatsheet-about-text-manipulate-in-python/">CheatSheet | Python文本数据处理速查表</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-27-pandas-dataframe-tutorial-in-python/">Pandas库 | DataFrame类常用知识点总结</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-30-query-method-in-dataframe/">Pandas库 | 使用 df.query 字符串表达式进行数据筛选</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-08-07-using-str-contains-method-to-judge-some-specific-content-in-excel/">Pandas库 | 对高管数据xlsx中的简介字段做文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-03-29-dataframe-add-sub-mul-div/">Pandas技巧 | DataFrame的四则运算</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/text_analysis_in_pandas/">使用Pandas处理文本数据</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/pandas_example_company_analysis/">Pandas小案例 | 对某公司同年的某指标批量汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-11-xiaohongshu-data-analysis/">数据分析 | 使用决策树分析小红书帖子数据(含代码)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-04-25-zhihu-parent-child-relationship/">数据分析 | 知乎热门话题「全职儿女」</a></p>
</li>
</ul>
<br>
<h3 id="33-性能优化其他操作">3.3 性能优化&amp;其他操作</h3>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-12-27-polars-tutorial-an-altertaive-of-pandas/"><strong>Polars库 | 最强 Pandas 平替来了</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-modin-accecerate-your-process/">Modin库，只需一行代码加速你的Pandas</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-19-pandarallel-speed-up-pandas/"><strong>pandarallel库 | 多核运行提升pandas速度</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>推荐 | 如何处理远超电脑内存的csv文件</strong></a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-08-pandas-5-trips-you-may-or-not-may-know/">5个你或许不知道的pandas数据导入技巧</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-11-16-how-to-fix-string-unicode-decode-error/">如何正确读入文本数据不乱码(解决文本乱码问题)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-30-pipeline-for-data-analysis/">使用流水线pipeline模式设计并处理数据</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="四文本分析">四、文本分析</h2>
<h3 id="41-概览">4.1 概览</h3>
<ul>
<li>
<p><a href="https://textdata.cn/blog/liwc_python_text_mining/">LIWC vs Python  | 文本分析之词典词频法略讲(含代码)</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/text_mining_in_accouting_research/">在会计研究中使用Python进行文本分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-15-how-to-learn-python-data-mining-with-chatgpt/">借助chatGPT更高效地学习「Python实证指标构建与文本分析」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-10-08-nlp-roadmap/">nlp-roadmap | 文本分析知识点思维脑图</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/culture_analysis/">Python与文化分析入门</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog2023-02-01-chatgpt-usage-first-time/">使用 chatGPT 撰写 Python 文本分析代码</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/chinese_emobank/">EmoBank | 中文维度情感词典</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-21-chinese-traditional-to-simplified-text/">opencc | 中文简体、繁体转换库</a></p>
</li>
</ul>
<br>
<h3 id="42-词典法">4.2 词典法</h3>
<ul>
<li><a href="https://textdata.cn/blog/cntext_tutorial/">cntext库 | 中文情感分析包</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 中文文本分析cntext2.x库使用手册</a></li>
<li><a href="https://textdata.cn/blog/weighted_tfidf_sentiment_analysis/">tfidf有权重的情感分析</a></li>
<li><a href="https://textdata.cn/blog/asent_sentiment_analysis/">Asent库 | 英文文本数据情感分析</a></li>
<li><a href="https://textdata.cn/blog/share_your_dict_to_cntext/">欢迎各位向cntext库分享情感词典</a></li>
<li><a href="https://textdata.cn/blog/chinese_financial_dictionary/">中文金融情感词典</a></li>
<li><a href="https://textdata.cn/blog/how_chinese_tmtai_impact_corporate_inovation/">文本分析 | 中国企业高管团队创新注意力</a></li>
</ul>
<br>
<h3 id="43-社交网络分析">4.3 社交网络分析</h3>
<ul>
<li><a href="https://textdata.cn/blog//2024-04-12-semantic-brand-score/">文献&amp;代码 | 使用Python计算 <strong>语义品牌评分(Semantic Brand Score)</strong></a></li>
</ul>
<br>
<h3 id="44-词向量">4.4 词向量</h3>
<ul>
<li><a href="https://textdata.cn/blog/2025-04-23-word-embedding-reflect-human-attitude/">文化几何学：通过词嵌入分析反映文本背后的社会文化(变迁)</a></li>
<li><a href="https://textdata.cn/blog/2025-03-28-train_a_glove_model_on_chinese_corpus_using_stanfordnlp/">实验 | 使用Stanford Glove代码训练中文语料的Glove模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-visualize-the-culture-change-using-people-daily-dataset/"><strong>可视化 | 人民日报语料反映七十年文化演变</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-04-26-chinese-it-industry-slangs-words/"><strong>实验 | 互联网黑话与MD&amp;A</strong></a></li>
<li><a href="https://textdata.cn/blog/douban_w2v/">豆瓣影评 | 探索词向量妙处</a></li>
<li><a href="https://textdata.cn/blog/2023-11-12-using-100m-bilibili-user-sign-data-to-training-word2vec/">词向量 | 使用1亿B站用户签名训练word2vec词向量</a></li>
<li><a href="https://textdata.cn/blog/2022-11-07-embeddings-theory-applicaiton-liuhuanyong/">预训练词向量模型的方法、应用场景、变体延伸与实践总结</a></li>
<li><a href="https://textdata.cn/blog/2022-10-16-python-word-mover-s-distance/"> Python | 词移距离(Word Mover&rsquo;s Distance)</a></li>
<li><a href="https://textdata.cn/blog/2022-11-22-glove-embeddings-model/">训练&amp;使用 Glove 语言模型， 可度量刻板印象等</a></li>
<li><a href="https://textdata.cn/blog/bertopic_tutorial/">BERTopic库 | 使用预训练模型做话题建模</a></li>
<li><a href="https://textdata.cn/blog/2022-12-03-dynamic_topic_model_with_bertopic/">BERTopic | 使用推特数据构建 <strong>动态主题模型模</strong></a></li>
<li><a href="https://textdata.cn/blog/keybert_tutorial/">KeyBERT | 关键词发现库</a></li>
<li><a href="https://textdata.cn/blog/top2vec_tutorial/">Top2Vec | 主题建模和语义搜索库</a></li>
<li><a href="https://textdata.cn/blog/2022-11-17-finbert-finance-bert-model/">FinBERT | 金融文本BERT模型，可情感分析、识别ESG和FLS类型</a></li>
<li><a href="https://textdata.cn/blog/sentence-transformer-tutorial/">sentence-transformer库 | 句子语义向量化</a></li>
<li><a href="https://textdata.cn/blog/wordbias/">WordBias库 | 发现偏见(刻板印象)的交互式工具</a></li>
<li><a href="https://textdata.cn/blog/2023-10-27-nlp_gte_sentence-embedding_chinese/">GTE中文通用文本向量表示模型</a></li>
<li><a href="https://textdata.cn/blog/shifterator_text_vis/">Shifterator库 | 词移图分辨两文本用词风格差异</a></li>
</ul>
<br>
<h3 id="44-大语言模型">4.4 大语言模型</h3>
<ul>
<li><a href="https://textdata.cn/blog/2023-02-23-simplet5-one-line-summary/">simpleT5 库 | 根据英文摘要内容生成标题</a></li>
<li><a href="https://textdata.cn/blog/2023-11-20-how-to-use-llms-tobuild-better-clustering-models/">以聚类为例 | 使用大语言模型LLM做文本分析</a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/">教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用大模型从文本中提取结构化信息</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-07-10-using-large-language-model-to-build-diy-dictionary/"><strong>实验 | 使用本地大模型DIY制作单词书教案PDF</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-05-create-a-blog-writer-multi-agent-system-using-crewai-and-ollama/"><strong>实验 | 使用 Crewai 和 Ollama 构建智能体(AI Agent)帮我撰写博客文章</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-06-using-the-ollama-local-large-model-to-predict-the-sentiment-category-of-online-comments/"><strong>实验 | 使用本地大模型预测在线评论情感类别</strong></a></li>
</ul>
<p><br><br></p>
<h2 id="五提取特征机器学习">五、提取特征&amp;机器学习</h2>
<h3 id="51--监督机器学习">5.1  监督机器学习</h3>
<ul>
<li><a href="https://textdata.cn/blog/ml_credit_card_fraud_detection/">机器学习实战 | 信用卡欺诈检测</a></li>
<li><a href="https://textdata.cn/blog/speed_up_sklearn_code_with_sklearnex/">sklearnex库 | 让你的scikit-learn代码加速百倍</a></li>
<li><a href="https://textdata.cn/blog/label_studio_test/">Label-Studio|多媒体数据标注工具</a></li>
<li><a href="https://textdata.cn/blog/doccano_text_anotation/">doccano|为机器学习建模做数据标注</a></li>
</ul>
<br>
<h3 id="52-非监督机器学习">5.2 非监督机器学习</h3>
<ul>
<li>
<p><a href="https://textdata.cn/blog/hierarchy_dendrogram_tutorial/">使用scipy实现层次聚类分析</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/svd_in_recommendation_system/">推荐系统与协同过滤、奇异值分解</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/customer_segment_with_kmeans/">实战 | 构建基于客户细分的 K-Means 聚类算法！</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-14-using-lda-to-predict-topic/">代码 | 使用LDA预测文本的话题类型</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-04-25-tomotopy_is_the_fastest_topic_model/">tomotopy库 | 速度最快的LDA主题模型</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="六可视化">六、可视化</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-05-how-to-show-chinese-in-matplotlib-plotnine/">可视化 | 如何在matplotlib中显示中文</a></li>
<li><a href="https://textdata.cn/blog/2024-05-14-add-readpdf-readdocx-lexical-dispersion-plot/">cntext2.x | 新增读取pdf/docx| 提取MD&amp;A | 文本可视化等功能</a></li>
<li><a href="https://textdata.cn/blog/2024-01-23-umap/"><strong>可视化 | 使用umap对200维词向量的进行降维和可视化</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-11-25-r-patchwork/">使用patchwork包进行多图排版</a></li>
<li><a href="https://textdata.cn/blog/2024-01-21-datamapplot/"><strong>可视化 | 使用 DataMapPlot 绘制数据地图</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-05-11-bilibili-dongbei-big-brother/">B站 | &ldquo;高铁互殴&quot;视频词云图绘制</a></li>
<li><a href="https://textdata.cn/blog/2023-03-22-bedtime-topic_model_visualization/">可视化 | 睡前消息的科学社会、科学技术、社会化抚养话题可视化</a></li>
<li><a href="https://textdata.cn/blog/whatlies_word2vec/">可视化 | 使用whatlies库可视化词向量</a></li>
<li><a href="https://textdata.cn/blog/2022-11-29-santi-relationship-visualization-with-pyecharts/">可视化 | 绘制《三体》人物关系网络图</a></li>
<li><a href="https://textdata.cn/blog/2023-04-03-visualization-wordcloud-similarity-for-santi/">可视化 | 文本数据分成n等份、词云图、情绪变化趋势、相似度变化趋势</a></li>
<li><a href="https://textdata.cn/blog/2023-05-18-weibo-sentiment-score-line-plot/">可视化 | 微博用户群体情绪随时间变化趋势</a></li>
<li><a href="https://textdata.cn/blog/2023-02-11-chatgpt-plus-for-text-mining/">可视化 | 使用 chatGPT 做词频统计&amp;词云图</a></li>
<li><a href="https://textdata.cn/blog/2023-08-28-best-practice-netflix-data-visualization/"><strong>可视化（推荐） | Netflix 数据可视化最佳实践</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-08-31-data_eda_2021_happiness_and_population/"><strong>可视化 | 2021年幸福指数&amp;人口数据可视化最佳实践</strong></a></li>
<li><a href="https://textdata.cn/blog/pyplutchik_emotion_circle/">可视化 | 使用PyPlutchik库可视化文本的情绪轮(情绪指纹)</a></li>
<li><a href="https://textdata.cn/blog/2023-02-11-pyanimate-create-vis-video/">可视化 | 使用pynimate库绘制动态可视化图</a></li>
<li><a href="https://textdata.cn/blog/2022-12-10-lovelyplots/">可视化 | 使用LovelyPlots库绘制科学论文、论文和演示文稿的可视化图形</a></li>
<li><a href="https://textdata.cn/blog/2023-06-02-r-ggdag/">可视化 | 使用ggdag包绘制有向图</a></li>
<li><a href="https://textdata.cn/blog/2023-04-13-prettymaps/">prettymaps库 | 绘制绝美地图</a></li>
</ul>
<p><br><br></p>
<h2 id="七r语言">七、R语言</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-25-ppsr-predictive-power-sccore/">相关性分析 | 从模型预测出发挖掘更多特征之间的关系</a></li>
<li><a href="https://textdata.cn/blog/2022-09-04-r-ggplot2-scatter/">R语言 | ggplot2简明绘图之散点图</a></li>
<li><a href="https://textdata.cn/blog/2022-09-04-r-ggplot2-histogram/">R语言 | ggplot2简明绘图之直方图</a></li>
<li><a href="https://textdata.cn/blog/2022-09-04-r-ggplot2-ggplotly/">R语言 | ggplot2简明绘图之动态图</a></li>
<li><a href="https://textdata.cn/blog/2022-09-04-posterdown/">R语言 | 使用posterdown包制作学术会议海报</a></li>
<li><a href="https://textdata.cn/blog/2022-09-20-r-ggsci/">R语言 | 使用ggsci包绘制sci风格图表</a></li>
<li><a href="https://textdata.cn/blog/2022-09-20-r-ggplot2-ggpubr/">R语言 | ggpubr包让数据可视化更加优雅</a></li>
<li><a href="https://textdata.cn/blog/2022-09-21-r-easystats-report/">R语言 | 让统计更easy的easystats集合包</a></li>
<li><a href="https://textdata.cn/blog/2022-10-07-r-shiny-reactive/">R语言 | 使用shiny的reactive表达式写应用程序</a></li>
<li><a href="https://textdata.cn/blog/2022-10-07-r-stargazer/">R语言 | 使用stargazer包输出格式化回归结果</a></li>
<li><a href="https://textdata.cn/blog/2022-10-12-r-word2vec/">R语言 | 使用word2vec词向量模型</a></li>
<li><a href="https://textdata.cn/blog/2023-01-20-visualization-of-sentiment-analysis-of-historical-text-data-with-r/">R语言 | 绘制文本数据情感历时趋势图</a></li>
</ul>
<p><br><br></p>
<h2 id="八其他">八、其他</h2>
<ul>
<li>
<p><a href="https://textdata.cn/2024-04-21-tqdm-progress-bar/">tqdm库 | Python中实现进度条的几种方式</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-16-cpca-china-province-city-area/"> cpca库 | 中国省、市区划匹配库</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/causal_inference/">causalinference库 | 使用Python做因果推断</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-11-26-using-ruptures-to-detect-change-point/">使用 Ruptures 识别时间序列数据中的变化点</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-31-using-poetry-to-manage-your-project-env/">硬核 | 使用Poetry发布Python库到PyPi的方法</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/karateclub_tutorial/">karateclub库 | 计算社交网络中节点的向量</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-21-create-brower-data-label-tools-with-nicegui/">NiceGUI库 | 简单易懂的Web GUI开发包； 可开发数据标注工具、心理学实验工具等</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-09-01-how_to_use_tinytex/">Latex | 为Rmarkdown配置tinytex环境</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-13-add-cls-to-tex-global-enviroment-path/">Latex | 将 .cls 更新到本地 Tex 发行版的搜索路径</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2022-11-25-faker-generate-test-data/">Faker库 | 生成实验数据</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="九工具">九、工具</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2024-05-27-pychram-professional-installation-and-usage"> 图文 | PyCharm专业版下载&amp;安装&amp;激活</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-01-31-langchain-chatchat/">使用 Langchain-Chatchat 搭建本地知识库问答系统</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-16-free-chatgpt-list/">免费可用的chatGPT镜像站点清单</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-03-26-chatgpt-for-jupyter/">在 Jupyter Notebook 内使用 ChatGPT 服务</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-15-how-to-sign-up-the-chatgpt-accout-and-upgrade-to-plus/">如何注册chatGPT账号</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-11-credit_card_for_chatgpt-plus/">使用虚拟信用卡，国内用户升级为chatGPT plus会员</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-15-write-web-scraper-with-chatgpt/">使用 chatGPT 写 Python 网络爬虫</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-01-18-rath-next-generation-business-intelligence/">Rath | 自动化数据分析工具</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-01-v2net-science-network/">科学上网工具v2net</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="广而告之">广而告之</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>文献&amp;代码 | 使用Python计算语义品牌评分(Semantic Brand Score)</title>
      <link>https://textdata.cn/blog/2024-04-12-semantic-brand-score/</link>
      <pubDate>Fri, 12 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-12-semantic-brand-score/</guid>
      <description>Semantic Brand Score</description>
      <content:encoded><![CDATA[<h2 id="一语义品牌评分">一、语义品牌评分</h2>
<p><strong>语义品牌评分(SBS)</strong>  是一种新颖的指标，可以通过文本语料，衡量(评估)不同环境下一个或多个品牌的 <strong>品牌重要性</strong>。 <br></p>
<blockquote>
<p>Colladon, Andrea Fronzetti. &ldquo;<em><strong>The semantic brand score</strong></em>.&rdquo; <em>Journal of Business Research</em> 88 (2018): 150-160.</p>
</blockquote>
<br>
<p>相对于一些传统测量方法，SBS 的优点是不依赖于对小样本消费者的调查，能够捕捉到真实可信的信号。该指标可以<strong>对任意来源的文本进行计算</strong>，例如报纸文章、电子邮件、推文、在线论坛、博客和社交媒体上的帖子。如果研究景点品牌的重要性，可以从消费者或其他品牌利益相关者通常出现的地方（例如旅游论坛）收集他们发表的信息。这样做可以减少问卷调查带来的偏差，因为问卷受访者知道自己正在被观察，回答容易失真。SBS 还可以适应不同的语言，并可用于研究特定单词或单词集（不一定是“品牌”）的重要性。</p>
<p>这里的“品牌”可以泛指政治家的名字，或代表某一概念的一组词（例如“创新”概念或企业核心价值观）。该指标可用于评估新品牌取代旧品牌时的过渡动态，也可用于将某品牌的重要性与其竞争对手比较，或分析单个品牌重要性的时间趋势。在某些应用中，该分数被证明具有预测价值：例如，研究发现在线媒体中政治候选人的品牌重要性与选举结果之间存在联系，景点品牌的重要性与游客数量趋势之间也存在联系。</p>
<p><img loading="lazy" src="img/sbs-trend-plot.jpg" alt=""  />
</p>
<p><br><br></p>
<h2 id="二品牌重要性的三个维度">二、品牌重要性的三个维度</h2>
<p>SBS 衡量 <strong>品牌重要性</strong> ，这是品牌资产的基础(Fronzetti Colladon， 2018)。事实上，该指标的部分灵感来自于众所周知的品牌资产概念以及品牌形象和品牌意识的构建（Keller, 1993）。 品牌重要性通过三个维度来衡量：<strong>流行度</strong>、<strong>多样性</strong> 和 <strong>连通性</strong>。</p>
<ul>
<li><strong>流行度(Prevalence)</strong>   衡量品牌名称的使用频率，即直接提及品牌的次数。</li>
<li><strong>多样性(Diversity)</strong>  衡量与品牌相关的词语的多样性。</li>
<li><strong>连接性(Connectivity)</strong>  代表品牌在其他单词或单词组（有时被视为话语主题）之间建立联系的能力。</li>
</ul>
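<p>三个维度中，流行度和多样性可以用很少的代码粗略示意。以下玩具代码为本文的示意，并非 Colladon 的原始实现；其中示例语料、品牌词和窗口大小均为虚构，连通性需在完整共现网络上计算加权介数中心性（例如借助 networkx），此处从略：</p>

```python
docs = [
    "罗辑 提出 黑暗 森林 理论",
    "黑暗 森林 理论 震惊 世界",
    "罗辑 成为 执剑人",
]

brand = "罗辑"
prevalence = 0                      # 流行度：品牌词出现次数
neighbors = set()                   # 多样性：与品牌共现的不同词
window = 2                          # 共现窗口（示意值，原文默认 co-range=7）

for doc in docs:
    tokens = doc.split()
    for i, tok in enumerate(tokens):
        if tok == brand:
            prevalence += 1
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbors.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)

diversity = len(neighbors)
print(prevalence, diversity)
# 2 4
```

<p>真实计算时三个维度还需分别标准化后再求和，得到最终的 SBS 得分。</p>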
<p><br><br></p>
<h2 id="三文本分析步骤">三、文本分析步骤</h2>
<p><strong>语义品牌得分(SBS)</strong>  的计算需要结合文本挖掘和社交网络分析的方法和工具。下图说明了主要的初步步骤，包括数据收集、文本预处理和单词共现网络的构建。</p>
<p><img loading="lazy" src="img/text-preprocess.jpg" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1. 准备文本数据
2. 文本预处理(剔除标点符号、剔除特殊字符、剔除html标签、剔除#@等符号、剔除停用词)
3. 英文小写、分词、合并同类项(类似于is、was、are都合并到be)
4. 从文本信息中构建共现语义网络(确定词语上下文范围，涉及到co-range， 默认co-range=7)
5. 剔除共现语义网络中不重要的边(联系，涉及到参数link_filter， 默认link_filter=2)
</code></pre></div><br>
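<p>上面第 4、5 步可以用一个滑动窗口的共现计数来示意（参数名 co_range、link_filter 与上文对应，窗口的具体定义与 cntext 内部实现可能存在差异）：</p>

```python
from collections import Counter

def build_cooccurrence(tokens, co_range=7, link_filter=2):
    """在宽度为 co_range 的滑动窗口内统计词语共现, 剔除权重低于 link_filter 的边(示意)"""
    edges = Counter()
    for i, w in enumerate(tokens):
        # 只统计当前词之后的 co_range-1 个词, 避免同一对词被重复计数
        for other in tokens[i + 1 : i + co_range]:
            if w != other:
                edges[tuple(sorted((w, other)))] += 1
    # link_filter: 剔除共现次数过低的弱联系
    return {edge: n for edge, n in edges.items() if n >= link_filter}

tokens = ['叶文洁', '红岸', '基地', '叶文洁', '红岸']
net = build_cooccurrence(tokens, co_range=3, link_filter=2)
```

<p>返回结果即共现语义网络的加权边表，可进一步交给 networkx 等工具构图。</p>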
<br>
<h2 id="四实验">四、实验</h2>
<p>以三体为例，分析小说中5个角色的语义品牌评分（类比于文本中分析品牌的重要性）。我们将小说等分为20份，希望得到角色语义品牌评分随小说进度变化的趋势。</p>
<p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<br>
<h3 id="41-读取数据">4.1 读取数据</h3>
<p>三体小说全文约 2.5M</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="k">def</span> <span class="nf">read_txt</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">num_segments</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">):</span>
    <span class="c1"># 读取txt文件</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s2">&#34;r&#34;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="n">encoding</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
    
    <span class="c1"># 获取文本的总长度和每一段的长度</span>
    <span class="n">total_length</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">segment_length</span> <span class="o">=</span> <span class="n">total_length</span> <span class="o">//</span> <span class="n">num_segments</span>
    
    <span class="c1"># 将文本分割成指定数量的段落</span>
    <span class="n">segments</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_segments</span><span class="p">):</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">i</span> <span class="o">*</span> <span class="n">segment_length</span>
        <span class="n">end</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">segment_length</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">num_segments</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">end</span> <span class="o">=</span> <span class="n">total_length</span>
        <span class="n">segment</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span>
        <span class="n">segments</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">segment</span><span class="p">)</span>

    <span class="c1"># 将内容存储在数据框中</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">segments</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;docs&#34;</span><span class="p">])</span>
    
    <span class="k">return</span> <span class="n">df</span>


<span class="c1">#分成20份</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">read_txt</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;三体全集.txt&#39;</span><span class="p">,</span> <span class="n">num_segments</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="42-计算sbs">4.2 计算SBS</h3>
<p>语义品牌评分SBS已经封装到  <em><strong>cntext</strong></em> 中</p>
<br>
<h4 id="421-安装cntext">4.2.1 安装cntext</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext --upgrade
</code></pre></div><p><br><br></p>
<h4 id="422--开始计算">4.2.2  开始计算</h4>
<p><em><strong>2.5M</strong></em> 的三体小说文本，全部运行下来大约需要 10-20min，可见 SBS 计算比较耗时。为节省时间，我们先以三体小说第一份（等分20份中的第一份）做个小实验。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">brands</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;汪淼&#39;</span><span class="p">,</span> <span class="s1">&#39;史强&#39;</span><span class="p">,</span> <span class="s1">&#39;罗辑&#39;</span><span class="p">,</span> <span class="s1">&#39;叶文洁&#39;</span><span class="p">,</span> <span class="s1">&#39;伊文斯&#39;</span><span class="p">]</span>

<span class="c1">#小说第一份文本（等分20份中的第一份）</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;docs&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1">#如果不用三体， 只想分析某个txt，以data.txt为例</span>
<span class="c1">#text = open(&#39;data.txt&#39;).read()</span>

<span class="n">sbs_df0</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_brand_score</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> 
                               <span class="n">brands</span><span class="o">=</span><span class="n">brands</span><span class="p">,</span> 
                               <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
<span class="n">sbs_df0</span><span class="p">[</span><span class="s1">&#39;doc_idx&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">sbs_df0</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<p><br> 运行没有问题，现在对整部小说进行实验，计算五个角色的 SBS 随时间的变化。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>  <span class="c1">#记录时间</span>

<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">brands</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;汪淼&#39;</span><span class="p">,</span> <span class="s1">&#39;史强&#39;</span><span class="p">,</span> <span class="s1">&#39;罗辑&#39;</span><span class="p">,</span> <span class="s1">&#39;叶文洁&#39;</span><span class="p">,</span> <span class="s1">&#39;伊文斯&#39;</span><span class="p">]</span>
<span class="n">sbs_dfs</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;docs&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
    <span class="n">sbs_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">semantic_brand_score</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> 
                              <span class="n">brands</span><span class="o">=</span><span class="n">brands</span><span class="p">,</span> 
                              <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>
    <span class="n">sbs_df</span><span class="p">[</span><span class="s1">&#39;doc_idx&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span>
    <span class="n">sbs_dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">sbs_df</span><span class="p">)</span>
    
<span class="n">SBS_DFs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">sbs_dfs</span><span class="p">)</span>
<span class="n">SBS_DFs</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0
WARNING: Loops will be ignored.
1
WARNING: Loops will be ignored.
2
WARNING: Loops will be ignored.
3
WARNING: Loops will be ignored.
4
WARNING: Loops will be ignored.
5
WARNING: Loops will be ignored.
6
WARNING: Loops will be ignored.
7
WARNING: Loops will be ignored.
8
WARNING: Loops will be ignored.
9
WARNING: Loops will be ignored.
10
WARNING: Loops will be ignored.
11
WARNING: Loops will be ignored.
12
WARNING: Loops will be ignored.
13
WARNING: Loops will be ignored.
14
WARNING: Loops will be ignored.
15
WARNING: Loops will be ignored.
16
WARNING: Loops will be ignored.
17
WARNING: Loops will be ignored.
18
WARNING: Loops will be ignored.
19
WARNING: Loops will be ignored.

CPU times: user 10min 9s, sys: 8.53 s, total: 10min 17s
Wall time: 10min 19s
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="43-可视化sbs">4.3 可视化SBS</h3>
<p>可视化三体小说五个角色重要性（语义品牌评分， SBS）随时间 (文本字符位置) 变化趋势</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">SBS_DFs</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">SBS_DFs</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;index&#39;</span><span class="p">:</span> <span class="s1">&#39;Brand&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">SBS_DFs</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />

<br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>


<span class="k">for</span> <span class="n">brand</span><span class="p">,</span> <span class="n">brand_df</span> <span class="ow">in</span> <span class="n">SBS_DFs</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;Brand&#39;</span><span class="p">):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">brand_df</span><span class="o">.</span><span class="n">doc_idx</span><span class="p">,</span> <span class="n">brand_df</span><span class="o">.</span><span class="n">SBS</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">brand</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">brand_df</span><span class="o">.</span><span class="n">doc_idx</span><span class="p">,</span> <span class="n">brand_df</span><span class="o">.</span><span class="n">SBS</span><span class="p">)</span>
    
    
    
    
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;三体人物角色的语义品牌评分(semantic brand score)趋势&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;小说字符位置(小说等分为20份)&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;Semantic Brand Score&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper right&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>    
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="五-获取资源">五、 获取资源</h2>
<p>内容整理不易，如果对本文感兴趣，可通过以下链接获取代码与实验数据</p>
<ul>
<li><em><strong>免费</strong></em>   获取本文代码&amp;实验数据  链接: <a href="https://pan.baidu.com/s/1ut8bKDxd5PGL_dm_yXTzcA?pwd=tr3t">https://pan.baidu.com/s/1ut8bKDxd5PGL_dm_yXTzcA?pwd=tr3t</a> 提取码: tr3t</li>
</ul>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="相关资料">相关资料</h2>
<p>Colladon, Andrea Fronzetti. &ldquo;<em><strong>The semantic brand score</strong></em>.&rdquo; <em>Journal of Business Research</em> 88 (2018): 150-160.</p>
<p>SBS相关文章列表  <a href="https://semanticbrandscore.com/sbsarticles.html">https://semanticbrandscore.com/sbsarticles.html</a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 372w政府采购合同公告明细数据（2024.03）</title>
      <link>https://textdata.cn/blog/2023-09-03-government-procurement-contract-data/</link>
      <pubDate>Wed, 10 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-09-03-government-procurement-contract-data/</guid>
      <description>&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 数据来源: 中国政府采购网（www.ccgp.gov.cn）
- 记录数量: 3724395
- 发布时间: 1996-06-05 ~ 2024-03-07, 但主要是2015之后


声明: 科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二应用&#34;&gt;二、应用&lt;/h2&gt;
&lt;p&gt;随着政府采购规模的逐步增加，中国政府采购网披露的信息越来越丰富。近年来一些学者也关注到中国政府采购数据，但由于文本数据半结构化、高维、数据量大的特性，该数据在文本整理、关键变量识别与提取方面存在不小的难度，目前使用该数据的研究并不多。&lt;/p&gt;
&lt;h3 id=&#34;21-创新&#34;&gt;2.1 创新&lt;/h3&gt;
&lt;p&gt;姜爱华和费堃桀（2021）手工整理了 2015-2019 年的政府采购数据，利用公告中供应商的名称与上市公司全称进行匹配，最终得到了 13004 个企业年度观测值，发现企业获得政府采购订单能够显著促进企业创新。&lt;/p&gt;
&lt;p&gt;Beraja 等（2020）基于 2013-2019 年政府采购合同，与中国人工智能企业进行名单匹配，得到 28023 份政府人脸识别采购合同样本，发现政府采购对人脸识别相关的人工智能专利的增长起到了推动作用。&lt;/p&gt;
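&lt;p&gt;上述研究的共同思路，是将公告中的供应商名称与上市公司全称（或特定企业名单）进行匹配。下面用 pandas 给出一个极简示意（firms、contracts 两个表及其列名均为演示假设，实际研究通常还需对企业名称做清洗与模糊匹配）：&lt;/p&gt;

```python
import pandas as pd

# 演示数据: firms 为上市公司名单, contracts 为采购合同公告明细(均为假设)
firms = pd.DataFrame({'公司全称': ['甲公司', '乙公司'], '股票代码': ['000001', '000002']})
contracts = pd.DataFrame({'供应商(乙方)': ['甲公司', '丙公司'], '合同金额(万元)': [100.0, 50.0]})

# 以供应商名称与上市公司全称精确匹配, 得到"企业-合同"样本
matched = contracts.merge(firms, left_on='供应商(乙方)', right_on='公司全称', how='inner')
```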
&lt;br&gt;
&lt;h3 id=&#34;22-政企关系&#34;&gt;2.2 政企关系&lt;/h3&gt;
&lt;p&gt;Fang 等（2022）利用中国政府采购网 2013-2020 年的采购公告与工商注册企业数据进行匹配，发现当本地官员处于激烈的政治竞争中时，本地政府将更少地向竞争地区的企业进行采购，这造成了市场分割，影响了资源分配。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-其他&#34;&gt;2.3 其他&lt;/h3&gt;
&lt;p&gt;政府采购影响企业履行企业社会责任（韩旭和武威，2021）、中国特色精准扶贫（武威等，2022）、经济发展（武威和刘国平，2021）等。此外，还有研究单独使用政府采购数据测量经济生产生活。江鸿泽和梁平汉（2022）基于政府采购公告整理了各地的公共视频监控系统使用情况，Liu 等（2022）则抓取了 2013-2021 年政府采购公告，用以识别企业的政治联系。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;政府采购公告1996-2024.3.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#gz文件可用bandizp或winrar解压得到csv&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;政府采购公告1996-2024.3.csv&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-记录数&#34;&gt;2.2 记录数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;数据集记录数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;数据集记录数:  3724395
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-字段&#34;&gt;2.3 字段&lt;/h3&gt;
&lt;p&gt;数据所含字段&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;col&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;合同编号
合同名称
项目编号
项目名称
采购人(甲方)
采购人地址
采购人联系方式
供应商(乙方)
供应商地址
供应商联系方式
主要标的名称
规格型号或服务要求
主要标的数量
主要标的单价
合同金额(万元)
履约期限、地点等简要信息
采购方式
合同签订日期
合同公告日期
其他补充事宜
所属地域
所属行业
代理机构
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-公告日期&#34;&gt;2.4 公告日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#数据集公告日期起止&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;发布时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;发布时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;发布时间 1996-06-05 00:00:00
发布时间 2024-03-07 00:00:00
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#政府采购合同公告数据，主要出现在2015年之后&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;合同公告日期
1996          1
2000          1
2002          2
2004          7
2008          5
2009          3
2010          2
2011         13
2012          3
2013          4
2014         24
2015      15543
2016      42195
2017      94193
2018     154922
2019     151181
2020     187874
2021     549078
2022    1060710
2023    1355749
2024     112885
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;梁平汉和郭宇辰(2023) 认为 &lt;strong&gt;2015年财政部相关采购信息发布文件出台之后采购公告上传率大幅上升至80%以上，因此采用2015年以后的中国政府采购网数据进行研究更为合适&lt;/strong&gt;。&lt;/p&gt;
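&lt;p&gt;据此，用公告日期筛掉 2015 年之前的样本即可（以下为最小示意，日期数据为演示假设）：&lt;/p&gt;

```python
import pandas as pd

demo_df = pd.DataFrame({'合同公告日期': ['2014-05-01', '2015-03-02', '2023-08-09']})
demo_df['合同公告日期'] = pd.to_datetime(demo_df['合同公告日期'])

# 仅保留 2015 年及之后的合同公告
df_after_2015 = demo_df[demo_df['合同公告日期'].dt.year >= 2015]
```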
&lt;br&gt;
&lt;h3 id=&#34;24-甲乙方人数&#34;&gt;2.5 甲(乙)方人数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#甲方乙方数量&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人(甲方)数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人(甲方)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;供应商(乙方)数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;供应商(乙方)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;采购人(甲方)数:  234082
供应商(乙方)数:  499943
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验代码&#34;&gt;三、实验代码&lt;/h2&gt;
&lt;h3 id=&#34;31-是否含某类词&#34;&gt;3.1 是否含某(类)词&lt;/h3&gt;
&lt;p&gt;根据公告中是否出现某(类)词，可以提取一些指标。例如 Beraja 等（2020）基于 2013-2019 年政府采购合同，与中国人工智能企业进行名单匹配，得到 28023 份政府人脸识别采购合同样本。本文仅做简单示范，以 &lt;em&gt;&lt;strong&gt;人工智能&lt;/strong&gt;&lt;/em&gt; 相关词为例&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能|自然语言处理|自动驾驶|AI|ai&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;0          False
1          False
2          False
3          False
4          False
           ...  
3724390    False
3724391    False
3724392    False
3724393    False
3724394    False
Name: 合同名称, Length: 3724395, dtype: bool
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#AI相关公告的数量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能|自然语言处理|自动驾驶|AI|ai&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1323
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#显示匹配到的与 AI 有关的【合同名称】&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人工智能|自然语言处理|自动驾驶|AI|ai&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;合同名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1129                                     贵州大学人工智能研究院建设项目采购合同
4935            龙岩初级中学人工智能创客实验室设备货物类采购项目合同\n （macrodatas.cn）
12231                       中国医学科学院系统医学研究院人工智能高性能计算设备采购合同协议书
13171      双高基于AIoT轨道交通智慧运维环境信号检测分析设备购置(二次)\n\n微信公众号“马克 数据网”
16921                           广州国际生物岛自动驾驶新能源环卫作业创新试点服务采购项目
                                 ...                        
3708596               榆林市教育技术中心人工智能助推教师队伍建设-教师发展智慧管理平台建设项目合同
3708922        邢台市信都区“人工智能公共技术服务平台”项目一标段数字教育、数字文旅采购合同\n\n （）
3709875         吴忠市第三中学南湖校区AI课堂教学行为分析评测系统及智慧教室设备采购项目系统集成服务合同
3712051                               人工智能与机器人领域创新成果产业化成熟度评价
3724277        民乐县现代农业投资有限责任公司民乐县人工智能一二三产业融合功能区食用菌菌棒生产项目  （）
Name: 合同名称, Length: 1323, dtype: object
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-构建省份字段&#34;&gt;3.2 构建省份字段&lt;/h3&gt;
&lt;p&gt;数据集中有 &lt;em&gt;&lt;strong&gt;采购人地址&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;采购人(甲方)&lt;/strong&gt;&lt;/em&gt; 两个可用于定位采购人的字段。经过测试，用 cpca 库分别从这两个字段提取省份，缺失率依次约为 24.8%、7%，因此我们采用缺失更少的 &lt;em&gt;&lt;strong&gt;采购人(甲方)&lt;/strong&gt;&lt;/em&gt; 来构建 &lt;em&gt;&lt;strong&gt;采购人省份&lt;/strong&gt;&lt;/em&gt; 字段。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cpca&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#cpca.transform 较慢，只调用一次并复用结果&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;provs_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cpca&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人(甲方)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;provs_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;省&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#去掉 自治区/特别行政区 后缀，统一省份名&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;自治区|特别行政区&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-按省分组查看记录量&#34;&gt;3.3 按省分组查看记录量&lt;/h3&gt;
&lt;p&gt;假设 &lt;em&gt;&lt;strong&gt;采购人省份&lt;/strong&gt;&lt;/em&gt; 字段构建准确，就可以用 df.groupby(&amp;lsquo;采购人省份&amp;rsquo;) 分组查看每个省的记录量。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt; 267312  (未知省份，cpca缺失字段，占比大概7%)
上海市 29493
云南省 49789
内蒙古 480459
北京市 71869
台湾省 93
吉林省 14219
四川省 155028
天津市 10734
宁夏回族 76783
安徽省 44133
山东省 14634
山西省 5784
广东省 1349039
广西壮族 12534
新疆维吾尔 8000
江苏省 28655
江西省 8949
河北省 203761
河南省 8159
浙江省 12158
海南省 38603
湖北省 6156
湖南省 11300
甘肃省 289772
福建省 97527
西藏 2558
贵州省 2599
辽宁省 34547
重庆市 58673
陕西省 55478
青海省 22441
香港 80
黑龙江省 253076
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;warnings&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;warnings&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filterwarnings&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;



&lt;span class=&#34;n&#34;&gt;prov_volumes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购人省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;prov_volumes&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prov&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prov&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)})&lt;/span&gt;
    
&lt;span class=&#34;n&#34;&gt;prov_volumes_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prov_volumes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;prov_volumes_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;prov&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ascending&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kind&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bar&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;政府采购数量(采购人按省)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xticks&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rotation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;45&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;采购公告数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;按采购人省份统计，采购记录最多的几个省份依次是广东、内蒙古、甘肃、黑龙江等。甘肃和黑龙江之间有一个空白条形：用 cpca 根据采购人(甲方)提取省份时，约有 7% 的记录无法识别省份，这些记录的省份为空字符串。&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三相关研究&#34;&gt;四、相关研究&lt;/h2&gt;
&lt;p&gt;相关研究近期文献&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[1]周亚虹,蒲余路,陈诗一等.政府扶持与新型产业发展——以新能源为例[J].经济研究,2015,50(06):147-161.
[2]武威,刘国平.政府采购与经济发展：转型效应与协同效应——基于产业结构升级视角[J].财政研究,2021(08):77-90.
[3]孙薇,叶初升.政府采购何以牵动企业创新——兼论需求侧政策“拉力”与供给侧政策“推力”的协同[J].中国工业经济,2023(01):95-113.
[4]姜爱华,费堃桀,张鑫娜.政府采购、营商环境与企业创新——基于A股上市公司的经验证据[J].中央财经大学学报,2022(09):3-15.
[5]梁平汉, 郭宇辰. 中国政府采购公告数据的使用和潜在问题[J]. 产业经济评论, 2023, (01): 68-80.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p><img loading="lazy" src="img/01-cover.png" alt=""  />
</p>
<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 数据来源: 中国政府采购网（www.ccgp.gov.cn）
- 记录数量: 3724395
- 发布时间: 1996-06-05 ~ 2024-03-07, 但主要是2015之后


声明: 科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><br><br></p>
<h2 id="二应用">二、应用</h2>
<p>随着政府采购规模的逐步增加，中国政府采购网披露的信息越来越丰富。近年来一些学者也关注到中国政府采购数据，但由于文本数据半结构化、高维、数据量大的特性，该数据在文本整理、关键变量识别与提取方面存在不小的难度，目前使用该数据的研究并不多。</p>
<h3 id="21-创新">2.1 创新</h3>
<p>姜爱华和费堃桀（2021）手工整理了 2015-2019 年的政府采购数据，利用公告中供应商的名称与上市公司全称进行匹配，最终得到 13004 个企业年度观测值，发现企业获得政府采购订单能够显著促进企业创新。</p>
<p>Beraja 等（2020）基于 2013-2019 年政府采购合同，与中国人工智能企业进行名单匹配，得到 28023 份政府人脸识别采购合同样本，发现政府采购对人脸识别相关的人工智能专利的增长起到了推动作用。</p>
<br>
<h3 id="22-政企关系">2.2 政企关系</h3>
<p>Fang 等（2022）利用中国政府采购网 2013-2020 年的采购公告与工商注册企业数据进行匹配，发现当本地官员处于激烈的政治竞争中时，本地政府将更少地向竞争地区的企业进行采购，这造成了市场分割，影响了资源分配。</p>
<br>
<h3 id="23-其他">2.3 其他</h3>
<p>政府采购还影响企业社会责任的履行（韩旭和武威，2021）、中国特色精准扶贫（武威等，2022）、经济发展（武威和刘国平，2021）等。此外，还有研究单独使用政府采购数据测量经济生产生活。江鸿泽和梁平汉（2022）基于政府采购公告整理了各地的公共视频监控系统使用情况，Liu 等（2022）则抓取了 2013-2021 年政府采购公告，用以识别企业的政治联系。</p>
<p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;政府采购公告1996-2024.3.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>

<span class="c1">#gz文件可用bandizp或winrar解压得到csv</span>
<span class="c1">#df = pd.read_csv(&#39;政府采购公告1996-2024.3.csv&#39;)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
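<p>整份 CSV 有三百多万行，一次读入内存压力较大。下面是一个假设性的小示例（样本数据与读取对象均为演示用，并非原数据文件），展示用 pandas 的 chunksize 参数分块读取、边读边筛选的思路：</p>

```python
import gzip
import io
import pandas as pd

# 构造一个很小的 gzip CSV 样本（虚构数据，仅为演示；
# 真实场景把 buf 换成 '政府采购公告1996-2024.3.csv.gz' 的路径即可）
raw = "合同名称,合同金额(万元)\n人工智能实验室采购合同,100\n办公用品采购合同,5\n"
buf = io.BytesIO(gzip.compress(raw.encode('utf-8')))

# chunksize 分块读取：每次只载入一小块，边读边筛选，降低内存峰值
parts = []
for chunk in pd.read_csv(buf, compression='gzip', chunksize=1):
    parts.append(chunk[chunk['合同名称'].str.contains('人工智能')])

ai_df = pd.concat(parts, ignore_index=True)
print(len(ai_df))
```

<p>分块得到的若干小 DataFrame 最后用 pd.concat 拼回一个结果表，筛选条件可以换成任意列上的过滤。</p>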
<h3 id="22-记录数">2.2 记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;数据集记录数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
</code></pre></div><pre><code>数据集记录数:  2883958
</code></pre>
<br>
<h3 id="23-字段">2.3 字段</h3>
<p>数据所含字段</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">合同编号
合同名称
项目编号
项目名称
采购人(甲方)
采购人地址
采购人联系方式
供应商(乙方)
供应商地址
供应商联系方式
主要标的名称
规格型号或服务要求
主要标的数量
主要标的单价
合同金额(万元)
履约期限、地点等简要信息
采购方式
合同签订日期
合同公告日期
其他补充事宜
所属地域
所属行业
代理机构
</code></pre></div><br>
<h3 id="24-公告日期">2.4 公告日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#数据集公告日期起止</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;发布时间&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;发布时间&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><pre><code>发布时间 1996-06-05 00:00:00
发布时间 2024-03-07 00:00:00
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#政府采购合同公告数据，主要出现在2015年之后</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">合同公告日期
1996          1
2000          1
2002          2
2004          7
2008          5
2009          3
2010          2
2011         13
2012          3
2013          4
2014         24
2015      15543
2016      42195
2017      94193
2018     154922
2019     151181
2020     187874
2021     549078
2022    1060710
2023    1355749
2024     112885
Name: count, dtype: int64
</code></pre></div><p>梁平汉和郭宇辰(2023) 认为 <strong>2015年财政部相关采购信息发布文件出台之后采购公告上传率大幅上升至80%以上，因此采用2015年以后的中国政府采购网数据进行研究更为合适</strong>。</p>
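<p>按照这一建议，实际分析时可以先把数据截断到 2015 年及以后。下面用一个假设的小样本演示筛选写法：</p>

```python
import pandas as pd

# 假设的小样本（真实数据中 df 由 read_csv 得到，列名相同）
df = pd.DataFrame({'合同公告日期': ['2014-05-01', '2015-01-10', '2023-12-31']})
df['合同公告日期'] = pd.to_datetime(df['合同公告日期'])

# 只保留 2015 年及以后的公告
df_2015 = df[df['合同公告日期'].dt.year >= 2015]
print(len(df_2015))
```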
<br>
<h3 id="24-甲乙方人数">2.5 甲(乙)方人数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#甲方乙方数量</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;采购人(甲方)数: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;采购人(甲方)&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;供应商(乙方)数: &#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;供应商(乙方)&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">())</span>
</code></pre></div><pre><code>采购人(甲方)数:  234082
供应商(乙方)数:  499943
</code></pre>
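<p>除了统计甲乙方去重数量，还可以用 value_counts 查看中标次数最多的供应商。以下是基于假设小样本的演示（公司名均为虚构）：</p>

```python
import pandas as pd

# 假设的小样本（真实数据中该列为 供应商(乙方)）
df = pd.DataFrame({'供应商(乙方)': ['甲公司', '乙公司', '甲公司', '丙公司', '甲公司']})

# nunique 统计去重后的供应商数；value_counts 查看各供应商中标次数
print(df['供应商(乙方)'].nunique())
print(df['供应商(乙方)'].value_counts().head(3))
```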
<p><br><br></p>
<h2 id="三实验代码">三、实验代码</h2><!-- 序号沿用原文 -->
<h3 id="31-是否含某类词">3.1 是否含某(类)词</h3>
<p>根据公告中是否出现某(类)词，可以提取一些指标。例如 Beraja 等（2020）基于 2013-2019 年政府采购合同，与中国人工智能企业进行名单匹配，得到 28023 份政府人脸识别采购合同样本。本文仅简单示范，以 <em><strong>人工智能</strong></em> 相关词为例。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;人工智能|自然语言处理|自动驾驶|AI|ai&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0          False
1          False
2          False
3          False
4          False
           ...  
3724390    False
3724391    False
3724392    False
3724393    False
3724394    False
Name: 合同名称, Length: 3724395, dtype: bool
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#AI相关公告的数量</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;人工智能|自然语言处理|自动驾驶|AI|ai&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1323
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#显示匹配到的与 AI 有关的【合同名称】</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;合同名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;人工智能|自然语言处理|自动驾驶|AI|ai&#39;</span><span class="p">)][</span><span class="s1">&#39;合同名称&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1129                                     贵州大学人工智能研究院建设项目采购合同
4935            龙岩初级中学人工智能创客实验室设备货物类采购项目合同\n （macrodatas.cn）
12231                       中国医学科学院系统医学研究院人工智能高性能计算设备采购合同协议书
13171      双高基于AIoT轨道交通智慧运维环境信号检测分析设备购置(二次)\n\n微信公众号“马克 数据网”
16921                           广州国际生物岛自动驾驶新能源环卫作业创新试点服务采购项目
                                 ...                        
3708596               榆林市教育技术中心人工智能助推教师队伍建设-教师发展智慧管理平台建设项目合同
3708922        邢台市信都区“人工智能公共技术服务平台”项目一标段数字教育、数字文旅采购合同\n\n （）
3709875         吴忠市第三中学南湖校区AI课堂教学行为分析评测系统及智慧教室设备采购项目系统集成服务合同
3712051                               人工智能与机器人领域创新成果产业化成熟度评价
3724277        民乐县现代农业投资有限责任公司民乐县人工智能一二三产业融合功能区食用菌菌棒生产项目  （）
Name: 合同名称, Length: 1323, dtype: object
</code></pre></div><br>
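<p>需要注意，上面的模式中 AI|ai 按子串匹配，可能误命中合同名称里英文单词中的 ai（如 email、maintain）。一个假设性的改进写法是用 case=False，并给英文关键词加上仅针对英文字母的边界断言（示例数据为虚构）：</p>

```python
import pandas as pd

# 假设的合同名称小样本（虚构数据，仅为演示）
names = pd.Series([
    'AI课堂教学行为分析系统采购',
    '邮件系统email运维服务采购',
    '人工智能研究院建设项目',
])

# (?<![A-Za-z])AI(?![A-Za-z]) 只在 AI 前后不是英文字母时匹配，
# 避免误命中 email、maintain 等英文单词里的 ai；case=False 兼容大小写
pattern = r'人工智能|自然语言处理|自动驾驶|(?<![A-Za-z])AI(?![A-Za-z])'
hits = names.str.contains(pattern, case=False, regex=True)
print(hits.tolist())
```

<p>这里不用 \b 词边界，是因为 Python 正则把汉字也算作单词字符，\bAI\b 会匹配不到“AI课堂”这类汉字紧邻的写法。</p>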
<h3 id="32-构建省份字段">3.2 构建省份字段</h3>
<p>数据集中有 <em><strong>采购人地址</strong></em>、<em><strong>采购人(甲方)</strong></em> 两个可用于定位采购人的字段。经过测试，用 cpca 库分别从这两个字段提取省份，缺失率依次约为 24.8%、7%，因此我们采用缺失更少的 <em><strong>采购人(甲方)</strong></em> 来构建 <em><strong>采购人省份</strong></em> 字段。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">cpca</span>

<span class="c1">#cpca.transform 较慢，只调用一次并复用结果</span>
<span class="n">provs_df</span> <span class="o">=</span> <span class="n">cpca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;采购人(甲方)&#39;</span><span class="p">])</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;采购人省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">provs_df</span><span class="p">[</span><span class="s1">&#39;省&#39;</span><span class="p">]</span>

<span class="c1">#去掉 自治区/特别行政区 后缀，统一省份名</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;采购人省份&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;采购人省份&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s1">&#39;自治区|特别行政区&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">k</span><span class="p">))</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
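<p>上文提到的缺失率可以这样计算(示意代码，非原文内容：这里用玩具数据模拟 cpca 的解析结果，并假设解析失败得到空字符串或 NaN；实际使用时应传入 cpca.transform(...)[&#39;省&#39;])：</p>

```python
import pandas as pd

# 示意：比较两个地址字段经 cpca 解析后"省"列的缺失率
def missing_rate(prov_series: pd.Series) -> float:
    """缺失率 = 空字符串或 NaN 的占比"""
    s = prov_series.fillna('')
    return (s == '').mean()

parsed_addr = pd.Series(['广东省', '', '河北省', None])      # 模拟"采购人地址"的解析结果
parsed_buyer = pd.Series(['广东省', '河北省', '甘肃省', ''])  # 模拟"采购人(甲方)"的解析结果
print(missing_rate(parsed_addr))   # 0.5
print(missing_rate(parsed_buyer))  # 0.25
```

<p>缺失率低的字段信息损失更小，这也是正文选用 采购人(甲方) 的原因。</p>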
<br>
<h3 id="33-按省分组查看记录量">3.3 按省分组查看记录量</h3>
<p>假设 <em><strong>采购人省份</strong></em> 构建得准确，就可以用 df.groupby(&lsquo;采购人省份&rsquo;) 分组查看每个省的记录量。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">prov</span><span class="p">,</span> <span class="n">prov_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;采购人省份&#39;</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">prov</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">prov_df</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> 267312  (未知省份，cpca缺失字段，占比大概7%)
上海市 29493
云南省 49789
内蒙古 480459
北京市 71869
台湾省 93
吉林省 14219
四川省 155028
天津市 10734
宁夏回族 76783
安徽省 44133
山东省 14634
山西省 5784
广东省 1349039
广西壮族 12534
新疆维吾尔 8000
江苏省 28655
江西省 8949
河北省 203761
河南省 8159
浙江省 12158
海南省 38603
湖北省 6156
湖南省 11300
甘肃省 289772
福建省 97527
西藏 2558
贵州省 2599
辽宁省 34547
重庆市 58673
陕西省 55478
青海省 22441
香港 80
黑龙江省 253076
</code></pre></div><br>
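<p>上面第一行无省份名的记录是 cpca 未能识别省份的部分。若只关心可识别省份的记录，可以在分组或绘图前先把省份为空的行过滤掉(示意代码，非原文内容，这里用玩具数据 df_demo 演示，实际应作用于前文的 df)：</p>

```python
import pandas as pd

# 示意：剔除省份为空(cpca 未识别)的记录
df_demo = pd.DataFrame({'采购人省份': ['广东省', '', '甘肃省', '广东省', '']})
known = df_demo[df_demo['采购人省份'] != '']
print(len(df_demo), len(known))  # 5 3
```

<p>另外，若只需要各省记录量，也可以直接用 value_counts() 代替 groupby 循环计数。</p>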
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>
<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>



<span class="n">prov_volumes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">prov</span><span class="p">,</span> <span class="n">prov_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;采购人省份&#39;</span><span class="p">):</span>
    <span class="n">prov_volumes</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">&#39;prov&#39;</span><span class="p">:</span> <span class="n">prov</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">prov_df</span><span class="p">)})</span>
    
<span class="n">prov_volumes_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">prov_volumes</span><span class="p">)</span>
<span class="n">prov_volumes_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;prov&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;volume&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;政府采购数量(采购人按省)&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;省份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;采购公告数量&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p>按省份统计，采购记录最多的几个省份依次是广东、内蒙古、甘肃、黑龙江等。图中甘肃和黑龙江之间有一根无标签的柱子，对应的是使用 cpca 从采购人(甲方)提取省份失败的约 7% 的记录。<br><br></p>
<h2 id="三相关研究">三、相关研究</h2>
<p>与本主题相关的近期研究文献如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]周亚虹,蒲余路,陈诗一等.政府扶持与新型产业发展——以新能源为例[J].经济研究,2015,50(06):147-161.
[2]武威,刘国平.政府采购与经济发展：转型效应与协同效应——基于产业结构升级视角[J].财政研究,2021(08):77-90.
[3]孙薇,叶初升.政府采购何以牵动企业创新——兼论需求侧政策“拉力”与供给侧政策“推力”的协同[J].中国工业经济,2023(01):95-113.
[4]姜爱华,费堃桀,张鑫娜.政府采购、营商环境与企业创新——基于A股上市公司的经验证据[J].中央财经大学学报,2022(09):3-15.
[5]梁平汉,郭宇辰.中国政府采购公告数据的使用和潜在问题[J].产业经济评论,2023(01):68-80.
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 众筹金融投资平台kiva借贷数据</title>
      <link>https://textdata.cn/blog/2024-04-10-kiva-crowdfunding/</link>
      <pubDate>Wed, 10 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-04-10-kiva-crowdfunding/</guid>
      <description>众筹</description>
      <content:encoded><![CDATA[<h2 id="一kiva简介">一、Kiva简介</h2>
<p>Kiva.org 是一个成立于 2005 年的国际非营利亲社会金融投资平台，其主要工作是通过众筹募集贷款，并以极低的利息发放给有需要的人们，帮助他们购买生活必需品，或找到一份能维持生计的工作。具体来说，这类 <strong>亲社会</strong> 金融投资平台在世界各地寻找合作伙伴（例如当地享有盛誉的非营利组织），由其筛选当地需要低息贷款或生活上遭受困难的人并收集其资料，然后向平台提交这些资料以请求帮助。平台通过众筹的方式为这些项目筹集贷款资金，投资者则可以以个人或团队的形式进行投资。</p>
<p><br><br></p>
<h2 id="二研究主题">二、研究主题</h2>
<ul>
<li>亲社会行为心理（Pro-Social Behaviorial Psychology)</li>
<li>社会公益 ML 应用（Social Good ML Applications ）</li>
<li>公平性研究（Fairness Research）</li>
<li>社会影响评估（Social Impact Assessments）</li>
</ul>
<p>部分参考文献</p>
<blockquote>
<p>Defazio, Daniela, Chiara Franzoni, and Cristina Rossi-Lamastra. &ldquo;How pro-social framing affects the success of crowdfunding projects: The role of emphasis and information crowdedness.&rdquo; <em>Journal of Business Ethics</em> 171 (2021): 357-378.</p>
</blockquote>
<p><br><br></p>
<h2 id="三获取数据">三、获取数据</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">网站: Kiva Tools

网址: http://kivatools.com/downloads

项目数(截至2024.4.10): 2187819

介绍: Kiva Tools 是一个帮助Kiva贷方更好地了解小额信贷和 Kiva 运营的网站。 Kiva 目前在多个国家开展业务，并生成大量数据。查看这些数据以更好地了解地理和经济是非常有教育意义的。注意：Kiva Tools不隶属于 Kiva，也不受 Kiva 认可。

声明: 科研用途； 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/kivatools.png" alt=""  />
</p>
<p><br>2024.4.10 打开 <a href="http://kivatools.com/downloads">http://kivatools.com/downloads</a> ，点击 <em><strong>All loans</strong></em> 对应的下载链接，最终得到 875M 的 csv 文件。</p>
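<p>875M 的 csv 若一次性读入，内存紧张的机器可能吃不消；pandas 的 read_csv 支持 chunksize 参数逐块读取(示意代码，非原文内容，这里用内存中的小 CSV 演示同样的用法)：</p>

```python
import io
import pandas as pd

# 示意：用 chunksize 逐块读取大 csv，降低内存占用
csv_text = "LOAN_ID,LOAN_AMOUNT\n1,500\n2,800\n3,1200\n"
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)  # 每个 chunk 都是一个小 DataFrame
print(total_rows)  # 3
```

<p>实际使用时把 io.StringIO(csv_text) 换成文件名 all_loans.csv 即可。</p>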
<p><br><br></p>
<h2 id="四查看数据">四、查看数据</h2>
<h3 id="41-导入数据">4.1 导入数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;all_loans.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
 <br>
<h3 id="42-所含字段">4.2 所含字段</h3>
<p>所含字段包含</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><p>字段详情</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> - LOAN_ID:    贷款ID
 - LOAN_NAME:   贷款项目(借款人)名称
 - FUNDED_AMOUNT:  该笔贷款已被 Kiva 出借人认购(出资)的金额
 - LOAN_AMOUNT: 贷款额度
 - STATUS:   贷款状态包括违约、还款和已付级别，请参阅 http://build.kiva.org/docs/data/loans 了解每个级别的含义
 - IMAGE_ID: 图片ID
 - VIDEO_ID: 视频ID
 - ACTIVITY_NAME: 活动类型(行业细分)
 - SECTOR_NAME: 所属行业部门
 - LOAN_USE: 借款用途
 - COUNTRY_CODE: 国家代码
 - COUNTRY_NAME: 国家名称
 - TOWN_NAME: 城镇名称
 - CURRENCY_POLICY: 货币政策
 - CURRENCY_EXCHANGE_COVERAGE_RATE: 货币兑换
 - CURRENCY: 货币类型
 - PARTNER_ID: 当地贷款机构的现场合作伙伴 ID，请参阅http://api.kivaws.org/v1/partners.json
 - POSTED_TIME: 项目发布时间
 - PLANNED_EXPIRATION_TIME: 项目截止时间
 - DISBURSE_TIME: 放款给借款人的时间; 注意，这笔钱可能在贷款于 Kiva 上发布之前就已支付给借款人
 - RAISED_TIME:   足额筹满(完成众筹)的时间
 - LENDER_TERM:   贷款期限(月)
 - NUM_LENDERS_TOTAL: 出借人总数
 - NUM_JOURNAL_ENTRIES: 借款人的日志更新条目数（即 Kiva 网站上的动态更新）
 - NUM_BULK_ENTRIES:
 - TAGS: 标签
 - BORROWER_NAMES:  借款人姓名
 - BORROWER_GENDERS: 借款人性别（有可能会存在多个借款人，所以数据类型为字符串或列表）
 - BORROWER_PICTURED:  借款人是否提供了图片
 - REPAYMENT_INTERVAL:  还款间隔
 - DISTRIBUTION_MODEL: 分销模式
</code></pre></div> <br>
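<p>其中 BORROWER_GENDERS 在存在多个借款人时是逗号分隔的字符串(如 &quot;female, female, male&quot;)，常见做法是拆分后构造女性借款人占比等变量(示意代码，非原文内容，字段格式为假设)：</p>

```python
# 示意：由 BORROWER_GENDERS 字符串计算女性借款人占比
def female_share(genders: str) -> float:
    parts = [g.strip() for g in str(genders).split(',') if g.strip()]
    if not parts:
        return float('nan')  # 无有效信息时返回 NaN
    return sum(g == 'female' for g in parts) / len(parts)

print(female_share('female, female, male'))  # 约 0.667
print(female_share('male'))                  # 0.0
```

<p>实际使用时可 df[&#39;BORROWER_GENDERS&#39;].apply(female_share) 批量生成该变量。</p>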
<h3 id="43-行业">4.3 行业</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>


<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;SECTOR_NAME&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;pie&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Kiva项目所属行业部门分布&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/pie.png" alt=""  />
</p>
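<p>饼图背后就是 value_counts(normalize=True) 给出的各行业占比(示意代码，非原文内容，用玩具数据演示该用法)：</p>

```python
import pandas as pd

# 示意：normalize=True 时 value_counts 返回各类别的占比而非计数
s = pd.Series(['Agriculture', 'Food', 'Agriculture', 'Retail'])
props = s.value_counts(normalize=True)
print(props.to_dict())  # {'Agriculture': 0.5, 'Food': 0.25, 'Retail': 0.25}
```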
<br>
<h3 id="44-国家项目数量">4.4 国家项目数量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>

<span class="n">props</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;COUNTRY_NAME&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">props_</span> <span class="o">=</span> <span class="n">props</span><span class="p">[</span><span class="n">props</span><span class="o">&gt;=</span><span class="mf">0.01</span><span class="p">]</span>
<span class="n">props_</span><span class="p">[</span><span class="s1">&#39;Others&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">props</span><span class="p">[</span><span class="n">props</span><span class="o">&lt;</span><span class="mf">0.01</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>

<span class="n">props_</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;pie&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;国家Kiva项目数量分布&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/pie2.png" alt=""  />
</p>
<p>Kiva 贷款项目数量最多的国家是菲律宾，其后(按数量递减)依次是肯尼亚、柬埔寨、秘鲁、萨尔瓦多、乌干达等。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>pandas技巧  | DataFrame的四则运算</title>
      <link>https://textdata.cn/blog/2024-03-29-dataframe-add-sub-mul-div/</link>
      <pubDate>Fri, 29 Mar 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-03-29-dataframe-add-sub-mul-div/</guid>
      <description>&lt;p&gt;DataFrame的四则运算， 涉及到标量数字与数组(列表、series、字典、dataframe)。我们先构造实验数据df&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;angles&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                   &lt;span class=&#34;s1&#34;&gt;&amp;#39;degrees&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;360&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;180&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;360&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;circle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;triangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rectangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一标量&#34;&gt;一、标量&lt;/h2&gt;
&lt;p&gt;这里体现的就是 pandas 的广播特性， 使得df可以直接与标量进行运算。以加法为例，&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.add(10)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-add.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;其他运算&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#df - 10&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.sub(10)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#df * 10&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.mul(10)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#df / 10&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.div(10)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二数组&#34;&gt;二、数组&lt;/h2&gt;
&lt;p&gt;df与数组(列表、series、字典、dataframe)等进行运算&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
 &lt;br&gt;
&lt;p&gt;df有两列， [1, 2]有两个元素。默认轴方向为columns， 两者相减&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.sub([1, 2], axis=&amp;#39;columns&amp;#39;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-sub-list.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;生成一个series数据， 有三行， 索引名设置为circle、triangle、rectangle。&lt;/p&gt;
&lt;p&gt;df与series相减， 轴方向设置为index&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;series&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; 
                 &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;circle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;triangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rectangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;series&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;index&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-sub-series.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;字典有两个字段名， 与df字段名相同。 轴方向设置为columns， 两者相乘&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mul&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;angles&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;degrees&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;columns&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-mul-dict.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;字典有三个字段名， 与df的index相同。 轴方向设置为index， 两者相乘&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mul&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;circle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;triangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rectangle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;index&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-mul-dict-index.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<p>DataFrame的四则运算， 涉及到标量数字与数组(列表、series、字典、dataframe)。我们先构造实验数据df</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;angles&#39;</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
                   <span class="s1">&#39;degrees&#39;</span><span class="p">:</span> <span class="p">[</span><span class="mi">360</span><span class="p">,</span> <span class="mi">180</span><span class="p">,</span> <span class="mi">360</span><span class="p">]},</span>
                  <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;circle&#39;</span><span class="p">,</span> <span class="s1">&#39;triangle&#39;</span><span class="p">,</span> <span class="s1">&#39;rectangle&#39;</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h2 id="一标量">一、标量</h2>
<p>这里体现的就是 pandas 的广播特性， 使得df可以直接与标量进行运算。以加法为例，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">+</span> <span class="mi">10</span>
<span class="c1">#df.add(10)</span>


</code></pre></div><p><img loading="lazy" src="img/02-add.png" alt=""  />
</p>
<p>其他算法</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#df - 10</span>
<span class="c1">#df.sub(10)</span>

<span class="c1">#df * 10</span>
<span class="c1">#df.mul(10)</span>

<span class="c1">#df / 10</span>
<span class="c1">#df.div(10)</span>
</code></pre></div><p><br><br></p>
<h2 id="二数组">二、数组</h2>
<p>df与数组(列表、series、字典、dataframe)等进行运算</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
 <br>
<p>df有两列， [1, 2]有两个元素。默认轴方向为columns， 两者相减</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">-</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="c1">#df.sub([1, 2], axis=&#39;columns&#39;)</span>
</code></pre></div><p><img loading="lazy" src="img/03-sub-list.png" alt=""  />
</p>
<br>
<p>生成一个series数据， 有三行， 索引名设置为circle、triangle、rectangle。</p>
<p>df与series相减， 轴方向设置为index</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">series</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> 
                 <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;circle&#39;</span><span class="p">,</span> <span class="s1">&#39;triangle&#39;</span><span class="p">,</span> <span class="s1">&#39;rectangle&#39;</span><span class="p">])</span>

<span class="n">df</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">series</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="s1">&#39;index&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-sub-series.png" alt=""  />
</p>
<br>
<p>字典有两个字段名， 与df字段名相同。 轴方向设置为columns， 两者相乘</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">mul</span><span class="p">({</span><span class="s1">&#39;angles&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;degrees&#39;</span><span class="p">:</span> <span class="mi">2</span><span class="p">},</span> <span class="n">axis</span><span class="o">=</span><span class="s1">&#39;columns&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-mul-dict.png" alt=""  />
</p>
<br>
<p>字典有三个字段名， 与df的index相同。 轴方向设置为index， 两者相乘</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">mul</span><span class="p">({</span><span class="s1">&#39;circle&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;triangle&#39;</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;rectangle&#39;</span><span class="p">:</span> <span class="mi">3</span><span class="p">},</span> <span class="n">axis</span><span class="o">=</span><span class="s1">&#39;index&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-mul-dict-index.png" alt=""  />
</p>
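<p>上述三种广播运算可以合成一个可运行的小例子（示例数据为假设的 angles/degrees 表，与文中截图结构一致）：</p>

```python
import pandas as pd

# 构造示例 df：两列(angles、degrees)，索引为 circle、triangle、rectangle
df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# 列表沿 columns 方向广播：angles 列减 1，degrees 列减 2
print(df - [1, 2])

# Series 沿 index 方向广播：每行减去对应索引位置上的值
series = pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle'])
print(df.sub(series, axis='index'))

# 字典按列名对齐（这里先转成 Series，以兼容旧版 pandas）：angles 乘 0，degrees 乘 2
print(df.mul(pd.Series({'angles': 0, 'degrees': 2}), axis='columns'))
```

<p>三次运算分别对应文中"列表按列广播、Series 按行广播、字典按列名对齐"三种情形。</p>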
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>ANCW | 4030词的中文情感词典(效价、唤醒度、主导度、具体性)</title>
      <link>https://textdata.cn/blog/2024-02-27-ancw-affective-norms-for-4030-chinese-words/</link>
      <pubDate>Tue, 27 Feb 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-02-27-ancw-affective-norms-for-4030-chinese-words/</guid>
      <description>&lt;p&gt;Ying, Lv, Ye Ruyang, Ni Chuanbin, Wang Yeqing, Liu Qing, Zhou Yufan, and Gao Fei. &amp;ldquo;ANCW: Affective norms for 4030 Chinese words.&amp;rdquo; &lt;em&gt;Behavior Research Methods&lt;/em&gt; (2023): 1-16.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;一摘要&#34;&gt;一、摘要&lt;/h2&gt;
&lt;p&gt;单词中包含的情感信息越来越受到世界各地神经语言学家和心理语言学家的关注。本研究建立了情感词典ANCW(Affective Norms for Chinese Words)，对 4030 个词语的&lt;strong&gt;效价valence&lt;/strong&gt;、&lt;strong&gt;唤醒度arousal&lt;/strong&gt;、&lt;strong&gt;支配性dominance&lt;/strong&gt;和&lt;strong&gt;具体性concreteness&lt;/strong&gt; 进行了打分，这些词语由 CET-4（全国大学英语四级考试）官方大纲词表改编为中文。相比于现有的中文情感词典CAWS(Chinese Affective Words System)，ANCW 收录了更多、更丰富的中文词汇。研究使用 7 点李克特量表（范围从 1 到 7），获得了 3717 名中国本科生对所有变量的评分。ANCW 具有良好的评分信度，且与先前的中文常模研究结果相一致。成对相关分析揭示了效价与唤醒度、唤醒度与支配性以及效价与具体性之间的二次关系；此外，效价与支配性、唤醒度与具体性均呈线性相关，具体性与支配性也存在相关。ANCW 为涉及情感语言加工的后续研究提供了可靠且标准化的刺激材料。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二文献梳理&#34;&gt;二、文献梳理&lt;/h2&gt;
&lt;p&gt;语言和情感是人类生活不可分割的一部分。在过去的二十年里，词语的情感评级受到了极大的关注。研究人员建立了许多标准化数据库，从不同维度对不同语言的单词进行评级。传统上，维度观将情感视为多个维度上的连续体（Ćoso et al., 2019；Rubin &amp;amp; Talarico, 2009），所有情感都具有两个或三个维度的特征（Duffy, 1934；Osgood et al., 1957）。根据 Carroll、Osgood、Suci 和 Tannenbaum（1959）的情感理论，研究者对词语进行了大量的情感评级：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;效价valence&lt;/strong&gt; 是指令人愉快的程度，范围从不愉快到愉快；&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;唤醒度arousal&lt;/strong&gt; 是生理激活程度的指标，范围从平静到兴奋；&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;支配性dominance&lt;/strong&gt; 描述了个人所感受到的控制程度，从失控到受控。近年来，心理语言学变量具体性的研究引起了人们的浓厚兴趣。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;根据 Gilhooly 和 Logie（1980）的观点，&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;具体性concreteness&lt;/strong&gt; 代表了形成单词心理形象的难度程度，范围从抽象（难以形成）到具体（易于形成）。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;构建词语情感评级数据库的需求很大，因为它们至少有助于四个方面的研究：针对情绪本身的研究、情绪特征对单词加工和记忆的影响、整条消息或文本所表达的情绪，以及通过将新词与已验证词进行比较来评估新词的情感价值（综述参见 Warriner et al., 2013）。到目前为止，研究者已用多种语言构建了各类数据库，为后续研究提供了丰富的刺激材料和可靠测量的情绪特征。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-old-dicts.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;从上述文献中，我们可以看到针对不同语言建立了各种各样的包含情感评级的数据库，以满足日益增长的情感研究需求。然而，据我们所知，该领域还存在一些有待进一步研究的地方：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;大多数数据库由西方国家建立，而一些研究已证实情感评级会因文化而异。因此，建立中国本土的情感常模数据库十分迫切。&lt;/li&gt;
&lt;li&gt;国内以往的研究在制定标准化的情绪刺激上付出了很大努力，并且使用了多样化的刺激。在这些刺激中，言语刺激可以得到更严格的控制，并与其他刺激具有可比性，例如需要在复杂性、亮度、颜色和对比度上进行控制的图片(&lt;a href=&#34;https://link.springer.com/article/10.3758/s13428-023-02226-x#ref-CR60&#34;&gt;Soares et al., 2012&lt;/a&gt;)。&lt;/li&gt;
&lt;li&gt;最重要的是，以往的研究限制了词语的字数。例如，AANC（Liu et al., 2021）仅由四字词组成，而 Yao 等人（2016）建立的另一个数据库仅包含二字词。众所周知，汉语构词非常灵活：单个汉字可以成词，如“书”“美”“杀”；两个或多个汉字也可以组成一个词，如“生活”“白日梦”“色彩斑斓”。日常使用的词语并不限于二字词或四字词，对字数的限制在一定程度上损害了表达的丰富性和灵活性。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;鉴于这些局限性，本研究旨在建立一个标准化、多维、不限制字数的汉语词语情感规范数据库。此外，本研究将采用多种方法检验ANCW的可靠性，为进一步研究情感和心理语言变量之间的关系提供更多证据。总体而言，本研究在一定程度上弥补了上述局限性。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三-方法&#34;&gt;三、 方法&lt;/h2&gt;
&lt;h3 id=&#34;31-参与者&#34;&gt;3.1 参与者&lt;/h3&gt;
&lt;p&gt;共有 3717 名母语为中文的人参与了这项研究。所有参与者均为中国 41 所大学除英语专业以外的其他专业本科生（女性 2346 名，男性 1258 名，无性别信息 113 名；M年龄= 19.91，范围 16-25，SD = 1.21）。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-确定词语列表&#34;&gt;3.2 确定词语列表&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;从英语四级CET-4的教学大纲中找出4030个英语单词&lt;/strong&gt;，大学英语四级大纲中的词汇出现频率较高，且与学员的日常生活密切相关。&lt;/p&gt;
&lt;p&gt;翻译经过三道严格的程序完成。第一轮翻译依据《牛津高阶英汉双解词典（第9版）》和英国国家语料库（BNC）。该研究采用《牛津高阶英汉双解词典（第9版）》中的首个中文释义，将词表翻译成中文。有些词有多个词性。例如，“stem”既可以是名词也可以是动词：名词义为“植物在地面上长出叶子或花朵的细长主干；以及从主干生出并支撑花朵或叶子的较小部分”（Stem，2018），动词义为“阻止某些正在流动或增加的东西”（Stem，2018）。在这种情况下，我们根据英国国家语料库选择词频较高的词性。在此过程之后，研究得到了 672 个单词的一致翻译。&lt;/p&gt;
&lt;p&gt;在第二个翻译阶段，本研究采用了德尔菲法。我们邀请了五位精通英语文化和中国文化的专业翻译人员来完成这项工作。翻译过程中，五位专业人士先在不经讨论的情况下各自翻译这 672 个单词；随后研究比较了他们的译文，找出五位译者意见不一致的词语。经过四轮匿名讨论，我们为 553 个单词确定了唯一、不重复的中文译法。&lt;/p&gt;
&lt;p&gt;经过这一步，剩下了 186 个与中文翻译一致的单词。为了确保每个翻译不重复，研究在中文翻译后标记了原始英文单词或该单词的词性。最终获得了英语四级英语单词大纲的翻译版，包含4030个中文单词。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;我们将 4030 个中文单词的列表随机分为 20 个子列表，每个子列表包含 201 或 202 个单词。根据该研究的设计，每个单词的每个维度（唤醒度、效价、支配性和具体性）都会被评估至少 45 次。&lt;/strong&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-设计问卷&#34;&gt;3.3 设计问卷&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;每份试卷均包含一个信息部分、说明和评分表。本研究采用7点李克特自评量表进行打分&lt;/strong&gt;。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;效价描述了刺激引起愉悦感的程度（Russell，1980；Bradley &amp;amp; Lang，1999）。数字1表示非常不愉快，4表示一般，7表示非常愉快。&lt;/li&gt;
&lt;li&gt;唤醒，也称为激活、强度或能量水平（Montefinese 等，2014），用于描述身体被激活或唤醒的程度（Duffy，1934）。该研究用1表示极度平静，4表示中性，7表示极度兴奋。&lt;/li&gt;
&lt;li&gt;支配性被定义为个体对刺激的控制或影响程度，范围从完全失控到完全控制（Russell &amp;amp; Mehrabian，1977）。研究用1代表受试者感觉完全被该词控制（如“盛行”一词），4代表中立，7代表受试者感觉能够完全控制该词（如“弱”一词）。&lt;/li&gt;
&lt;li&gt;具体性是指形成单词物理所指的心理图像的困难程度。该研究使用1表示极端抽象，4表示中性，7表示极端具体。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;34-步骤&#34;&gt;3.4 步骤&lt;/h3&gt;
&lt;p&gt;本研究采用&lt;strong&gt;纸笔评分法&lt;/strong&gt;(paper-pencil rating method) 。每个参与者随机收到一个单词子列表。在试卷的第一页，该研究为每个维度（效价、唤醒度、支配性和具体性）提供了清晰的中文说明和生动的例子。参与者收到试卷后，研究口头提供了清晰的说明解释。试卷的第二页和第三页是A4纸上打印的中文单词和等级量表。每个参与者在安静的教室里对一张试卷进行评分。由于所有单词都是汉语，而且四级单词在社会生活中广泛使用，因此没有参与者对单词的含义有疑问。&lt;/p&gt;
&lt;p&gt;鉴于之前的研究（谢，2020；张，2020），数据修剪规则如下所示，如果试卷满足其中一条规则，则将被视为无效。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;70%以上的评级结果缺失；&lt;/li&gt;
&lt;li&gt;70%以上的评级结果相同；&lt;/li&gt;
&lt;li&gt;试卷表现出明显的敌意。例如，一些参与者在试卷上留下侮辱性的评论，例如“我只是随意圈出数字来欺骗你们，傻瓜”。&lt;/li&gt;
&lt;li&gt;此外，若答案沿之字形(zigzag)模式规律勾选、明显属于随意作答，该问卷同样被视为敌意问卷。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;最终我们共收集到3304份试卷。其中，效价评分试卷 858 份，唤醒度评分试卷 803 份，支配性评分试卷 777 份，具体性评分试卷 866 份。每个维度中少量缺失评分以均值代替。删除无效数据后的最终数据库共包含4030个单词，平均每个单词获得 42.9 次效价评分、40.2 次唤醒度评分、43.3 次具体性评分和 38.9 次支配性评分。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四ancw词典&#34;&gt;四、ANCW词典&lt;/h2&gt;
&lt;p&gt;ANCW 词典下载链接: https://pan.baidu.com/s/1UfbmVQh9XM77eoGmMsZ2-w?pwd=bp63  提取码: bp63&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-ancw.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-ancw.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;相关文献&#34;&gt;相关文献&lt;/h2&gt;
&lt;p&gt;Xu, X., Li, J., &amp;amp; Chen, H. (2021). Valence and arousal ratings for 11,310 simplified Chinese words. &lt;em&gt;Behavior Research Methods, 54&lt;/em&gt;(1), 26–41. &lt;a href=&#34;https://doi.org/10.3758/s13428-021-01607-4&#34;&gt;https://doi.org/10.3758/s13428-021-01607-4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yao, Z., Wu, J., Zhang, Y., &amp;amp; Wang, Z. (2016). Norms of valence, arousal, concreteness, familiarity, imageability, and context availability for 1,100 Chinese words. &lt;em&gt;Behavior Research Methods, 49&lt;/em&gt;(4), 1374–1385. &lt;a href=&#34;https://doi.org/10.3758/s13428-016-0793-2&#34;&gt;https://doi.org/10.3758/s13428-016-0793-2&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Yuan, J., Zhang, Y., Chen, S., Luo, L., &amp;amp; Ru, Y. (2021). The establishment of Chinese Emotion Regulation Word System (CERWS) and its pilot test. &lt;em&gt;Acta Psychologica Sinica, 53&lt;/em&gt;(&lt;em&gt;5&lt;/em&gt;), 445. &lt;a href=&#34;https://doi.org/10.3724/sp.j.1041.2021.00445&#34;&gt;https://doi.org/10.3724/sp.j.1041.2021.00445&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<p>Ying, Lv, Ye Ruyang, Ni Chuanbin, Wang Yeqing, Liu Qing, Zhou Yufan, and Gao Fei. &ldquo;ANCW: Affective norms for 4030 Chinese words.&rdquo; <em>Behavior Research Methods</em> (2023): 1-16.</p>
<br>
<br>
<h2 id="一摘要">一、摘要</h2>
<p>单词中包含的情感信息越来越受到世界各地神经语言学家和心理语言学家的关注。本研究建立了情感词典ANCW(Affective Norms for Chinese Words)，对 4030 个词语的<strong>效价valence</strong>、<strong>唤醒度arousal</strong>、<strong>支配性dominance</strong>和<strong>具体性concreteness</strong> 进行了打分，这些词语由 CET-4（全国大学英语四级考试）官方大纲词表改编为中文。相比于现有的中文情感词典CAWS(Chinese Affective Words System)，ANCW 收录了更多、更丰富的中文词汇。研究使用 7 点李克特量表（范围从 1 到 7），获得了 3717 名中国本科生对所有变量的评分。ANCW 具有良好的评分信度，且与先前的中文常模研究结果相一致。成对相关分析揭示了效价与唤醒度、唤醒度与支配性以及效价与具体性之间的二次关系；此外，效价与支配性、唤醒度与具体性均呈线性相关，具体性与支配性也存在相关。ANCW 为涉及情感语言加工的后续研究提供了可靠且标准化的刺激材料。</p>
<br>
<br>
<h2 id="二文献梳理">二、文献梳理</h2>
<p>语言和情感是人类生活不可分割的一部分。在过去的二十年里，词语的情感评级受到了极大的关注。研究人员建立了许多标准化数据库，从不同维度对不同语言的单词进行评级。传统上，维度观将情感视为多个维度上的连续体（Ćoso et al., 2019；Rubin &amp; Talarico, 2009），所有情感都具有两个或三个维度的特征（Duffy, 1934；Osgood et al., 1957）。根据 Carroll、Osgood、Suci 和 Tannenbaum（1959）的情感理论，研究者对词语进行了大量的情感评级：</p>
<ul>
<li><strong>效价valence</strong> 是指令人愉快的程度，范围从不愉快到愉快；</li>
<li><strong>唤醒度arousal</strong> 是生理激活程度的指标，范围从平静到兴奋；</li>
<li><strong>支配性dominance</strong> 描述了个人所感受到的控制程度，从失控到受控。近年来，心理语言学变量具体性的研究引起了人们的浓厚兴趣。</li>
</ul>
<p>根据 Gilhooly 和 Logie（1980）的观点，</p>
<ul>
<li><strong>具体性concreteness</strong> 代表了形成单词心理形象的难度程度，范围从抽象（难以形成）到具体（易于形成）。</li>
</ul>
<br>
<p>构建词语情感评级数据库的需求很大，因为它们至少有助于四个方面的研究：针对情绪本身的研究、情绪特征对单词加工和记忆的影响、整条消息或文本所表达的情绪，以及通过将新词与已验证词进行比较来评估新词的情感价值（综述参见 Warriner et al., 2013）。到目前为止，研究者已用多种语言构建了各类数据库，为后续研究提供了丰富的刺激材料和可靠测量的情绪特征。</p>
<p><img loading="lazy" src="img/01-old-dicts.png" alt=""  />
</p>
<br>
<p>从上述文献中，我们可以看到针对不同语言建立了各种各样的包含情感评级的数据库，以满足日益增长的情感研究需求。然而，据我们所知，该领域还存在一些有待进一步研究的地方：</p>
<ul>
<li>大多数数据库由西方国家建立，而一些研究已证实情感评级会因文化而异。因此，建立中国本土的情感常模数据库十分迫切。</li>
<li>国内以往的研究在制定标准化的情绪刺激上付出了很大努力，并且使用了多样化的刺激。在这些刺激中，言语刺激可以得到更严格的控制，并与其他刺激具有可比性，例如需要在复杂性、亮度、颜色和对比度上进行控制的图片(<a href="https://link.springer.com/article/10.3758/s13428-023-02226-x#ref-CR60">Soares et al., 2012</a>)。</li>
<li>最重要的是，以往的研究限制了词语的字数。例如，AANC（Liu et al., 2021）仅由四字词组成，而 Yao 等人（2016）建立的另一个数据库仅包含二字词。众所周知，汉语构词非常灵活：单个汉字可以成词，如“书”“美”“杀”；两个或多个汉字也可以组成一个词，如“生活”“白日梦”“色彩斑斓”。日常使用的词语并不限于二字词或四字词，对字数的限制在一定程度上损害了表达的丰富性和灵活性。</li>
</ul>
<p>鉴于这些局限性，本研究旨在建立一个标准化、多维、不限制字数的汉语词语情感规范数据库。此外，本研究将采用多种方法检验ANCW的可靠性，为进一步研究情感和心理语言变量之间的关系提供更多证据。总体而言，本研究在一定程度上弥补了上述局限性。</p>
<p><br><br></p>
<h2 id="三-方法">三、 方法</h2>
<h3 id="31-参与者">3.1 参与者</h3>
<p>共有 3717 名母语为中文的人参与了这项研究。所有参与者均为中国 41 所大学除英语专业以外的其他专业本科生（女性 2346 名，男性 1258 名，无性别信息 113 名；M年龄= 19.91，范围 16-25，SD = 1.21）。</p>
<br>
<h3 id="32-确定词语列表">3.2 确定词语列表</h3>
<p><strong>从英语四级CET-4的教学大纲中找出4030个英语单词</strong>，大学英语四级大纲中的词汇出现频率较高，且与学员的日常生活密切相关。</p>
<p>翻译经过三道严格的程序完成。第一轮翻译依据《牛津高阶英汉双解词典（第9版）》和英国国家语料库（BNC）。该研究采用《牛津高阶英汉双解词典（第9版）》中的首个中文释义，将词表翻译成中文。有些词有多个词性。例如，“stem”既可以是名词也可以是动词：名词义为“植物在地面上长出叶子或花朵的细长主干；以及从主干生出并支撑花朵或叶子的较小部分”（Stem，2018），动词义为“阻止某些正在流动或增加的东西”（Stem，2018）。在这种情况下，我们根据英国国家语料库选择词频较高的词性。在此过程之后，研究得到了 672 个单词的一致翻译。</p>
<p>在第二个翻译阶段，本研究采用了德尔菲法。我们邀请了五位精通英语文化和中国文化的专业翻译人员来完成这项工作。翻译过程中，五位专业人士先在不经讨论的情况下各自翻译这 672 个单词；随后研究比较了他们的译文，找出五位译者意见不一致的词语。经过四轮匿名讨论，我们为 553 个单词确定了唯一、不重复的中文译法。</p>
<p>经过这一步，剩下了 186 个与中文翻译一致的单词。为了确保每个翻译不重复，研究在中文翻译后标记了原始英文单词或该单词的词性。最终获得了英语四级英语单词大纲的翻译版，包含4030个中文单词。</p>
<p><strong>我们将 4030 个中文单词的列表随机分为 20 个子列表，每个子列表包含 201 或 202 个单词。根据该研究的设计，每个单词的每个维度（唤醒度、效价、支配性和具体性）都会被评估至少 45 次。</strong></p>
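<p>这一随机分组步骤可以用几行 Python 勾勒（词表为占位数据，分组方式为"尽量均分"的一种常见实现，并非论文公开的原始代码）：</p>

```python
import random

# 占位词表：真实研究中为 4030 个中文词
words = [f"词{i}" for i in range(4030)]
random.seed(42)
random.shuffle(words)

# 将打乱后的词表尽量均分为 20 个子列表：得到若干 202 词与 201 词的子表
n = 20
base, extra = divmod(len(words), n)   # base=201, extra=10
sublists, start = [], 0
for i in range(n):
    size = base + (1 if i < extra else 0)
    sublists.append(words[start:start + size])
    start += size

print(sorted({len(s) for s in sublists}))  # [201, 202]
```

<p>4030 除以 20 商 201 余 10，因此恰好产生 10 个 202 词与 10 个 201 词的子列表，与文中"每个子列表包含 201 或 202 个单词"一致。</p>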
<br>
<h3 id="33-设计问卷">3.3 设计问卷</h3>
<p><strong>每份试卷均包含一个信息部分、说明和评分表。本研究采用7点李克特自评量表进行打分</strong>。</p>
<ul>
<li>效价描述了刺激引起愉悦感的程度（Russell，1980；Bradley &amp; Lang，1999）。数字1表示非常不愉快，4表示一般，7表示非常愉快。</li>
<li>唤醒，也称为激活、强度或能量水平（Montefinese 等，2014），用于描述身体被激活或唤醒的程度（Duffy，1934）。该研究用1表示极度平静，4表示中性，7表示极度兴奋。</li>
<li>支配性被定义为个体对刺激的控制或影响程度，范围从完全失控到完全控制（Russell &amp; Mehrabian，1977）。研究用1代表受试者感觉完全被该词控制（如“盛行”一词），4代表中立，7代表受试者感觉能够完全控制该词（如“弱”一词）。</li>
<li>具体性是指形成单词物理所指的心理图像的困难程度。该研究使用1表示极端抽象，4表示中性，7表示极端具体。</li>
</ul>
<br>
<h3 id="34-步骤">3.4 步骤</h3>
<p>本研究采用<strong>纸笔评分法</strong>(paper-pencil rating method) 。每个参与者随机收到一个单词子列表。在试卷的第一页，该研究为每个维度（效价、唤醒度、支配性和具体性）提供了清晰的中文说明和生动的例子。参与者收到试卷后，研究口头提供了清晰的说明解释。试卷的第二页和第三页是A4纸上打印的中文单词和等级量表。每个参与者在安静的教室里对一张试卷进行评分。由于所有单词都是汉语，而且四级单词在社会生活中广泛使用，因此没有参与者对单词的含义有疑问。</p>
<p>鉴于之前的研究（谢，2020；张，2020），数据修剪规则如下所示，如果试卷满足其中一条规则，则将被视为无效。</p>
<ul>
<li>70%以上的评级结果缺失；</li>
<li>70%以上的评级结果相同；</li>
<li>试卷表现出明显的敌意。例如，一些参与者在试卷上留下侮辱性的评论，例如“我只是随意圈出数字来欺骗你们，傻瓜”。</li>
<li>此外，若答案沿之字形(zigzag)模式规律勾选、明显属于随意作答，该问卷同样被视为敌意问卷。</li>
</ul>
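<p>前两条可量化的筛除规则可以用 pandas 勾勒如下（示意代码，数据与阈值写法均为假设，非论文原始实现）：</p>

```python
import pandas as pd

def is_invalid(ratings: pd.Series) -> bool:
    """按文中前两条规则判断一份问卷是否无效"""
    if ratings.isna().mean() > 0.7:        # 规则1：70%以上评分缺失
        return True
    answered = ratings.dropna()
    if len(answered) and answered.value_counts(normalize=True).iloc[0] > 0.7:
        return True                        # 规则2：70%以上评分相同
    return False

# 两份假设问卷：paper1 大面积缺失，paper2 正常作答
paper1 = pd.Series([4, None, None, None, None, None, None, None, None, None])
paper2 = pd.Series([1, 3, 5, 7, 2, 4, 6, 3, 5, 2])
print(is_invalid(paper1), is_invalid(paper2))  # True False
```

<p>后两条规则（敌意评论、之字形作答）依赖人工判断，无法如此自动化。</p>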
<p>最终我们共收集到3304份试卷。其中，效价评分试卷 858 份，唤醒度评分试卷 803 份，支配性评分试卷 777 份，具体性评分试卷 866 份。每个维度中少量缺失评分以均值代替。删除无效数据后的最终数据库共包含4030个单词，平均每个单词获得 42.9 次效价评分、40.2 次唤醒度评分、43.3 次具体性评分和 38.9 次支配性评分。</p>
<p><br><br></p>
<h2 id="四ancw词典">四、ANCW词典</h2>
<p>ANCW 词典下载链接: https://pan.baidu.com/s/1UfbmVQh9XM77eoGmMsZ2-w?pwd=bp63  提取码: bp63</p>
<p><img loading="lazy" src="img/02-ancw.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-ancw.png" alt=""  />
</p>
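<p>下载词典后即可按词查分。下面是一个最小示意（词典片段与列名均为假设，实际请以下载文件中的字段为准）：</p>

```python
import pandas as pd

# 假设的词典片段：真实数据需从上文网盘链接下载
ancw = pd.DataFrame({
    'word':         ['快乐', '灾难', '桌子'],
    'valence':      [6.5, 1.8, 4.1],
    'arousal':      [5.9, 6.2, 2.3],
    'dominance':    [5.1, 2.4, 4.6],
    'concreteness': [3.0, 3.5, 6.8],
})
val = dict(zip(ancw['word'], ancw['valence']))

# 对一段已分词的文本计算平均效价，词典外的词直接跳过
tokens = ['快乐', '的', '灾难']
hits = [val[t] for t in tokens if t in val]
score = sum(hits) / len(hits) if hits else None
print(score)  # (6.5 + 1.8) / 2 = 4.15
```

<p>同样的字典映射方法可以换成 arousal、dominance、concreteness 列，得到文本在其他三个维度上的得分。</p>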
<p><br><br></p>
<h2 id="相关文献">相关文献</h2>
<p>Xu, X., Li, J., &amp; Chen, H. (2021). Valence and arousal ratings for 11,310 simplified Chinese words. <em>Behavior Research Methods, 54</em>(1), 26–41. <a href="https://doi.org/10.3758/s13428-021-01607-4">https://doi.org/10.3758/s13428-021-01607-4</a></p>
<p>Yao, Z., Wu, J., Zhang, Y., &amp; Wang, Z. (2016). Norms of valence, arousal, concreteness, familiarity, imageability, and context availability for 1,100 Chinese words. <em>Behavior Research Methods, 49</em>(4), 1374–1385. <a href="https://doi.org/10.3758/s13428-016-0793-2">https://doi.org/10.3758/s13428-016-0793-2</a></p>
<p>Yuan, J., Zhang, Y., Chen, S., Luo, L., &amp; Ru, Y. (2021). The establishment of Chinese Emotion Regulation Word System (CERWS) and its pilot test. <em>Acta Psychologica Sinica, 53</em>(<em>5</em>), 445. <a href="https://doi.org/10.3724/sp.j.1041.2021.00445">https://doi.org/10.3724/sp.j.1041.2021.00445</a></p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>使用 Langchain-Chatchat 搭建本地知识库问答系统</title>
      <link>https://textdata.cn/blog/2024-01-31-langchain-chatchat/</link>
      <pubDate>Wed, 31 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-31-langchain-chatchat/</guid>
      <description>&lt;h2 id=&#34;一langchain-chatchat&#34;&gt;一、LangChain-Chatchat&lt;/h2&gt;
&lt;p&gt;基于 ChatGLM 等大语言模型与 Langchain 等应用框架实现，开源、可离线部署的检索增强生成(RAG)大模型知识库项目。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;咱们科研群体经年累月积累了大量文献阅读笔记，本地知识库特别适合这种场景。不过目前本地部署受限于电脑性能，使用场景有限，但不远的未来应该会出现一些收费的在线知识库应用。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;依托于本项目支持的开源 LLM 与 Embedding 模型，本项目可实现全部使用&lt;strong&gt;开源&lt;/strong&gt;模型&lt;strong&gt;离线私有部署&lt;/strong&gt;。与此同时，本项目也支持
OpenAI GPT API 的调用，并将在后续持续扩充对各类模型及模型 API 的接入。&lt;/p&gt;
&lt;p&gt;本项目实现原理如下图所示，过程包括 加载文件 -&amp;gt; 读取文本 -&amp;gt; 文本分割 -&amp;gt; 文本向量化 -&amp;gt; 问句向量化 -&amp;gt;
在文本向量中匹配出与问句向量最相似的 &lt;code&gt;top k&lt;/code&gt;个 -&amp;gt; 匹配出的文本作为上下文和问题一起添加到 &lt;code&gt;prompt&lt;/code&gt;中 -&amp;gt; 提交给 &lt;code&gt;LLM&lt;/code&gt;生成回答。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/langchain&amp;#43;chatglm.png&#34; alt=&#34;实现原理图&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;从文档处理角度来看，实现流程如下：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/langchain&amp;#43;chatglm2.png&#34; alt=&#34;实现原理图2&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二搭建步骤&#34;&gt;二、搭建步骤&lt;/h2&gt;
&lt;h3 id=&#34;21-环境配置&#34;&gt;2.1 环境配置&lt;/h3&gt;
&lt;p&gt;强烈推荐使用 Python3.11， 创建一个虚拟环境，并在虚拟环境内安装项目的依赖。需要注意&lt;strong&gt;电脑显存要大于12G&lt;/strong&gt;， 不然该项目跑不动。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 拉取仓库&lt;/span&gt;
&lt;span class=&#34;err&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;git&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;clone&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;https&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;//&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;github&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;com&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chatchat&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;space&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Langchain&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Chatchat&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;git&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 进入目录&lt;/span&gt;
&lt;span class=&#34;err&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cd&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Langchain&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Chatchat&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 安装全部依赖&lt;/span&gt;
&lt;span class=&#34;err&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;r&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requirements&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt; 
&lt;span class=&#34;err&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;r&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requirements_api&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt;
&lt;span class=&#34;err&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;r&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requirements_webui&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt;  

&lt;span class=&#34;c1&#34;&gt;# 默认依赖包括基本运行环境（FAISS向量库）。如果要使用 milvus/pg_vector 等向量库，请将 requirements.txt 中相应依赖取消注释再安装。&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-模型下载&#34;&gt;2.2 模型下载&lt;/h3&gt;
&lt;p&gt;如需在本地或离线环境下运行本项目，需要首先将项目所需的模型下载至本地，通常开源 LLM 与 Embedding 模型可以从 &lt;a href=&#34;https://huggingface.co/models&#34;&gt;HuggingFace&lt;/a&gt; 下载。&lt;/p&gt;
&lt;p&gt;以本项目中默认使用的 LLM 模型 &lt;a href=&#34;https://huggingface.co/THUDM/chatglm3-6b&#34;&gt;THUDM/ChatGLM3-6B&lt;/a&gt; 与 Embedding 模型 &lt;a href=&#34;https://huggingface.co/BAAI/bge-large-zh&#34;&gt;BAAI/bge-large-zh&lt;/a&gt; 为例：&lt;/p&gt;
&lt;p&gt;下载模型需要先&lt;a href=&#34;https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage&#34;&gt;安装 Git LFS&lt;/a&gt; ，然后运行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;$ git lfs install
$ git clone https://huggingface.co/THUDM/chatglm3-6b
$ git clone https://huggingface.co/BAAI/bge-large-zh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-初始化知识库和配置文件&#34;&gt;2.3 初始化知识库和配置文件&lt;/h3&gt;
&lt;p&gt;按照下列方式复制示例配置文件，并初始化自己的知识库&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;$ python copy_config_example.py
$ python init_database.py --recreate-vs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-一键启动&#34;&gt;2.4 一键启动&lt;/h3&gt;
&lt;p&gt;按照以下命令启动项目&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;$ python startup.py -a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;25-启动界面示例&#34;&gt;2.5 启动界面示例&lt;/h3&gt;
&lt;p&gt;如果正常启动，你将能看到以下界面&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/LLM_success.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/init_knowledge_base.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三-外包&#34;&gt;三、外包&lt;/h2&gt;
&lt;p&gt;如果电脑显存大于12G，不差钱但缺时间，可以在某鱼搜「langchain-chatchat」，配置费用大概100-200元。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/xianyu.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一langchain-chatchat">一、LangChain-Chatchat</h2>
<p>基于 ChatGLM 等大语言模型与 Langchain 等应用框架实现，开源、可离线部署的检索增强生成(RAG)大模型知识库项目。</p>
<blockquote>
<p>咱们科研群体经年累月积累了大量文献阅读笔记，本地知识库特别适合这种场景。不过目前本地部署受限于电脑性能，使用场景有限，但不远的未来应该会出现一些收费的在线知识库应用。</p>
</blockquote>
<p>依托于本项目支持的开源 LLM 与 Embedding 模型，本项目可实现全部使用<strong>开源</strong>模型<strong>离线私有部署</strong>。与此同时，本项目也支持
OpenAI GPT API 的调用，并将在后续持续扩充对各类模型及模型 API 的接入。</p>
<p>本项目实现原理如下图所示，过程包括 加载文件 -&gt; 读取文本 -&gt; 文本分割 -&gt; 文本向量化 -&gt; 问句向量化 -&gt;
在文本向量中匹配出与问句向量最相似的 <code>top k</code>个 -&gt; 匹配出的文本作为上下文和问题一起添加到 <code>prompt</code>中 -&gt; 提交给 <code>LLM</code>生成回答。</p>
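<p>这条"向量化 → 相似度匹配 top k → 拼接 prompt"的流程可以用纯 Python 勾勒（纯属示意：embed 用字符频次向量代替真实的 bge-large-zh 等 Embedding 模型，语料与问句均为假设数据）：</p>

```python
import math
from collections import Counter

def embed(text):
    """极简"向量化"：字符频次向量（真实系统应替换为 Embedding 模型）"""
    return Counter(text)

def cosine(a, b):
    """两个稀疏频次向量的余弦相似度"""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """在文本分割得到的 chunks 中，匹配与问句最相似的 top_k 个"""
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:top_k]

def build_prompt(query, contexts):
    """匹配出的文本作为上下文，和问题一起拼入 prompt，再提交给 LLM"""
    return f"已知信息：\n" + "\n".join(contexts) + f"\n\n请根据已知信息回答问题：{query}"

chunks = ["知识库支持离线部署", "词向量用于语义检索", "今天天气晴朗"]
query = "如何离线部署知识库？"
print(build_prompt(query, retrieve(query, chunks)))
```

<p>Langchain-Chatchat 中对应的真实组件是 FAISS 向量库与 Embedding 模型，但检索增强的逻辑骨架与此一致。</p>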
<p><img loading="lazy" src="img/langchain&#43;chatglm.png" alt="实现原理图"  />
</p>
<p>从文档处理角度来看，实现流程如下：</p>
<p><img loading="lazy" src="img/langchain&#43;chatglm2.png" alt="实现原理图2"  />
</p>
<p><br><br></p>
<h2 id="二搭建步骤">二、搭建步骤</h2>
<h3 id="21-环境配置">2.1 环境配置</h3>
<p>强烈推荐使用 Python3.11， 创建一个虚拟环境，并在虚拟环境内安装项目的依赖。需要注意<strong>电脑显存要大于12G</strong>， 不然该项目跑不动。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 拉取仓库</span>
<span class="err">$</span> <span class="n">git</span> <span class="n">clone</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">chatchat</span><span class="o">-</span><span class="n">space</span><span class="o">/</span><span class="n">Langchain</span><span class="o">-</span><span class="n">Chatchat</span><span class="o">.</span><span class="n">git</span>

<span class="c1"># 进入目录</span>
<span class="err">$</span> <span class="n">cd</span> <span class="n">Langchain</span><span class="o">-</span><span class="n">Chatchat</span>

<span class="c1"># 安装全部依赖</span>
<span class="err">$</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="n">requirements</span><span class="o">.</span><span class="n">txt</span> 
<span class="err">$</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="n">requirements_api</span><span class="o">.</span><span class="n">txt</span>
<span class="err">$</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="n">requirements_webui</span><span class="o">.</span><span class="n">txt</span>  

<span class="c1"># 默认依赖包括基本运行环境（FAISS向量库）。如果要使用 milvus/pg_vector 等向量库，请将 requirements.txt 中相应依赖取消注释再安装。</span>

</code></pre></div><br>
<h3 id="22-模型下载">2.2 模型下载</h3>
<p>如需在本地或离线环境下运行本项目，需要首先将项目所需的模型下载至本地，通常开源 LLM 与 Embedding 模型可以从 <a href="https://huggingface.co/models">HuggingFace</a> 下载。</p>
<p>以本项目中默认使用的 LLM 模型 <a href="https://huggingface.co/THUDM/chatglm3-6b">THUDM/ChatGLM3-6B</a> 与 Embedding 模型 <a href="https://huggingface.co/BAAI/bge-large-zh">BAAI/bge-large-zh</a> 为例：</p>
<p>下载模型需要先<a href="https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage">安装 Git LFS</a> ，然后运行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">$ git lfs install
$ git clone https://huggingface.co/THUDM/chatglm3-6b
$ git clone https://huggingface.co/BAAI/bge-large-zh
</code></pre></div><br>
<h3 id="23-初始化知识库和配置文件">2.3 初始化知识库和配置文件</h3>
<p>按照下列方式复制示例配置文件，并初始化自己的知识库</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">$ python copy_config_example.py
$ python init_database.py --recreate-vs
</code></pre></div><br>
<h3 id="24-一键启动">2.4 一键启动</h3>
<p>按照以下命令启动项目</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">$ python startup.py -a
</code></pre></div><h3 id="25-启动界面示例">2.5 启动界面示例</h3>
<p>如果正常启动，你将能看到以下界面</p>
<p><img loading="lazy" src="img/LLM_success.png" alt=""  />
</p>
<p><img loading="lazy" src="img/init_knowledge_base.jpg" alt=""  />
</p>
<p><br><br></p>
<h2 id="三-外包">三、外包</h2>
<p>如果电脑显存大于12G，不差钱但缺时间，可以在某鱼搜「langchain-chatchat」，配置费用大概100-200元。</p>
<p><img loading="lazy" src="img/xianyu.jpg" alt=""  />
</p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 使用umap对200维词向量的进行降维和可视化</title>
      <link>https://textdata.cn/blog/2024-01-23-umap/</link>
      <pubDate>Tue, 23 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-23-umap/</guid>
      <description>&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;UMAP（Uniform Manifold Approximation and Projection for Dimension Reduction）是一种非线性降维技术，类似于t-SNE、PCA，可用于可视化。在降维应用中， 相比于t-SNE，umap既快又准。&lt;/p&gt;
&lt;p&gt;如果对 UMAP算法感兴趣，可以阅读论文&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv &lt;a href=&#34;https://www.zhihu.com/search?q=e-prints&amp;amp;search_source=Entity&amp;amp;hybrid_search_source=Entity&amp;amp;hybrid_search_extra=%7B%22sourceType%22%3A%22article%22%2C%22sourceId%22%3A%22109584077%22%7D&#34;&gt;e-prints&lt;/a&gt; 1802.03426, 2018&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二准备数据&#34;&gt;二、准备数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;p&gt;我从 &lt;a href=&#34;https://textdata.cn/blog/2023-12-14-daily-news-dataset/&#34;&gt;&lt;strong&gt;人民日报(1946-2023.12.18)&lt;/strong&gt;&lt;/a&gt; 训练的 word2vec模型 中， 选出了100个词的词向量，构建得到了 &lt;a href=&#34;data.csv.gz&#34;&gt;&lt;strong&gt;data.csv.gz&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;word:  词语，一共有100个&lt;/li&gt;
&lt;li&gt;category: 词语的类别， 一共五种(亲人、环保、研发、国王、数字化)&lt;/li&gt;
&lt;li&gt;f1,f2,f3,&amp;hellip;,f200  词向量的200维（每个词语的词向量是200维的向量）&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-查看词语对应类别&#34;&gt;2.2 查看词语&amp;amp;对应类别&lt;/h3&gt;
&lt;p&gt;大邓准备了五类词， 每类词20个词， 词语类别按顺序依次是 &lt;strong&gt;亲人、环保、研发、国王、数字&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;word&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tolist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;爸爸&amp;#39;, &amp;#39;姐姐&amp;#39;, &amp;#39;奶奶&amp;#39;, &amp;#39;女儿&amp;#39;, &amp;#39;外公&amp;#39;, &amp;#39;哥哥&amp;#39;, &amp;#39;儿子&amp;#39;, &amp;#39;祖母&amp;#39;, &amp;#39;父母亲&amp;#39;, &amp;#39;外婆&amp;#39;, &amp;#39;妹妹&amp;#39;, &amp;#39;孙女&amp;#39;, &amp;#39;姥爷&amp;#39;, &amp;#39;小女儿&amp;#39;, &amp;#39;姥姥&amp;#39;, &amp;#39;二姐&amp;#39;, &amp;#39;姑姑&amp;#39;, &amp;#39;弟弟&amp;#39;, &amp;#39;弟弟妹妹&amp;#39;, &amp;#39;爸爸妈妈&amp;#39;, &amp;#39;低碳&amp;#39;, &amp;#39;节能&amp;#39;, &amp;#39;环境保护&amp;#39;, &amp;#39;绿色环保&amp;#39;, &amp;#39;节能降耗&amp;#39;, &amp;#39;环保节能&amp;#39;, &amp;#39;生态环保&amp;#39;, &amp;#39;节能环保&amp;#39;, &amp;#39;节能低碳&amp;#39;, &amp;#39;绿色低碳&amp;#39;, &amp;#39;减排&amp;#39;, &amp;#39;绿色发展&amp;#39;, &amp;#39;保护环境&amp;#39;, &amp;#39;清洁生产&amp;#39;, &amp;#39;建筑节能&amp;#39;, &amp;#39;环境治理&amp;#39;, &amp;#39;减碳&amp;#39;, &amp;#39;循环经济&amp;#39;, &amp;#39;低碳环保&amp;#39;, &amp;#39;治理污染&amp;#39;, &amp;#39;科研开发&amp;#39;, &amp;#39;科技研发&amp;#39;, &amp;#39;科研创新&amp;#39;, &amp;#39;研发创新&amp;#39;, &amp;#39;技术创新&amp;#39;, &amp;#39;技术开发&amp;#39;, &amp;#39;技术研发&amp;#39;, &amp;#39;产品开发&amp;#39;, &amp;#39;产品研发&amp;#39;, &amp;#39;原始创新&amp;#39;, &amp;#39;科技创新&amp;#39;, &amp;#39;研究开发&amp;#39;, &amp;#39;新药研发&amp;#39;, &amp;#39;核心技术研发&amp;#39;, &amp;#39;产学研结合&amp;#39;, &amp;#39;科技开发&amp;#39;, &amp;#39;基础研究&amp;#39;, &amp;#39;新产品开发&amp;#39;, &amp;#39;研发成果&amp;#39;, &amp;#39;科研成果产业化&amp;#39;, &amp;#39;二世&amp;#39;, &amp;#39;王储&amp;#39;, &amp;#39;公主&amp;#39;, &amp;#39;女王&amp;#39;, &amp;#39;王妃&amp;#39;, &amp;#39;陛下&amp;#39;, &amp;#39;王宫&amp;#39;, &amp;#39;王室&amp;#39;, &amp;#39;王室成员&amp;#39;, &amp;#39;皇室成员&amp;#39;, &amp;#39;登基&amp;#39;, &amp;#39;六世&amp;#39;, &amp;#39;继承王位&amp;#39;, &amp;#39;五世&amp;#39;, &amp;#39;摄政王&amp;#39;, &amp;#39;七世&amp;#39;, &amp;#39;英国女王&amp;#39;, &amp;#39;三世&amp;#39;, &amp;#39;四世&amp;#39;, 
&amp;#39;继位&amp;#39;, &amp;#39;人工智能技术&amp;#39;, &amp;#39;AI&amp;#39;, &amp;#39;数字技术&amp;#39;, &amp;#39;虚拟现实&amp;#39;, &amp;#39;云计算&amp;#39;, &amp;#39;万物互联&amp;#39;, &amp;#39;信息技术&amp;#39;, &amp;#39;语音技术&amp;#39;, &amp;#39;物联网&amp;#39;, &amp;#39;智能硬件&amp;#39;, &amp;#39;5G技术&amp;#39;, &amp;#39;IoT&amp;#39;, &amp;#39;智能应用&amp;#39;, &amp;#39;软件技术&amp;#39;, &amp;#39;融合应用&amp;#39;, &amp;#39;6G&amp;#39;, &amp;#39;人工智能机器人&amp;#39;, &amp;#39;数据应用&amp;#39;, &amp;#39;人工智能应用&amp;#39;, &amp;#39;智能&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;词语对应的类别&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tolist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span 
class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验代码&#34;&gt;3. Experiment Code&lt;/h2&gt;
&lt;h3 id=&#34;31-环境准备&#34;&gt;3.1 Environment Setup&lt;/h3&gt;
&lt;p&gt;Install the libraries this article needs from the &lt;em&gt;&lt;strong&gt;cmd (terminal)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install umap-learn
pip3 install datashader bokeh holoviews  # libraries the visualization may need
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-降维&#34;&gt;3.2 Dimensionality Reduction&lt;/h3&gt;
&lt;p&gt;Compress the word-vector data of the 100 words from 200 dimensions down to 2&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;umap&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;word_emb_redution_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;umap&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;UMAP&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;n_neighbors&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#default; no need to change&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;min_dist&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#default; no need to change&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;n_components&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#project down to 2 dimensions&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;random_state&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;666&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#fixed seed so the result is reproducible across runs and machines&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iloc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;word_emb_redution_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-umap.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-静态可视化&#34;&gt;3.3 Static Visualization&lt;/h3&gt;
&lt;p&gt;Draw a static plot (no mouse interaction); under the hood this likely calls &lt;em&gt;&lt;strong&gt;matplotlib&lt;/strong&gt;&lt;/em&gt;. Because the experimental data are Chinese words, the labels may render as garbled characters. To avoid this, run the following code first&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# detect the operating system&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the global font&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Draw a static plot (no mouse interaction) of &lt;strong&gt;the five categories of word vectors projected into 2-D space&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;umap.plot&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;umap&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;points&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word_emb_redution_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;category&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;width&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;800&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;height&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;500&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;五类词的词向量投射到2维空间中的可视化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;34-动态交互可视化&#34;&gt;3.4 Interactive Visualization&lt;/h3&gt;
&lt;p&gt;umap.plot has bokeh-based interactivity built in; first construct the information shown when hovering the mouse over a point&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mapper&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;亲人&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;环保&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;研发&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;国王&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;数字化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;hover_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;index&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                           &lt;span class=&#34;s1&#34;&gt;&amp;#39;item&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; 
                           &lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;map&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mapper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)})&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;hover_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;The following code generates an HTML file. Because the effect is interactive, it cannot be fully displayed on the blog (or the WeChat account); to view it, download it via the link&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-23-umap/umap_interactive.html&#34;&gt;https://textdata.cn/blog/2024-01-23-umap/umap_interactive.html&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;p&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;umap&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;interactive&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word_emb_redution_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                          &lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;category&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                          &lt;span class=&#34;n&#34;&gt;hover_data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hover_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                          &lt;span class=&#34;n&#34;&gt;point_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                          &lt;span class=&#34;n&#34;&gt;width&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;800&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                          &lt;span class=&#34;n&#34;&gt;height&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;500&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;umap&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;p&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-interactive.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
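&lt;p&gt;&lt;code&gt;umap.plot.show&lt;/code&gt; opens the figure directly; to write the standalone HTML file linked above, bokeh&#39;s &lt;code&gt;output_file&lt;/code&gt;/&lt;code&gt;save&lt;/code&gt; can be applied to the figure that &lt;code&gt;umap.plot.interactive&lt;/code&gt; returns. A sketch with a stand-in bokeh figure (in the article, &lt;code&gt;p&lt;/code&gt; comes from the cell above):&lt;/p&gt;

```python
from bokeh.plotting import figure
from bokeh.io import output_file, save

# stand-in figure; in the article `p` is returned by umap.plot.interactive
p = figure(width=800, height=500)
p.scatter([1, 2, 3], [4, 5, 6], size=5)

output_file('umap_interactive.html', title='umap interactive')
save(p)  # writes a standalone HTML file that opens in any browser
```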
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;四下载资料&#34;&gt;4. Downloads&lt;/h2&gt;
&lt;p&gt;Click to download the experimental data  &lt;a href=&#34;data.csv.gz&#34;&gt;&lt;strong&gt;data.csv.gz&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
<content:encoded><![CDATA[<h2 id="一介绍">1. Introduction</h2>
<p>UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) is a nonlinear dimensionality-reduction technique in the same family as t-SNE and PCA, and it can likewise be used for visualization. For dimensionality reduction, UMAP is typically both faster than t-SNE and at least as good at preserving structure.</p>
<p>If you are interested in the UMAP algorithm itself, see the paper</p>
<blockquote>
<p>McInnes, L., Healy, J., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018</p>
</blockquote>
<br>
<br>
<h2 id="二准备数据">2. Preparing the Data</h2>
<h3 id="21-读取数据">2.1 Reading the Data</h3>
<p>From a word2vec model trained on <a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/"><strong>People&#39;s Daily (人民日报, 1946-2023.12.18)</strong></a>, I selected the word vectors of 100 words and assembled them into <a href="data.csv.gz"><strong>data.csv.gz</strong></a></p>
<ul>
<li>word: the word; 100 in total</li>
<li>category: the word&#39;s category, one of five (亲人 family, 环保 environmental protection, 研发 R&amp;D, 国王 royalty, 数字化 digitalization)</li>
<li>f1, f2, f3, &hellip;, f200: the 200 dimensions of the word vector (each word&#39;s vector is 200-dimensional)</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-查看词语对应类别">2.2 Inspect the Words &amp; Their Categories</h3>
<p>大邓 prepared five categories of words, 20 words per category; in order, the categories are <strong>亲人 (family), 环保 (environmental protection), 研发 (R&amp;D), 国王 (royalty), 数字化 (digitalization)</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;word&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;爸爸&#39;, &#39;姐姐&#39;, &#39;奶奶&#39;, &#39;女儿&#39;, &#39;外公&#39;, &#39;哥哥&#39;, &#39;儿子&#39;, &#39;祖母&#39;, &#39;父母亲&#39;, &#39;外婆&#39;, &#39;妹妹&#39;, &#39;孙女&#39;, &#39;姥爷&#39;, &#39;小女儿&#39;, &#39;姥姥&#39;, &#39;二姐&#39;, &#39;姑姑&#39;, &#39;弟弟&#39;, &#39;弟弟妹妹&#39;, &#39;爸爸妈妈&#39;, &#39;低碳&#39;, &#39;节能&#39;, &#39;环境保护&#39;, &#39;绿色环保&#39;, &#39;节能降耗&#39;, &#39;环保节能&#39;, &#39;生态环保&#39;, &#39;节能环保&#39;, &#39;节能低碳&#39;, &#39;绿色低碳&#39;, &#39;减排&#39;, &#39;绿色发展&#39;, &#39;保护环境&#39;, &#39;清洁生产&#39;, &#39;建筑节能&#39;, &#39;环境治理&#39;, &#39;减碳&#39;, &#39;循环经济&#39;, &#39;低碳环保&#39;, &#39;治理污染&#39;, &#39;科研开发&#39;, &#39;科技研发&#39;, &#39;科研创新&#39;, &#39;研发创新&#39;, &#39;技术创新&#39;, &#39;技术开发&#39;, &#39;技术研发&#39;, &#39;产品开发&#39;, &#39;产品研发&#39;, &#39;原始创新&#39;, &#39;科技创新&#39;, &#39;研究开发&#39;, &#39;新药研发&#39;, &#39;核心技术研发&#39;, &#39;产学研结合&#39;, &#39;科技开发&#39;, &#39;基础研究&#39;, &#39;新产品开发&#39;, &#39;研发成果&#39;, &#39;科研成果产业化&#39;, &#39;二世&#39;, &#39;王储&#39;, &#39;公主&#39;, &#39;女王&#39;, &#39;王妃&#39;, &#39;陛下&#39;, &#39;王宫&#39;, &#39;王室&#39;, &#39;王室成员&#39;, &#39;皇室成员&#39;, &#39;登基&#39;, &#39;六世&#39;, &#39;继承王位&#39;, &#39;五世&#39;, &#39;摄政王&#39;, &#39;七世&#39;, &#39;英国女王&#39;, &#39;三世&#39;, &#39;四世&#39;, &#39;继位&#39;, &#39;人工智能技术&#39;, &#39;AI&#39;, &#39;数字技术&#39;, &#39;虚拟现实&#39;, &#39;云计算&#39;, &#39;万物互联&#39;, &#39;信息技术&#39;, &#39;语音技术&#39;, &#39;物联网&#39;, &#39;智能硬件&#39;, &#39;5G技术&#39;, &#39;IoT&#39;, &#39;智能应用&#39;, &#39;软件技术&#39;, &#39;融合应用&#39;, &#39;6G&#39;, &#39;人工智能机器人&#39;, &#39;数据应用&#39;, &#39;人工智能应用&#39;, &#39;智能&#39;]
</code></pre></div><br>
<p>The category corresponding to each word</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;category&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="p">[</span><span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;亲人&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span 
class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span 
class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span 
class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">]</span>
</code></pre></div><p><br><br></p>
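<p>上面的输出表明数据共含五类、每类 20 个词。下面用标准库 <code>collections.Counter</code> 做一个小验证示例(独立于上文的 df，类别列表为手工构造的示意数据)：</p>

```python
from collections import Counter

# 示意数据: 与上文输出结构相同的类别列表(每类 20 个词)
categories = (['亲人'] * 20 + ['环保'] * 20 + ['研发'] * 20
              + ['国王'] * 20 + ['数字化'] * 20)

counts = Counter(categories)
print(counts)   # 共 5 类, 每类 20 个, 合计 100 个词
```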
<h2 id="三实验代码">三、实验代码</h2>
<h3 id="31-环境准备">3.1 环境准备</h3>
<p>在 <em><strong>cmd(terminal)</strong></em> 中安装本文需要的库</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install umap-learn
pip3 install datashader bokeh holoviews  #可视化可能会用到的库
</code></pre></div><br>
<h3 id="32-降维">3.2 降维</h3>
<p>将 100 个词的词向量数据从 200 维压缩到 2 维</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">umap</span>

<span class="n">word_emb_redution_data</span> <span class="o">=</span> <span class="n">umap</span><span class="o">.</span><span class="n">UMAP</span><span class="p">(</span>
    <span class="n">n_neighbors</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span>  <span class="c1">#默认值，保持即可</span>
    <span class="n">min_dist</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">,</span> <span class="c1">#默认值，保持即可</span>
    <span class="n">n_components</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="c1">#降到2维</span>
    <span class="n">random_state</span> <span class="o">=</span> <span class="mi">666</span><span class="p">,</span> <span class="c1">#固定随机种子，保证每次运行结果一致</span>
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">:])</span>

<span class="n">word_emb_redution_data</span>
</code></pre></div><p><img loading="lazy" src="img/02-umap.png" alt=""  />
</p>
<br>
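<p>UMAP 属于非线性降维。若只想先验证「高维 → 2 维」的数据流向，可以用 numpy 的 SVD 写一个最简的 PCA 草图(仅作原理示意，并非 UMAP 本身；假设环境已安装 numpy，数据为随机生成)：</p>

```python
import numpy as np

rng = np.random.default_rng(666)
X = rng.normal(size=(100, 200))          # 模拟 100 个词、每词 200 维的词向量

Xc = X - X.mean(axis=0)                  # 中心化
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_2d = Xc @ Vt[:2].T                     # 投影到前两个主成分

print(X_2d.shape)                        # (100, 2)
```

<p>真实实验中应把这里的随机矩阵换成词向量数据，或直接用上文的 umap.UMAP。</p>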
<h3 id="33-静态可视化">3.3 静态可视化</h3>
<p>绘制静态图(无鼠标交互)， 底层应是调用了 <em><strong>matplotlib</strong></em> 。 因为实验数据是中文词语， 可视化时可能会出现乱码。为避免该问题， 请先运行以下代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
</code></pre></div><br>
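<p>上面「按操作系统选中文字体」的逻辑可以抽成一个独立的小函数(以下字体名只是各系统常见的默认选择，并非唯一可用字体)：</p>

```python
import platform

# 各操作系统常见的中文字体(示例映射，可按需替换)
FONT_BY_OS = {
    'Windows': 'SimHei',
    'Darwin': 'Arial Unicode MS',   # macOS
    'Linux': 'sans-serif',
}

def pick_font(system=None):
    """根据操作系统返回中文字体名，未知系统回退到 sans-serif。"""
    system = system or platform.system()
    return FONT_BY_OS.get(system, 'sans-serif')

print(pick_font())
```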
<p>绘制 <strong>五类词的词向量投射到2维空间中的可视化</strong> 的静态图(没有鼠标交互)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">umap.plot</span>

<span class="n">umap</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">points</span><span class="p">(</span><span class="n">word_emb_redution_data</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">category</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">500</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;五类词的词向量投射到2维空间中的可视化&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-plot.png" alt=""  />
</p>
<br>
<h3 id="34-动态交互可视化">3.4 动态交互可视化</h3>
<p>umap.plot 内置了 bokeh 的动态交互功能， 使用前需要先构造鼠标悬浮时显示的信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">mapper</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;亲人&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;环保&#39;</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;研发&#39;</span><span class="p">:</span><span class="mi">3</span><span class="p">,</span> <span class="s1">&#39;国王&#39;</span><span class="p">:</span><span class="mi">4</span><span class="p">,</span> <span class="s1">&#39;数字化&#39;</span><span class="p">:</span><span class="mi">5</span> <span class="p">}</span>


<span class="n">hover_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;index&#39;</span><span class="p">:</span><span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="p">,</span>
                           <span class="s1">&#39;item&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;category&#39;</span><span class="p">],</span> 
                           <span class="s1">&#39;label&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;category&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">mapper</span><span class="p">)})</span>

<span class="n">hover_data</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
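<p>mapper 的「类别 → 编号」这一步并不依赖 pandas，用纯 Python 也能表达同样的逻辑(以下为独立示意，category 列表为手工构造)：</p>

```python
mapper = {'亲人': 1, '环保': 2, '研发': 3, '国王': 4, '数字化': 5}

categories = ['亲人', '环保', '研发', '国王', '数字化']
labels = [mapper[c] for c in categories]   # 等价于 df['category'].map(mapper)

# 构造与 hover_data 相同结构的记录
hover_records = [
    {'index': i, 'item': c, 'label': l}
    for i, (c, l) in enumerate(zip(categories, labels))
]
print(hover_records[0])   # {'index': 0, 'item': '亲人', 'label': 1}
```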
<p>接下来的代码会生成一个 html 文件。 因为是动态交互效果，在博客(公众号)中无法完全展示， 如想查看，可以点击链接下载</p>
<p><a href="https://textdata.cn/blog/2024-01-23-umap/umap_interactive.html">https://textdata.cn/blog/2024-01-23-umap/umap_interactive.html</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">p</span> <span class="o">=</span> <span class="n">umap</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">interactive</span><span class="p">(</span><span class="n">word_emb_redution_data</span><span class="p">,</span> 
                          <span class="n">labels</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">category</span><span class="p">,</span> 
                          <span class="n">hover_data</span><span class="o">=</span><span class="n">hover_data</span><span class="p">,</span> 
                          <span class="n">point_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> 
                          <span class="n">width</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span> 
                          <span class="n">height</span><span class="o">=</span><span class="mi">500</span><span class="p">)</span>
<span class="n">umap</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-interactive.png" alt=""  />
</p>
<br>
<br>
<h2 id="四下载资料">四、下载资料</h2>
<p>点击下载实验数据  <a href="data.csv.gz"><strong>data.csv.gz</strong></a></p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>opencc | 中文简体、繁体转换库</title>
      <link>https://textdata.cn/blog/2024-01-21-chinese-traditional-to-simplified-text/</link>
      <pubDate>Sun, 21 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-21-chinese-traditional-to-simplified-text/</guid>
      <description>&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;opencc-python是中文简体、繁体转换库， 可以进行简转繁、繁转简、杂转简、杂转繁等操作。&lt;/p&gt;
&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;t2s：繁体中文转简体中文&lt;/li&gt;
&lt;li&gt;s2t：简体中文转繁体中文&lt;/li&gt;
&lt;li&gt;hk2s：繁体中文（香港标准）至简体中文&lt;/li&gt;
&lt;li&gt;s2hk：简体中文转繁体中文（香港标准）&lt;/li&gt;
&lt;li&gt;s2tw：简体中文转繁体中文（台湾标准）&lt;/li&gt;
&lt;li&gt;s2twp：简体中文转繁体中文（台湾标准，带短语）&lt;/li&gt;
&lt;li&gt;t2hk：繁体中文转繁体中文（香港标准）&lt;/li&gt;
&lt;li&gt;t2tw：繁体中文转繁体中文（台湾标准）&lt;/li&gt;
&lt;li&gt;tw2s：繁体中文（台湾标准）到简体中文&lt;/li&gt;
&lt;li&gt;tw2sp：繁体中文（台湾标准）到简体中文（带短语）&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二安装&#34;&gt;二、安装&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install opencc-python-reimplemented
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三快速上手&#34;&gt;三、快速上手&lt;/h2&gt;
&lt;h3 id=&#34;31-繁to简&#34;&gt;3.1 繁to简&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;opencc&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenCC&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cc&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenCC&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;t2s&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#繁体2简体&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;簡體漢字&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;#39;简体汉字&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-简to繁&#34;&gt;3.2 简to繁&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;opencc&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenCC&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cc&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenCC&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;s2t&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 简体2繁体&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;简体汉字&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;#39;簡體漢字&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一介绍">一、介绍</h2>
<p>opencc-python是中文简体、繁体转换库， 可以进行简转繁、繁转简、杂转简、杂转繁等操作。</p>
<br>
<ul>
<li>t2s：繁体中文转简体中文</li>
<li>s2t：简体中文转繁体中文</li>
<li>hk2s：繁体中文（香港标准）至简体中文</li>
<li>s2hk：简体中文转繁体中文（香港标准）</li>
<li>s2tw：简体中文转繁体中文（台湾标准）</li>
<li>s2twp：简体中文转繁体中文（台湾标准，带短语）</li>
<li>t2hk：繁体中文转繁体中文（香港标准）</li>
<li>t2tw：繁体中文转繁体中文（台湾标准）</li>
<li>tw2s：繁体中文（台湾标准）到简体中文</li>
<li>tw2sp：繁体中文（台湾标准）到简体中文（带短语）</li>
</ul>
<br>
<br>
<h2 id="二安装">二、安装</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install opencc-python-reimplemented
</code></pre></div><p><br><br></p>
<h2 id="三快速上手">三、快速上手</h2>
<h3 id="31-繁to简">3.1 繁to简</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">opencc</span> <span class="kn">import</span> <span class="n">OpenCC</span>
<span class="n">cc</span> <span class="o">=</span> <span class="n">OpenCC</span><span class="p">(</span><span class="s1">&#39;t2s&#39;</span><span class="p">)</span>  <span class="c1">#繁体2简体</span>

<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;簡體漢字&#39;</span>
<span class="n">cc</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;简体汉字&#39;
</code></pre></div><br>
<h3 id="32-简to繁">3.2 简to繁</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">opencc</span> <span class="kn">import</span> <span class="n">OpenCC</span>
<span class="n">cc</span> <span class="o">=</span> <span class="n">OpenCC</span><span class="p">(</span><span class="s1">&#39;s2t&#39;</span><span class="p">)</span>  <span class="c1"># 简体2繁体</span>
<span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;简体汉字&#39;</span>
<span class="n">cc</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;簡體漢字&#39;
</code></pre></div><br>
<br>
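<p>opencc 的 convert 本质上是基于映射表的字符/词组替换。下面用 <code>str.translate</code> 写一个只含几个字的玩具版，帮助理解原理(仅为示意，并非 OpenCC 的真实实现，实际转换请使用 opencc-python-reimplemented)：</p>

```python
# 玩具版繁→简映射表(真实 OpenCC 的词典远大于此, 且支持词组级转换)
T2S_TABLE = str.maketrans({'簡': '简', '體': '体', '漢': '汉'})

def toy_t2s(text):
    """逐字繁转简, 表中没有的字符原样保留。"""
    return text.translate(T2S_TABLE)

print(toy_t2s('簡體漢字'))   # '简体汉字'
```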
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 使用 DataMapPlot 绘制数据地图</title>
      <link>https://textdata.cn/blog/2024-01-21-datamapplot/</link>
      <pubDate>Sun, 21 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-21-datamapplot/</guid>
      <description>&lt;p&gt;DataMapPlot库可绘制漂亮的数据地图，以便应用于演示文稿、海报和论文中。重点是用尽可能少的工作量生成美观的静态图， 您只需在数据地图中标记点簇。虽然这涉及到大多数美学选择的自动化，但该库提供了多种方法来根据您的需求定制结果图。&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一安装&#34;&gt;一、安装&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install datamapplot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二准备数据&#34;&gt;二、准备数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取arxivcsvgz&#34;&gt;2.1 读取arxiv.csv.gz&lt;/h3&gt;
&lt;p&gt;点击下载 &lt;a href=&#34;arxiv.csv.gz&#34;&gt;&lt;strong&gt;arxiv.csv.gz&lt;/strong&gt;&lt;/a&gt; , 该数据有 &lt;em&gt;&lt;strong&gt;x1&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;x2&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;label&lt;/strong&gt;&lt;/em&gt; 三个字段，其中&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;x1、x2是降维后的特征信息，常见的降维算法有pca、UMAP, t-SNE等&lt;/li&gt;
&lt;li&gt;label是标注(类别)信息&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;arxiv.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-录入logo&#34;&gt;2.2 录入logo&lt;/h3&gt;
&lt;p&gt;使用PIL读取 &lt;a href=&#34;arxiv_logo.png&#34;&gt;&lt;em&gt;&lt;strong&gt;arxiv_logo.png(点击下载该图片)&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;，并转化为array数组型数据。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/arxiv_logo.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;arxiv_logo&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PIL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;arxiv_logo.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三绘图&#34;&gt;三、绘图&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;datamapplot&lt;/span&gt;



&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;arxiv.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data_map_coords&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;array&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;x1&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;x2&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;arxiv_logo&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PIL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;arxiv_logo.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;highlight_labels&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;  &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Clustering&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;s2&#34;&gt;&amp;#34;Manifold learning and dimension reduction&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;s2&#34;&gt;&amp;#34;Active learning&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                     &lt;span class=&#34;s2&#34;&gt;&amp;#34;Topic modelling and text classification&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;datamapplot&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;create_plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;data_map_coords&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
    &lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ArXiv ML Landscape&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;sub_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;A data map of papers from the Machine Learning section of ArXiv&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;highlight_labels&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;highlight_labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;label_font_size&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;highlight_label_keywords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;fontsize&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;fontweight&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;bold&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;bbox&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;boxstyle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;circle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;pad&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.75&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;logo&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arxiv_logo&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;arxiv_white.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/arxiv_white.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三gallery&#34;&gt;三、Gallery&lt;/h2&gt;
&lt;p&gt;更多内容，可阅读文档  &lt;a href=&#34;https://github.com/TutteInstitute/datamapplot&#34;&gt;DataMapPlot: &lt;/a&gt;  &lt;a href=&#34;https://github.com/TutteInstitute/datamapplot&#34;&gt;https://github.com/TutteInstitute/datamapplot&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot_arxiv_ml.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot_wikipedia.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<p>The DataMapPlot library draws attractive data maps for use in presentations, posters, and papers. Its focus is producing beautiful static plots with as little effort as possible: all you need is a data map with labeled clusters of points. While this means most aesthetic choices are automated, the library offers many ways to customize the resulting plot to your needs.</p>
<br>
<h2 id="一安装">1. Installation</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">pip3</span> <span class="n">install</span> <span class="n">datamapplot</span>
</code></pre></div><p><br><br></p>
<h2 id="二准备数据">2. Preparing the Data</h2>
<h3 id="21-读取arxivcsvgz">2.1 Reading arxiv.csv.gz</h3>
<p>Click to download <a href="arxiv.csv.gz"><strong>arxiv.csv.gz</strong></a>. The data has three fields, <em><strong>x1</strong></em>, <em><strong>x2</strong></em>, and <em><strong>label</strong></em>, where</p>
<ul>
<li>x1 and x2 are features after dimensionality reduction; common dimensionality-reduction algorithms include PCA, UMAP, and t-SNE</li>
<li>label is the annotation (category) information</li>
</ul>
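<p>As a minimal sketch of how such 2-D coordinates can be produced from high-dimensional features (illustrative only; the download already contains x1 and x2, and the random matrix below is a stand-in for real document embeddings), here is a PCA projection via numpy's SVD:</p>

```python
import numpy as np

# Stand-in for high-dimensional document embeddings (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# PCA via SVD: center the data, then project onto the top-2 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # shape (100, 2); the columns play the role of x1, x2
print(coords.shape)  # (100, 2)
```

In practice UMAP or t-SNE usually gives better-separated clusters for data maps; PCA is shown here only because it needs nothing beyond numpy.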
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;arxiv.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="22-录入logo">2.2 Loading the logo</h3>
<p>Use PIL to read <a href="arxiv_logo.png"><em><strong>arxiv_logo.png (click to download)</strong></em></a> and convert it into a numpy array.</p>
<p><img loading="lazy" src="img/arxiv_logo.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">PIL</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="n">arxiv_logo</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">PIL</span><span class="o">.</span><span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">&#39;arxiv_logo.png&#39;</span><span class="p">))</span>
</code></pre></div><p><br><br></p>
<h3 id="三绘图">2.3 Plotting</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">PIL</span>
<span class="kn">import</span> <span class="nn">datamapplot</span>



<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;arxiv.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">data_map_coords</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s1">&#39;x1&#39;</span><span class="p">,</span> <span class="s1">&#39;x2&#39;</span><span class="p">]]),</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;label&#39;</span><span class="p">]</span>
<span class="n">arxiv_logo</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">PIL</span><span class="o">.</span><span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">&#39;arxiv_logo.png&#39;</span><span class="p">))</span>
<span class="n">highlight_labels</span> <span class="o">=</span>  <span class="p">[</span><span class="s2">&#34;Clustering&#34;</span><span class="p">,</span>
                     <span class="s2">&#34;Manifold learning and dimension reduction&#34;</span><span class="p">,</span>
                     <span class="s2">&#34;Active learning&#34;</span><span class="p">,</span>
                     <span class="s2">&#34;Topic modelling and text classification&#34;</span><span class="p">]</span>


<span class="n">datamapplot</span><span class="o">.</span><span class="n">create_plot</span><span class="p">(</span>
    <span class="n">data_map_coords</span><span class="p">,</span> 
    <span class="n">labels</span><span class="p">,</span>
    <span class="n">title</span> <span class="o">=</span> <span class="s2">&#34;ArXiv ML Landscape&#34;</span><span class="p">,</span>
    <span class="n">sub_title</span> <span class="o">=</span> <span class="s2">&#34;A data map of papers from the Machine Learning section of ArXiv&#34;</span><span class="p">,</span>
    <span class="n">highlight_labels</span> <span class="o">=</span> <span class="n">highlight_labels</span><span class="p">,</span>
    <span class="n">label_font_size</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
    <span class="n">highlight_label_keywords</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s2">&#34;fontsize&#34;</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> <span class="s2">&#34;fontweight&#34;</span><span class="p">:</span> <span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="s2">&#34;bbox&#34;</span><span class="p">:{</span><span class="s2">&#34;boxstyle&#34;</span><span class="p">:</span><span class="s2">&#34;circle&#34;</span><span class="p">,</span> <span class="s2">&#34;pad&#34;</span><span class="p">:</span><span class="mf">0.75</span><span class="p">}</span>
    <span class="p">},</span>
    <span class="n">logo</span><span class="o">=</span><span class="n">arxiv_logo</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;arxiv_white.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/arxiv_white.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三gallery">3. Gallery</h2>
<p>For more, see the documentation: <a href="https://github.com/TutteInstitute/datamapplot">DataMapPlot</a>, <a href="https://github.com/TutteInstitute/datamapplot">https://github.com/TutteInstitute/datamapplot</a></p>
<p><img loading="lazy" src="img/plot_arxiv_ml.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/plot_wikipedia.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Dataset | Hong Kong Stock Annual Report Text Dataset (2007 ~ 2025.04)</title>
      <link>https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/</link>
      <pubDate>Sun, 21 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-21-hk-stock-market-anual-report/</guid>
      <description>&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据名称: 港股年报
数据来源: 披露易（https://www1.hkexnews.hk/）
公司数量: 3067
报告数量: 31410
会计年度: 2007 ~ 2024
报告发布日期: 2007-01-08 ~ 2025-04-30
数据类型: pdf、txt、csv(csv是对所有txt的汇总文件)
数据体积: 155G
本文声明: 科研用途； 如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-数据来源&#34;&gt;1.2 数据来源&lt;/h3&gt;
&lt;p&gt;数据整理自 &lt;em&gt;&lt;strong&gt;披露易 &lt;a href=&#34;https://www1.hkexnews.hk&#34;&gt;https://www1.hkexnews.hk&lt;/a&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://www1.hkexnews.hk/search/titlesearch.xhtml?lang=zh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-site.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;p&gt;csv是对港股中(英)文TXT的汇总，且已对中文进行了繁体转简体处理。&lt;/p&gt;
&lt;h3 id=&#34;21-读取&#34;&gt;2.1 读取&lt;/h3&gt;
&lt;p&gt;csv是对所有 txt 的汇总文件， 如果电脑内存16G +， 可直接读取。 &lt;code&gt;港股中文年报.csv.gz(2.69G，解压后大概8.8G)&lt;/code&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;港股中文年报.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;如果电脑内存小于16G， 可参考 &lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;&lt;strong&gt;代码 | 如何处理远超电脑内存的csv文件&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#只读取5行&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;cdf2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;港股中文年报.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-记录数&#34;&gt;2.2 记录数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;31410
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;23-公司数量&#34;&gt;2.3 公司数量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;3067
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-会计年度&#34;&gt;2.4 会计年度&lt;/h3&gt;
&lt;p&gt;数据集覆盖的会计年度主要集中在 2007 ~ 2024，但2001 ~ 2006也会有少量记录。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;sorted&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2001&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;25-发布日期&#34;&gt;2.5 发布日期&lt;/h3&gt;
&lt;p&gt;港股年报报告发布日期&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pubdate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pubdate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pubdate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pubdate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2007-01-08 00:00:00
2025-04-30 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;26-年度报告量&#34;&gt;2.6 年度报告量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#文泉驿微米黑.ttf位于代码同文件夹&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cdf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;grey&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;港股中文年报发布数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;报告数&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/08-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/call_for_paper/&#34;&gt;长期征稿&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/we_need_you/&#34;&gt;长期招募小伙伴&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/management_python_course&#34;&gt;&lt;strong&gt;付费视频课 | Python实证指标构建与文本分析&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;数据集 | 2001-2022年A股上市公司年报&amp;amp;管理层讨论与分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/&#34;&gt;数据集 | 三板上市公司年报2002-2023.12&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/&#34;&gt;数据集 | 美股年报10-K、20-F数据(2000-2023.12)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">1. Dataset Overview</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Dataset name: Hong Kong stock annual reports (港股年报)
Source: HKEXnews 披露易 (https://www1.hkexnews.hk/)
Companies: 3067
Reports: 31410
Fiscal years: 2007 ~ 2024
Publication dates: 2007-01-08 ~ 2025-04-30
Formats: pdf, txt, csv (the csv aggregates all the txt files)
Size: 155 GB
Notice: for research use only; with questions, add WeChat 372335839, noting "Name-School-Major"
</code></pre></div><br>
<h3 id="12-数据来源">1.2 Data source</h3>
<p>The data is compiled from <em><strong>HKEXnews 披露易 <a href="https://www1.hkexnews.hk">https://www1.hkexnews.hk</a></strong></em></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://www1.hkexnews.hk/search/titlesearch.xhtml?lang=zh
</code></pre></div><p><img loading="lazy" src="img/05-site.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二查看数据">2. Exploring the Data</h2>
<p>The csv aggregates the Chinese (and English) txt files, and the Chinese text has already been converted from traditional to simplified characters.</p>
<h3 id="21-读取">2.1 Reading</h3>
<p>The csv aggregates all the txt files; with 16 GB+ of RAM it can be read directly: <code>港股中文年报.csv.gz</code> (2.69 GB compressed, roughly 8.8 GB uncompressed).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">cdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;港股中文年报.csv.gz&#39;</span><span class="p">)</span>
<span class="n">cdf</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<br>
<br>
<p>If your machine has less than 16 GB of RAM, see <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>Code | How to process csv files far larger than memory</strong></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1"># read only the first 5 rows</span>
<span class="n">cdf2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;港股中文年报.csv.gz&#39;</span><span class="p">,</span> 
                  <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                  <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div><br>
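<p>The core idea of processing a csv far larger than memory is streaming it in chunks with pandas' <code>chunksize</code> and aggregating per chunk. A minimal sketch (the in-memory buffer and column names stand in for the real 港股中文年报.csv.gz):</p>

```python
import io
import pandas as pd

# An in-memory buffer stands in for the large csv.gz file (illustrative data)
buf = io.StringIO("code,year\n1,2007\n2,2008\n3,2008\n4,2009\n")

total = 0
for chunk in pd.read_csv(buf, chunksize=2):  # stream 2 rows at a time
    total += len(chunk)  # aggregate per chunk instead of holding everything

print(total)  # 4
```

With the real file you would pass the filename plus <code>compression='gzip'</code> and a much larger <code>chunksize</code> (e.g. 10000).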
<h3 id="22-记录数">2.2 Number of records</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">cdf</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">31410
</code></pre></div><h3 id="23-公司数量">2.3 Number of companies</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">3067
</code></pre></div><br>
<h3 id="24-会计年度">2.4 Fiscal years</h3>
<p>The fiscal years covered are concentrated in 2007 ~ 2024, with a small number of records from 2001 ~ 2006 as well.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">sorted</span><span class="p">(</span><span class="n">cdf</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cdf</span><span class="p">[</span><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">2001</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/07-df.png" alt=""  />
</p>
<br>
<h3 id="25-发布日期">2.5 Publication dates</h3>
<p>Publication dates of the Hong Kong stock annual reports:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;pubdate&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;pubdate&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;pubdate&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;pubdate&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2007-01-08 00:00:00
2025-04-30 00:00:00
</code></pre></div><br>
<h3 id="26-年度报告量">2.6 Reports per year</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf (WenQuanYi Micro Hei) sits in the same folder as this script</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">data</span> <span class="o">=</span> <span class="n">cdf</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_col</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;港股中文年报发布数量&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;报告数&#39;</span><span class="p">)</span>
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/08-plot.png" alt=""  />
</p>
<br>
<br>
<h2 id="相关内容">Related Content</h2>
<ul>
<li><a href="https://textdata.cn/blog/call_for_paper/">Ongoing call for submissions</a></li>
<li><a href="https://textdata.cn/blog/we_need_you/">Ongoing recruitment of collaborators</a></li>
<li><a href="https://textdata.cn/blog/management_python_course"><strong>Paid video course | Building empirical indicators and text analysis with Python</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">Dataset | 2001-2022 A-share listed company annual reports &amp; MD&amp;A</a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/">Dataset | NEEQ (三板) listed company annual reports 2002-2023.12</a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/">Dataset | US 10-K and 20-F annual report data (2000-2023.12)</a></li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 用来练习pandas的招聘数据</title>
      <link>https://textdata.cn/blog/2024-01-19-recruitment-dataset/</link>
      <pubDate>Fri, 19 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-19-recruitment-dataset/</guid>
      <description>&lt;h2 id=&#34;相关推文&#34;&gt;相关推文&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;推荐 | 如何处理远超电脑内存的csv文件&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 数据集名：招聘数据集
- 采集时间：2018.7
- 数据来源：58同城、智联招聘
- 记录数: 1701992

百度网盘链接: https://pan.baidu.com/s/1arYXcrexLW__SFF5AbjAaA?pwd=sfg5 提取码: sfg5 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;声明&#34;&gt;声明&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;仅供科研使用，大家可以用来练习Pandas&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;该数据集是有偏的， 不太适合做研究。 如果你想用这个数据集做研究， 拿去不谢，但不要加我微信提问呀！！我知道的都在推文里！！&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二pandas练习&#34;&gt;二、Pandas练习&lt;/h2&gt;
&lt;h3 id=&#34;21-读取&#34;&gt;2.1 读取&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2018.7招聘数据.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#使用bandizip或winrar解压gz，得到csv&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df = pd.read_csv(&amp;#39;2018.7招聘数据.csv&amp;#39;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;记录数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1701992
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-省份&#34;&gt;2.2 省份&lt;/h3&gt;
&lt;p&gt;不同省份的记录数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;省份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;省份
北京市         410142
上海市         364047
河南省         156374
福建省         120816
广东省         101390
湖北省          63507
河北省          57152
江苏省          52360
四川省          51849
山东省          46956
重庆市          43153
湖南省          41438
陕西省          32108
浙江省          31838
黑龙江省         20466
贵州省          17837
辽宁省          15015
海南省          14412
云南省          13542
广西壮族自治区      12842
吉林省          11502
江西省           9638
新疆维吾尔自治区      5071
天津市           3681
安徽省           3547
山西省           1308
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-学历&#34;&gt;2.3 学历&lt;/h3&gt;
&lt;p&gt;不同学历的记录数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;学历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;学历
学历不限    999542
大专      286629
高中      123481
中专      100423
不限       84206
本科       83400
中技       10810
技校        6736
硕士        6151
博士         613
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;筛选出需要博士学历的记录&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;学历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;博士&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;24-岗位描述&#34;&gt;2.4 岗位描述&lt;/h3&gt;
&lt;h4 id=&#34;241-文本长度&#34;&gt;2.4.1 文本长度&lt;/h4&gt;
&lt;p&gt;岗位描述文本长度&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;岗位描述&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;0           974
1           457
2           731
3           430
4           348
           ... 
1701987     294
1701988    1029
1701989     322
1701990      25
1701991     377
Name: 岗位描述, Length: 1701992, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h4 id=&#34;242-是否含某个类词&#34;&gt;2.4.2 是否含某个(类)词&lt;/h4&gt;
&lt;p&gt;岗位描述是否含 &lt;code&gt;抗压能力强&lt;/code&gt; 或 &lt;code&gt;压力大&lt;/code&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#一个词&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df[df[&amp;#39;岗位描述&amp;#39;].fillna(&amp;#39;&amp;#39;).str.contains(&amp;#39;抗压能力强&amp;#39;)].head()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#多个词用|间隔&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;岗位描述&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;抗压能力强|压力大&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;岗位描述含 &lt;code&gt;抗压能力强|压力大&lt;/code&gt; 的工作占比&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;压力占比&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;岗位描述&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;抗压能力强|压力大&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;轻松占比&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;岗位描述&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;工作轻松|压力小&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;压力占比 0.012797357449388716
轻松占比 0.018608195573187183
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;三获取数据&#34;&gt;三、获取数据&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;百度网盘链接: https://pan.baidu.com/s/1arYXcrexLW__SFF5AbjAaA?pwd=sfg5 提取码: sfg5 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;声明-1&#34;&gt;声明&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;仅供科研使用，大家可以用来练习Pandas&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;该数据集是有偏的， 不太适合做研究。 如果你想用这个数据集做研究， 拿去不谢，但不要加我微信提问呀！！我知道的都在推文里！！&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
<content:encoded><![CDATA[<h2 id="相关推文">Related Posts</h2>
<p><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">Recommended | How to handle csv files far larger than your computer memory</a></p>
<p><br><br></p>
<h2 id="一数据集概况">1. Dataset Overview</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Dataset: recruitment data
- Collected: July 2018
- Sources: 58同城 (58.com), 智联招聘 (Zhaopin)
- Records: 1701992

Baidu Netdisk link: https://pan.baidu.com/s/1arYXcrexLW__SFF5AbjAaA?pwd=sfg5  extraction code: sfg5
</code></pre></div><h3 id="声明">Disclaimer</h3>
<p><strong>For research use only; feel free to use it to practice Pandas</strong>.</p>
<p>The dataset is biased and not well suited for research. If you still want to use it for research, be my guest, but please do not add me on WeChat with questions; everything I know is already in the post!</p>
<p><br><br></p>
<h2 id="二pandas练习">2. Pandas Practice</h2>
<h3 id="21-读取">2.1 Loading</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2018.7招聘数据.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>

<span class="c1">#unzip the .gz with Bandizip or WinRAR to get the csv</span>
<span class="c1">#df = pd.read_csv(&#39;2018.7招聘数据.csv&#39;)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
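If the full file (1.7 million rows) strains memory, pandas can also stream it in chunks. A minimal sketch, demonstrated on a small in-memory csv so it runs anywhere; in practice pass '2018.7招聘数据.csv.gz' instead:

```python
import io
import pandas as pd

# Small in-memory csv standing in for the real 1.7M-row file.
csv_text = "省份,学历\n北京市,本科\n上海市,大专\n北京市,硕士\n"

# chunksize makes read_csv yield DataFrames piece by piece;
# aggregate per chunk instead of holding all rows at once.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += len(chunk)
print(total)  # 3
```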
<p>Number of records</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1701992
</code></pre></div><br>
<h3 id="22-省份">2.2 Province</h3>
<p>Record counts by province</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;省份&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">省份
北京市         410142
上海市         364047
河南省         156374
福建省         120816
广东省         101390
湖北省          63507
河北省          57152
江苏省          52360
四川省          51849
山东省          46956
重庆市          43153
湖南省          41438
陕西省          32108
浙江省          31838
黑龙江省         20466
贵州省          17837
辽宁省          15015
海南省          14412
云南省          13542
广西壮族自治区      12842
吉林省          11502
江西省           9638
新疆维吾尔自治区      5071
天津市           3681
安徽省           3547
山西省           1308
Name: count, dtype: int64
</code></pre></div><br>
<h3 id="23-学历">2.3 Education</h3>
<p>Record counts by education level</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;学历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">学历
学历不限    999542
大专      286629
高中      123481
中专      100423
不限       84206
本科       83400
中技       10810
技校        6736
硕士        6151
博士         613
Name: count, dtype: int64
</code></pre></div><br>
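The raw counts above turn into shares of all records with normalize=True. A sketch on a toy column; the real one is df['学历']:

```python
import pandas as pd

# Toy education column; in practice use df['学历'].
edu = pd.Series(['学历不限', '大专', '大专', '本科'])

# normalize=True returns each category's share of all records.
shares = edu.value_counts(normalize=True)
print(shares['大专'])    # 0.5
print(shares['学历不限'])  # 0.25
```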
<p>Filter the records that require a doctoral degree</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;学历&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;博士&#39;</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
<img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="24-岗位描述">2.4 Job Descriptions</h3>
<h4 id="241-文本长度">2.4.1 Text Length</h4>
<p>Length of each job-description text</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;岗位描述&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0           974
1           457
2           731
3           430
4           348
           ... 
1701987     294
1701988    1029
1701989     322
1701990      25
1701991     377
Name: 岗位描述, Length: 1701992, dtype: int64
</code></pre></div><br>
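The per-row lengths are usually summarized next. A sketch on a toy column with a missing value; the real one is df['岗位描述']:

```python
import pandas as pd

# Toy job-description column with a missing value; in practice use df['岗位描述'].
desc = pd.Series(['招聘文员一名', None, '要求抗压能力强，能接受加班'])

# fillna('') keeps NaN rows from breaking .str.len(); they count as length 0.
lengths = desc.fillna('').str.len()
print(lengths.max())  # 13
print(lengths.min())  # 0
```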
<h4 id="242-是否含某个类词">2.4.2 Does It Contain a Given Word (or Class of Words)?</h4>
<p>Whether the job description contains <code>抗压能力强</code> (handles pressure well) or <code>压力大</code> (high pressure)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#a single word</span>
<span class="c1">#df[df[&#39;岗位描述&#39;].fillna(&#39;&#39;).str.contains(&#39;抗压能力强&#39;)].head()</span>

<span class="c1">#separate multiple words with |</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;岗位描述&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;抗压能力强|压力大&#39;</span><span class="p">)]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<p>Share of jobs whose description contains <code>抗压能力强|压力大</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;压力占比&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;岗位描述&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;抗压能力强|压力大&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;轻松占比&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;岗位描述&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;工作轻松|压力小&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">压力占比 0.012797357449388716
轻松占比 0.018608195573187183
</code></pre></div><p>&hellip;</p>
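The two prints above generalize to any number of named word classes. A sketch over toy descriptions; in practice substitute df['岗位描述']:

```python
import pandas as pd

# Toy descriptions; in practice use df['岗位描述'].
desc = pd.Series(['抗压能力强者优先', '工作轻松，环境好', None, '朝九晚五'])

# One regex per word class; .mean() of the boolean mask is the share of matches.
patterns = {'压力': '抗压能力强|压力大', '轻松': '工作轻松|压力小'}
shares = {name: desc.fillna('').str.contains(pat).mean()
          for name, pat in patterns.items()}
print(shares['压力'])  # 0.25
print(shares['轻松'])  # 0.25
```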
<h2 id="三获取数据">3. Getting the Data</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Baidu Netdisk link: https://pan.baidu.com/s/1arYXcrexLW__SFF5AbjAaA?pwd=sfg5  extraction code: sfg5
</code></pre></div><h3 id="声明-1">Disclaimer</h3>
<p><strong>For research use only; feel free to use it to practice Pandas</strong>.</p>
<p>The dataset is biased and not well suited for research. If you still want to use it for research, be my guest, but please do not add me on WeChat with questions; everything I know is already in the post!</p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 三板上市公司年报2002-2025.06</title>
      <link>https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/</link>
      <pubDate>Thu, 18 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/</guid>
      <description>&lt;h2 id=&#34;一数据集&#34;&gt;一、数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-概况&#34;&gt;1.1 概况&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据来源: 全国中小企业股份转让系统(https://www.neeq.com.cn/）

覆盖时间: 2002-04-02 ~ 2025-06-13

年报数量: 82728

累积挂牌数量: 14556

数据集体积: 152G

文件格式: pdf、txt、csv(csv是一个汇总文件，方便数据分析)
   
csv所含字段:
 - code
 - year
 - text
 
声明: 科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-txt.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h3 id=&#34;12--声明&#34;&gt;1.2  声明&lt;/h3&gt;
&lt;p&gt;&lt;span style=&#34;font-size: 18px;color: green;&#34;&gt;科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;三板年报.csv.gz&lt;/strong&gt;&lt;/em&gt; 是一个汇总的 csv 文件，特别适合进行数据分析。 解压后大概 15G， 如果你的电脑内存小于32G， &lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;推荐阅读 | 如何处理远超电脑内存的csv文件&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三板年报.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-记录数&#34;&gt;2.2 记录数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;
&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;82728
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23--累计挂牌企业数量&#34;&gt;2.3  累计挂牌企业数量&lt;/h3&gt;
&lt;p&gt;累计挂牌企业数量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;14556
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-日期范围&#34;&gt;2.4 日期范围&lt;/h3&gt;
&lt;p&gt;数据集覆盖的日期范围&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#年报发布日期&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2002-04-02 00:00:00
2025-06-13 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;25-年度记录数&#34;&gt;2.5 年度记录数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plotnine&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.font_manager&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#文泉驿微米黑.ttf位于代码同文件夹&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FontProperties&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fname&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;文泉驿微米黑.ttf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;geom_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;va&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;grey&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;theme&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure_size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
           &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()),&lt;/span&gt; 
           &lt;span class=&#34;n&#34;&gt;plot_title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;element_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;family&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;font_prop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;14&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
          &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三板年报发布数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
          &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;报告数&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三相关内容&#34;&gt;三、相关内容&lt;/h2&gt;
&lt;p&gt;想用 Python 分析 csv、xlsx 数据，建议尽量用 pandas 来写代码。以下是近期与 pandas 数据处理相关的免费推文教程，感兴趣的读者可以浏览。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;推荐阅读 | 如何处理远超电脑内存的csv文件&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;数据集 | 2001-2024年A股上市公司年报&amp;amp;管理层讨论与分析&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;词向量 | 使用MD&amp;amp;A2001-2024语料训练Word2Vec模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/&#34;&gt;&lt;strong&gt;数据集 | 港股年报文本数据集(2007 ~ 2025.06)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/&#34;&gt;&lt;strong&gt;数据集 | 美股年报10-K、20-F数据(2000-2023.12)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集">一、数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源: 全国中小企业股份转让系统(https://www.neeq.com.cn/)

覆盖时间: 2002-04-02 ~ 2025-06-13

年报数量: 82728

累积挂牌数量: 14556

数据集体积: 152G

文件格式: pdf、txt、csv(csv是一个汇总文件，方便数据分析)
   
csv所含字段:
 - code
 - year
 - date
 - text
 
声明: 科研用途; 如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/01-screen.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-txt.png" alt=""  />
</p>
<br>
<br>
<h3 id="12--声明">1.2  声明</h3>
<p><span style="font-size: 18px;color: green;">科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</span><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<p><em><strong>三板年报.csv.gz</strong></em> 是一个汇总的 csv 文件，特别适合进行数据分析。 解压后大概 15G， 如果你的电脑内存小于32G， <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">推荐阅读 | 如何处理远超电脑内存的csv文件</a></p>
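<p>如果内存不足以一次性载入整个文件，可以借助 <code>read_csv</code> 的 <code>chunksize</code> 参数分块读取、分块累加统计量。下面是一个最小示意（其中的演示文件 demo.csv.gz、块大小等均为演示假设，实际使用时换成 三板年报.csv.gz）：</p>

```python
import gzip

import pandas as pd

# 演示用：先构造一个小的 csv.gz 文件（真实场景换成 三板年报.csv.gz）
with gzip.open('demo.csv.gz', 'wt', encoding='utf-8') as f:
    f.write('code,year,text\n')
    for i in range(10):
        f.write(f'{830000 + i},2020,示例文本{i}\n')

# chunksize 指定每块读取的行数；按块累加统计量，避免一次性载入全部数据
total_rows = 0
codes = set()
for chunk in pd.read_csv('demo.csv.gz', chunksize=4):
    total_rows += len(chunk)
    codes.update(chunk['code'])

print(total_rows, len(codes))
```

<p>这样无论文件多大，内存中每次只保留一个分块；统计类任务（计数、去重、分组求和）都可以按这种"边读边算"的方式完成。</p>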
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;三板年报.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<h3 id="22-记录数">2.2 记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">82728
</code></pre></div><br>
<h3 id="23--累计挂牌企业数量">2.3  累计挂牌企业数量</h3>
<p>累计挂牌企业数量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">14556
</code></pre></div><br>
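<p>结合 2.2 与 2.3 的两个统计量（82728 条记录、14556 家企业），可以顺手估算平均每家挂牌企业的年报份数，一个简单的算术示意：</p>

```python
# 上文统计得到的两个量
n_reports = 82728     # 年报记录数 len(df)
n_companies = 14556   # 累计挂牌企业数 df['code'].nunique()

# 平均每家企业的年报份数
avg_reports = n_reports / n_companies
print(round(avg_reports, 2))  # 5.68
```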
<h3 id="24-日期范围">2.4 日期范围</h3>
<p>数据集覆盖的日期范围</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>

<span class="c1">#年报发布日期</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2002-04-02 00:00:00
2025-06-13 00:00:00
</code></pre></div><br>
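<p>经过 <code>to_datetime</code> 转换后，还可以直接用 <code>dt.year</code> 从日期列提取年份做简单统计。下面用一小段演示数据示意（日期均为假设，真实场景换成 三板年报 的 df）：</p>

```python
import pandas as pd

# 演示数据（日期为假设值）
df_demo = pd.DataFrame({
    'date': ['2023-04-10', '2023-04-25', '2024-04-18'],
})
df_demo['date'] = pd.to_datetime(df_demo['date'])

# dt.year 提取年份后，统计各年的记录数
year_counts = df_demo['date'].dt.year.value_counts().sort_index()
print(year_counts)
```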
<h3 id="25-年度记录数">2.5 年度记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">plotnine</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.font_manager</span> <span class="kn">import</span> <span class="n">FontProperties</span>

<span class="c1">#文泉驿微米黑.ttf位于代码同文件夹</span>
<span class="n">font_prop</span> <span class="o">=</span> <span class="n">FontProperties</span><span class="p">(</span><span class="n">fname</span><span class="o">=</span><span class="s1">&#39;文泉驿微米黑.ttf&#39;</span><span class="p">)</span> 

<span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;year&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span>

<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>  <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">))</span>
    <span class="o">+</span><span class="n">geom_col</span><span class="p">()</span>
    <span class="o">+</span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;count&#39;</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;grey&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="o">+</span><span class="n">theme</span><span class="p">(</span><span class="n">figure_size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
           <span class="n">text</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">()),</span> 
           <span class="n">plot_title</span> <span class="o">=</span> <span class="n">element_text</span><span class="p">(</span><span class="n">family</span> <span class="o">=</span> <span class="n">font_prop</span><span class="o">.</span><span class="n">get_name</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
          <span class="p">)</span>
    <span class="o">+</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">&#39;三板年报发布数量&#39;</span><span class="p">,</span>
          <span class="n">x</span> <span class="o">=</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> 
          <span class="n">y</span> <span class="o">=</span> <span class="s1">&#39;报告数&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三相关内容">三、相关内容</h2>
<p>想用 Python 分析 csv、xlsx 数据，建议尽量用 pandas 来写代码。以下是近期与 pandas 数据处理相关的免费推文教程，感兴趣的读者可以浏览。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">推荐阅读 | 如何处理远超电脑内存的csv文件</a></li>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/">数据集 | 2001-2024年A股上市公司年报&amp;管理层讨论与分析</a></li>
<li><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量 | 使用MD&amp;A2001-2024语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/"><strong>数据集 | 港股年报文本数据集(2007 ~ 2025.06)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/"><strong>数据集 | 美股年报10-K、20-F数据(2000-2023.12)</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>cpca库 | 中国省、市区划匹配库</title>
      <link>https://textdata.cn/blog/2024-01-16-cpca-china-province-city-area/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-16-cpca-china-province-city-area/</guid>
      <description>&lt;p&gt;cpca库， 可提取简体中文字符串中 **省、市和区(县)**区划信息，且能够进行映射，检验和简单绘图。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一安装&#34;&gt;一、安装&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install jinja2==3.0.1
pip3 install pyecharts==0.5.11
pip3 install echarts-countries-pypkg
pip3 install pyecharts-snapshot
pip3 install cpca
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二快速上手&#34;&gt;二、快速上手&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cpca&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;location_str&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;徐汇区虹漕路461号58号楼5楼&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;s2&#34;&gt;&amp;#34;泉州市洛江区万安塘西工业区&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;s2&#34;&gt;&amp;#34;北京朝阳区北苑华贸城&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cpca&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;location_str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;|    | 省     | 市     | 区     | 地址                 |   adcode |
|---:|:-------|:------|:-------|:--------------------|---------:|
|  0 | 上海市  | 市辖区 |  徐汇区 | 虹漕路461号58号楼5楼   |   310104 |
|  1 | 福建省  | 泉州市 |  洛江区 | 万安塘西工业区         |   350504 |
|  2 | 北京市  | 市辖区 |  朝阳区 | 北苑华贸城            |   110105 |
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cpca&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;cpca&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;朝阳区汉庭酒店大山子店&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;|    | 省     | 市     | 区     |    地址          |   adcode |
|---:|:-------|:-------|:-------|:-----------------|---------:|
|  0 | 吉林省  | 长春市 | 朝阳区  | 汉庭酒店大山子店   |   220104 |
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;中国的区级行政单位非常多，经常出现重名的情况，比如“&lt;em&gt;&lt;strong&gt;北京市朝阳区&lt;/strong&gt;&lt;/em&gt;”和“&lt;em&gt;&lt;strong&gt;吉林省长春市朝阳区&lt;/strong&gt;&lt;/em&gt;”。当有上级地址信息时，cpca 能够据此推断出具体是哪个区；但如果只有区名而没有上级地址信息，cpca 就无法推断，只能任选其一。此时可以通过 umap 参数指定该选择哪一个：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;cpca&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;朝阳区汉庭酒店大山子店&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;umap&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;朝阳区&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;110105&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;|    | 省     | 市      | 区     |    地址          |   adcode |
|---:|:-------|:-------|:-------|:-----------------|---------:|
|  0 | 北京市  | 市辖区  | 朝阳区  | 汉庭酒店大山子店   |   110105 |
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三案例&#34;&gt;三、案例&lt;/h2&gt;
&lt;p&gt;cpca 运行速度很快，这里提供了案例数据 &lt;em&gt;&lt;strong&gt;addr.csv&lt;/strong&gt;&lt;/em&gt;，共 18367 条地址记录。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/DQinYuan/chinese_province_city_area_mapper/blob/master/cpca/resources/adcodes.csv&#34;&gt;https://github.com/DQinYuan/chinese_province_city_area_mapper/blob/master/cpca/resources/adcodes.csv&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;31-读取数据&#34;&gt;3.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;raw_addr_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;addr.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;raw_addr_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33--地址操作&#34;&gt;3.2 地址操作&lt;/h3&gt;
&lt;p&gt;生成标准地址信息&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cpca&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;addr_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cpca&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;raw_addr_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;原始地址&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;addr_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-绘制热力图&#34;&gt;3.3 绘制热力图&lt;/h3&gt;
&lt;p&gt;使用 folium 库绘制热力图（需要注意，打开 html 时，需要有梯子的网络环境）&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cpca&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;drawer&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df为上一段代码输出的df&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;drawer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;draw_locations&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;addr_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;adcode&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;df.html&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;这一段代码运行结束后，会在当前目录下生成一个 df.html 文件，用浏览器打开即可看到绘制好的地图（如果某条数据的&amp;lsquo;省&amp;rsquo;、&amp;lsquo;市&amp;rsquo;或&amp;lsquo;区&amp;rsquo;字段有缺，则会忽略该条数据不进行绘制）。绘制速度较慢，需要耐心等待，图像如下：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-folium.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<p>cpca库， 可提取简体中文字符串中 **省、市和区(县)**区划信息，且能够进行映射，检验和简单绘图。</p>
<p><br><br></p>
<h2 id="一安装">一、安装</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install jinja2==3.0.1
pip3 install pyecharts==0.5.11
pip3 install echarts-countries-pypkg
pip3 install pyecharts-snapshot
pip3 install cpca
</code></pre></div><p><br><br></p>
<h2 id="二快速上手">二、快速上手</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cpca</span>

<span class="n">location_str</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;徐汇区虹漕路461号58号楼5楼&#34;</span><span class="p">,</span> 
                <span class="s2">&#34;泉州市洛江区万安塘西工业区&#34;</span><span class="p">,</span> 
                <span class="s2">&#34;北京朝阳区北苑华贸城&#34;</span><span class="p">]</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">cpca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">location_str</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|    | 省     | 市     | 区     | 地址                 |   adcode |
|---:|:-------|:------|:-------|:--------------------|---------:|
|  0 | 上海市  | 市辖区 |  徐汇区 | 虹漕路461号58号楼5楼   |   310104 |
|  1 | 福建省  | 泉州市 |  洛江区 | 万安塘西工业区         |   350504 |
|  2 | 北京市  | 市辖区 |  朝阳区 | 北苑华贸城            |   110105 |
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cpca</span>

<span class="n">cpca</span><span class="o">.</span><span class="n">transform</span><span class="p">([</span><span class="s2">&#34;朝阳区汉庭酒店大山子店&#34;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|    | 省     | 市     | 区     |    地址          |   adcode |
|---:|:-------|:-------|:-------|:-----------------|---------:|
|  0 | 吉林省  | 长春市 | 朝阳区  | 汉庭酒店大山子店   |   220104 |
</code></pre></div><br>
<p>中国的区级行政单位非常多，经常有重名的情况，比如“<em><strong>北京市朝阳区</strong></em>”和“<em><strong>吉林省长春市朝阳区</strong></em>”。当地址中含有上级地址信息时，cpca 能够据此推断出是哪个区；但如果没有上级地址信息、只有一个区名，cpca 无法推断，只能默认选择其中一个。通过 umap 参数可以指定这种情况下应选择哪一个：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">cpca</span><span class="o">.</span><span class="n">transform</span><span class="p">([</span><span class="s2">&#34;朝阳区汉庭酒店大山子店&#34;</span><span class="p">],</span> <span class="n">umap</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;朝阳区&#34;</span><span class="p">:</span><span class="s2">&#34;110105&#34;</span><span class="p">})</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">|    | 省     | 市      | 区     |    地址          |   adcode |
|---:|:-------|:-------|:-------|:-----------------|---------:|
|  0 | 北京市  | 市辖区  | 朝阳区  | 汉庭酒店大山子店   |   110105 |
</code></pre></div><br>
<br>
<h2 id="三案例">三、案例</h2>
<p>cpca 运行速度很快。这里提供了案例数据 <em><strong>addr.csv</strong></em>，共 18367 条地址记录。</p>
<blockquote>
<p><a href="https://github.com/DQinYuan/chinese_province_city_area_mapper/blob/master/cpca/resources/adcodes.csv">https://github.com/DQinYuan/chinese_province_city_area_mapper/blob/master/cpca/resources/adcodes.csv</a></p>
</blockquote>
<h3 id="31-读取数据">3.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">raw_addr_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;addr.csv&#39;</span><span class="p">)</span>
<span class="n">raw_addr_df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="33--地址操作">3.2 地址操作</h3>
<p>生成标准地址信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cpca</span>

<span class="n">addr_df</span> <span class="o">=</span> <span class="n">cpca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">raw_addr_df</span><span class="p">[</span><span class="s1">&#39;原始地址&#39;</span><span class="p">])</span>
<span class="n">addr_df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="33-绘制热力图">3.3 绘制热力图</h3>
<p>使用 folium 库绘制热力图（需要注意，打开 html 时需要有梯子的网络环境）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">cpca</span> <span class="kn">import</span> <span class="n">drawer</span>
<span class="c1">#df为上一段代码输出的df</span>
<span class="n">drawer</span><span class="o">.</span><span class="n">draw_locations</span><span class="p">(</span><span class="n">addr_df</span><span class="p">[</span><span class="s1">&#39;adcode&#39;</span><span class="p">],</span> <span class="s2">&#34;df.html&#34;</span><span class="p">)</span>
</code></pre></div><p>这一段代码运行结束后，会在运行代码的当前目录下生成一个df.html文件，用浏览器打开即可看到绘制好的地图（如果某条数据&lsquo;省&rsquo;、&lsquo;市&rsquo;或&lsquo;区&rsquo;字段有缺，则会忽略该条数据不进行绘制）。绘制速度比较慢，需要耐心等待，绘制的图像如下：</p>
<p><img loading="lazy" src="img/02-folium.png" alt=""  />
</p>
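<p>绘图前可以先统计有多少条记录会因省/市/区字段缺失而被忽略（示意代码，下面的 addr_df 为假设的示例数据，实际应使用上文 cpca.transform 的输出）：</p>

```python
import pandas as pd

# 示意数据：实际使用时 addr_df 应为上文 cpca.transform 生成的结果
addr_df = pd.DataFrame({
    '省': ['北京市', None, '福建省'],
    '市': ['市辖区', '长春市', '泉州市'],
    '区': ['朝阳区', '朝阳区', None],
})

# 省/市/区 任一字段缺失的记录，draw_locations 绘图时会被忽略
complete_df = addr_df.dropna(subset=['省', '市', '区'])
print(f'可绘制 {len(complete_df)} 条，忽略 {len(addr_df) - len(complete_df)} 条')
```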
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 |  美股年报10-K、20-F数据(2000-2023.12)</title>
      <link>https://textdata.cn/blog/2024-01-14-usa-sec-10k-report-dataset/</link>
      <pubDate>Sat, 13 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-14-usa-sec-10k-report-dataset/</guid>
      <description>&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据名称: 美股年报10-K、20-F报告
数据来源: SEC
报告类型: 10-K、20-F
公司数量: 33619
报告数量: 189739
覆盖日期: 2000-07-05 ~ 2024-01-05
数据类型: html、csv(csv是对所有html的汇总文件)
数据体积: 378G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-size.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;11-声明&#34;&gt;1.1 声明&lt;/h3&gt;
&lt;p&gt;科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-格式说明&#34;&gt;1.2 格式说明&lt;/h3&gt;
&lt;p&gt;美股报告是html格式(中国沪深交易所的报告是pdf格式)，可以通过爬虫批量下载所有的报告，并保存为html。&lt;/p&gt;
&lt;p&gt;以苹果公司为例，&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-apple.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二-html文件&#34;&gt;二、 html文件&lt;/h2&gt;
&lt;p&gt;美股报告数据以html格式存储，总体积约378G。了解其命名规则和处理方式，才能更好地使用该数据集。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-2023.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;21-html命名规则&#34;&gt;2.1 html命名规则&lt;/h3&gt;
&lt;p&gt;以 &lt;code&gt;1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html&lt;/code&gt; 为例，html命名依次为：CIK码(SEC公司识别码)、会计期末、股票代码、公司全名、Form类型、报告发布日期。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[&amp;#39;1973368&amp;#39;,
 &amp;#39;2023-03-31&amp;#39;,
 &amp;#39;SVMH&amp;#39;,
 &amp;#39;SRIVARU Holding Ltd&amp;#39;,
 &amp;#39;20-F&amp;#39;,
 &amp;#39;2023-12-28.html&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-提取文本&#34;&gt;2.2 提取文本&lt;/h3&gt;
&lt;p&gt;如果觉得html不方便分析，可以使用 pyquery、BeautifulSoup等html解析库，提取html中的文本内容。本文以pyquery为例&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pyquery&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;PyQuery&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;PyQuery&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;bazadebezolkohpepadr=&amp;#34;608506832&amp;#34;\nfalse\nFY\n0001973368\n0001973368\n2022-04-01\n2023-03-31\n0001973368\ndei:BusinessContactMember\n2022-04-01\n2023-03-31\n0001973368\nSVMHW:OrdinarySharesMember\n2022-04-01\n2023-03-31\n0001973368\nSVMHW:WarrantsMember\n2022-04-01\n2023-03-31\n0001973368\n2023-03-31\n0001973368\n2022-03-31\n0001973368\n2021-06-16\n2022-03-31\n0001973368\nSVMHW:PredecessorMember\n2021-04-01\n2021-06-15\n0001973368\n2021-04-01\n2021-06-15\n0001973368\nSVMHW:PredecessorMember\nus-gaap:CommonStockMember\n2021-03-31\n0001973368\nSVMHW:PredecessorMember\nSVMHW:SharePremiumMember\n2021-03-31\n0001973368\nSVMHW:PredecessorMember\nus-gaap:RetainedEarningsMember\n2021-03-
......
SVMHW:Integer\nxbrli:pure\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWASHINGTON, D.C. 20549\nFORM\n20-F\n(Mark One)\n☐\nREGISTRATION STATEMENT PURSUANT TO SECTION 12(b) OR 12(g) OF THE SECURITIES EXCHANGE ACT OF 1934\nOR\n☐\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended\nMarch 31\n,\n2023\nOR\n☐\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nOR\n☒\nSHELL COMPANY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nDate of event requiring this shell company report:\nDecember 8, 2023\nCommission File Number:\n333-272717\nSRIVARU Holding Limited\n(Exact name of Registrant as specified in its charter)\nNot applicable\nCayman Islands\n(Translation of Registrant’s name into English)\n(Jurisdiction of incorporation or organization)\n2nd Floor, Regatta Office Park\n,\nWest Bay Road\nP.O. Box 10655\nGrand Cayman\n,\nKY1-1006\nCayman Islands\n(Address of Principal Executive Offices)\nSRIVARU Holding Limited\n2nd Floor, Regatta Office Park,\nWest Bay Road\nP.O. 
Box 10655\nGrand Cayman\n,\nKY1-1006\nCayman Islands\nTelephone:\n+1 (888)\n227-8066\nEmail: ir@srivarumotors.com\n(Name, Telephone, Email and/or Facsimile number and Address of Company Contact Person)\nSecurities registered or to be registered pursuant to Section 12(b) of the Act:\nTitle of each class\nTrading Symbol(s)\nName of each exchange\non which registered\nOrdinary shares\nSVMH\nThe\nNasdaq\nGlobal Market\nWarrants\nSVMHW\nThe\nNasdaq\nGlobal Market\nSecurities registered or to be registered pursuant to Section 12(g) of the Act:\nNone\n(Title of Class)\nSecurities for which there is a reporting obligation pursuant to Section 15(d) of the Act:\nNone\n(Title of Class)\nIndicate the number of outstanding shares of each of the issuer’s classes of capital or common stock as of the close of the period covered by the shell company report:\n14,946,286\nordinary shares and 10,005,000 warrants.\nIndicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☐\nNo\n☒\nIf this report is an annual or transition report, indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934. Yes ☐\nNo\n☒\nIndicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☐\nNo\n☒\nIndicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit and post such files).\nYes\n☒ No ☐\nIf securities are registered pursuant to Section 12(b) of the Act, indicate by check mark whether the financial statements of the registrant included in the filing reflect the correction of an error to previously issued financial statements.\n☐\nIndicate by check mark whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the registrant’s executive officers during the relevant recovery period pursuant to §240.10D-1(b).\u202f☐\nIndicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or an emerging growth company. See definition of “large accelerated filer,” “accelerated filer,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\nLarge accelerated filer\n☐\nAccelerated filer\n☐\nNon-accelerated filer\n☒\nEmerging growth company\n☒\nIf an emerging growth company that prepares its financial statements in accordance with U.S. GAAP, indicate by check mark if the registrant has elected to use the extended transition period for complying with any new or revised financial accounting standards† provided pursuant to Section 13(a) of the Exchange Act.
......
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三csv文件&#34;&gt;三、csv文件&lt;/h2&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-file.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h3 id=&#34;31-读取&#34;&gt;3.1 读取&lt;/h3&gt;
&lt;p&gt;csv是对所有html的汇总文件， 如果电脑内存OK， 直接读取 &lt;code&gt;美股年报_10-K和20-F.csv.gz(14.27G，解压后大概50+G)&lt;/code&gt;。&lt;/p&gt;
&lt;p&gt;我使用的电脑内存256G，读取时间大概17min。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;美股年报_10-K和20-F.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;converters&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;cik&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;常见电脑内存一般8~16G， 可以借鉴这篇推文 &lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;&lt;strong&gt;代码 | 如何处理远超电脑内存的csv文件&lt;/strong&gt;&lt;/a&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#只读取5行&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;美股年报_10-K和20-F.csv.gz&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;converters&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;cik&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#防止股票代码被识别为数字&lt;/span&gt;
                  &lt;span class=&#34;n&#34;&gt;compression&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-nrows5.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-公司数量&#34;&gt;3.2 公司数量&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;cik&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;33619
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-查看content&#34;&gt;3.3 查看content&lt;/h3&gt;
&lt;p&gt;使用df.loc方式查看content字段的内容&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#第一行，content字段&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;loc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;#39;10-K\n1\nw46943e10-k.txt\nANNUAL REPORT FOR FISCAL YEAR ENDED 12/30/2000\n1 SECURITIES AND EXCHANGE COMMISSION WASHINGTON, D.C. 20549 FORM 10-K (Mark One) [X] Annual report pursuant to section 13 or 15(d) of the Securities Exchange Act of 1934 [NO FEE REQUIRED] for the fiscal year ended December 30, 2000 or [ ] Transition report pursuant to section 13 or 15(d) of the Securities Exchange Act of 1934 [NO FEE REQUIRED] for the transition period from ________ to ________ COMMISSION FILE NUMBER 0-9576 ------ K-TRON INTERNATIONAL, INC. (EXACT NAME OF REGISTRANT AS SPECIFIED IN ITS CHARTER)\nNew Jersey 22-1759452 ------------ ------------\n(State or other jurisdiction of (I.R.S. Employer Identification No.) incorporation or organization)\nRoutes 55 and 553 P.O. Box 888 Pitman, New Jersey 08071-0888 -------------------- ---------- (Address of principal executive offices) (Zip Code) Registrant\&amp;#39;s telephone number, including area code: (856) 589-0500 -------------- Securities registered pursuant to Section 12(b) of the Act:\nTitle of each class Name of each exchange on which registered\nNone None ------------------- -----------------------------------------\nSecurities registered pursuant to Section 12(g) of the Act: Common Stock, par value $.01 per share -------------------------------------- (Title of class) Indicate by check mark whether the Registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the Registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes X No --- ---\n2 Indicate by check mark if disclosure of delinquent filers pursuant to Item 405 of Regulation S-K is not contained herein, and will not be contained, to the best of Registrant\&amp;#39;s knowledge, in the definitive proxy statement incorporated by reference in Part III of this annual report on Form 10-K or any amendment to this annual report on Form 10-K. |X| As of February 28, 2001, the aggregate market value of the Common Stock held by non-affiliates of the Registrant was $35,606,718. Such aggregate market value was computed by reference to the closing sale price of the Common Stock as quoted on the Nasdaq National Market on such date. For purposes of making this calculation only, the Registrant has defined affiliates as including all directors and executive ......此处略去无数字
......此处略去无数字
......此处略去无数字

Amendment No. 1 to Employment Agreement dated October 5, 1998 by and between K-Tron International, Inc. and Edward B. Cloues, II (Filed as Exhibit 10.1 to our report on Form 10-Q for the quarterly period ended October 3, 1998 and incorporated herein by reference)** 10.10 Form of Employment Agreement with certain of our employees, which are identical in all material respects except for the employee, amount of salary to be paid and date of execution (Filed as Exhibit 10.12 to our annual report on Form 10-K for the year ended January 3, 1998 and incorporated herein by reference)** 10.11 Form of Indemnification Agreement with certain of our directors and officers listed on Schedule 10.11, which are identical in all material respects except for the director or officer who is a party thereto and the date of execution (Filed as Exhibit 10.11 to the 1999 Form 10-K and incorporated herein by reference)** 10.12 Leasing Agreement dated October 30, 1990 between CS Immobilien Leasing AG, Zurich and Hasler Freres SA, with limited guaranty of K-Tron Soder AG (Filed as Exhibit 10.1(b) to our report on Form 8-K dated October 30, 1990 and incorporated herein by reference) 10.13 Amendment, dated January 25, 1991, to Leasing Agreement, dated October 30, 1990, between CS Immobilien Leasing AG, Zurich and Hasler Freres SA and to the related limited guaranty of K-Tron Soder AG (Filed as Exhibit 10.3.3 to our annual report on Form 10-K for the year ended December 29, 1990 and incorporated herein by reference) 10.14 Note dated February 4, 2000 from K-Tron America, Inc. in favor of The Bank of Gloucester County (Filed as Exhibit (b)(1) on Amendment No.1 to our Tender Offer Statement on Schedule TO dated February 16, 2000 and incorporated herein by reference)\n55 10.15 Mortgage Note dated June 11, 1996 from K-Tron America, Inc. 
in favor of The Bank of Gloucester County (Filed as Exhibit 10.15 to the 1999 Form 10-K and incorporated herein by reference) 10.16 Loan Modification Agreement dated June 24, 1998 between K-Tron America, Inc. and The Bank of Gloucester County (Filed as Exhibit 10.16 to the 1999 Form 10-K and incorporated herein by reference) 10.17 Note dated June 24, 1998 from K-Tron America, Inc. in favor of The Bank of Gloucester County (Filed as Exhibit 10.17 to the 1999 Form 10-K and incorporated herein by reference) 10.18 Loan Modification Agreement dated as of July 22, 1999 between K-Tron America, Inc. and The Bank of Gloucester County (Filed as Exhibit 10.18 to the 1999 Form 10-K and incorporated herein by reference) 10.19 Loan Modification Agreement dated June 21, 2000 between K-Tron America, Inc. and The Bank of Gloucester County* 21.1 Subsidiaries* 23.1 Consent of Arthur Andersen LLP* 24.1 Power of Attorney (Included on Signature Page)* -------------------- * Filed herewith ** Management contract or compensatory plan or arrangement required to be filed or incorporated as an exhibit&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-日期&#34;&gt;3.4 日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;account_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;account_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#会计期末account_date&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;account_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;account_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2000-01-31 00:00:00
2023-10-31 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#报告发布日期&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pub_date&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2000-07-05 00:00:00
2024-01-05 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;四相关内容&#34;&gt;四、相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/&#34;&gt;&lt;strong&gt;数据集 | 港股年报文本数据集(2007 ~ 2025)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/&#34;&gt;&lt;strong&gt;数据集 | 三板上市公司年报2002-2025&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/&#34;&gt;&lt;strong&gt;数据集 | 2001-2024年A股上市公司年报&amp;amp;管理层讨论与分析&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据名称: 美股年报10-K、20-F报告
数据来源: SEC
报告类型: 10-K、20-F
公司数量: 33619
报告数量: 189739
覆盖日期: 2000-07-05 ~ 2024-01-05
数据类型: html、csv(csv是对所有html的汇总文件)
数据体积: 378G
</code></pre></div><p><img loading="lazy" src="img/01-size.jpg" alt=""  />
</p>
<br>
<h3 id="11-声明">1.1 声明</h3>
<p>科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<br>
<h3 id="12-格式说明">1.2 格式说明</h3>
<p>美股报告是html格式(中国沪深交易所的报告是pdf格式)，可以通过爬虫批量下载所有的报告，并保存为html。</p>
<p>以苹果公司为例，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm
</code></pre></div><p><img loading="lazy" src="img/03-apple.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二-html文件">二、 html文件</h2>
<p>美股报告数据以html格式存储，总体积约378G。了解其命名规则和处理方式，才能更好地使用该数据集。</p>
<p><img loading="lazy" src="img/04-2023.png" alt=""  />
</p>
<br>
<h3 id="21-html命名规则">2.1 html命名规则</h3>
<p>以 <code>1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html</code> 为例，html命名依次为：CIK码(SEC公司识别码)、会计期末、股票代码、公司全名、Form类型、报告发布日期。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">file</span> <span class="o">=</span> <span class="s1">&#39;1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html&#39;</span>
<span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;_&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;1973368&#39;,
 &#39;2023-03-31&#39;,
 &#39;SVMH&#39;,
 &#39;SRIVARU Holding Ltd&#39;,
 &#39;20-F&#39;,
 &#39;2023-12-28.html&#39;]
</code></pre></div><br>
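<p>基于上述命名规则，可以批量把文件名解析为索引表，方便按公司或年份筛选报告（示意代码，假设文件名各字段内部不含下划线）：</p>

```python
import pandas as pd

# 示意文件列表：实际使用时可用 os.listdir 获取 html 文件名
files = [
    '1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html',
]

records = []
for f in files:
    # 文件名六段式：CIK_会计期末_股票代码_公司全名_Form类型_发布日期.html
    cik, account_date, ticker, company, form, pub = f.split('_')
    records.append({
        'cik': cik,                            # CIK码
        'account_date': account_date,          # 会计期末
        'ticker': ticker,                      # 股票代码
        'company': company,                    # 公司全名
        'form': form,                          # Form类型
        'pub_date': pub.replace('.html', ''),  # 报告发布日期
    })

index_df = pd.DataFrame(records)
print(index_df)
```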
<h3 id="22-提取文本">2.2 提取文本</h3>
<p>如果觉得html不方便分析，可以使用 pyquery、BeautifulSoup等html解析库，提取html中的文本内容。本文以pyquery为例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyquery</span> <span class="kn">import</span> <span class="n">PyQuery</span>

<span class="n">file</span> <span class="o">=</span> <span class="s1">&#39;1973368_2023-03-31_SVMH_SRIVARU Holding Ltd_20-F_2023-12-28.html&#39;</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">PyQuery</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">doc</span><span class="o">.</span><span class="n">text</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">bazadebezolkohpepadr=&#34;608506832&#34;\nfalse\nFY\n0001973368\n0001973368\n2022-04-01\n2023-03-31\n0001973368\ndei:BusinessContactMember\n2022-04-01\n2023-03-31\n0001973368\nSVMHW:OrdinarySharesMember\n2022-04-01\n2023-03-31\n0001973368\nSVMHW:WarrantsMember\n2022-04-01\n2023-03-31\n0001973368\n2023-03-31\n0001973368\n2022-03-31\n0001973368\n2021-06-16\n2022-03-31\n0001973368\nSVMHW:PredecessorMember\n2021-04-01\n2021-06-15\n0001973368\n2021-04-01\n2021-06-15\n0001973368\nSVMHW:PredecessorMember\nus-gaap:CommonStockMember\n2021-03-31\n0001973368\nSVMHW:PredecessorMember\nSVMHW:SharePremiumMember\n2021-03-31\n0001973368\nSVMHW:PredecessorMember\nus-gaap:RetainedEarningsMember\n2021-03-
......
SVMHW:Integer\nxbrli:pure\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWASHINGTON, D.C. 20549\nFORM\n20-F\n(Mark One)\n☐\nREGISTRATION STATEMENT PURSUANT TO SECTION 12(b) OR 12(g) OF THE SECURITIES EXCHANGE ACT OF 1934\nOR\n☐\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended\nMarch 31\n,\n2023\nOR\n☐\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nOR\n☒\nSHELL COMPANY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nDate of event requiring this shell company report:\nDecember 8, 2023\nCommission File Number:\n333-272717\nSRIVARU Holding Limited\n(Exact name of Registrant as specified in its charter)\nNot applicable\nCayman Islands\n(Translation of Registrant’s name into English)\n(Jurisdiction of incorporation or organization)\n2nd Floor, Regatta Office Park\n,\nWest Bay Road\nP.O. Box 10655\nGrand Cayman\n,\nKY1-1006\nCayman Islands\n(Address of Principal Executive Offices)\nSRIVARU Holding Limited\n2nd Floor, Regatta Office Park,\nWest Bay Road\nP.O. 
Box 10655\nGrand Cayman\n,\nKY1-1006\nCayman Islands\nTelephone:\n+1 (888)\n227-8066\nEmail: ir@srivarumotors.com\n(Name, Telephone, Email and/or Facsimile number and Address of Company Contact Person)\nSecurities registered or to be registered pursuant to Section 12(b) of the Act:\nTitle of each class\nTrading Symbol(s)\nName of each exchange\non which registered\nOrdinary shares\nSVMH\nThe\nNasdaq\nGlobal Market\nWarrants\nSVMHW\nThe\nNasdaq\nGlobal Market\nSecurities registered or to be registered pursuant to Section 12(g) of the Act:\nNone\n(Title of Class)\nSecurities for which there is a reporting obligation pursuant to Section 15(d) of the Act:\nNone\n(Title of Class)\nIndicate the number of outstanding shares of each of the issuer’s classes of capital or common stock as of the close of the period covered by the shell company report:\n14,946,286\nordinary shares and 10,005,000 warrants.\nIndicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☐\nNo\n☒\nIf this report is an annual or transition report, indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934. Yes ☐\nNo\n☒\nIndicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☐\nNo\n☒\nIndicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit and post such files).\nYes\n☒ No ☐\nIf securities are registered pursuant to Section 12(b) of the Act, indicate by check mark whether the financial statements of the registrant included in the filing reflect the correction of an error to previously issued financial statements.\n☐\nIndicate by check mark whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the registrant’s executive officers during the relevant recovery period pursuant to §240.10D-1(b).\u202f☐\nIndicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or an emerging growth company. See definition of “large accelerated filer,” “accelerated filer,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\nLarge accelerated filer\n☐\nAccelerated filer\n☐\nNon-accelerated filer\n☒\nEmerging growth company\n☒\nIf an emerging growth company that prepares its financial statements in accordance with U.S. GAAP, indicate by check mark if the registrant has elected to use the extended transition period for complying with any new or revised financial accounting standards† provided pursuant to Section 13(a) of the Exchange Act.
......
</code></pre></div><p><br><br></p>
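<p>如果不想安装第三方解析库，也可以只用标准库 html.parser 实现一个极简的文本提取器。以下为示意代码，仅演示思路（跳过 script/style 内容），实际年报还需处理编码等细节：</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """收集页面正文文本，跳过 script/style 中的内容"""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # 大于0表示当前处于 script/style 内部

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

parser = TextExtractor()
parser.feed('<html><script>var x=1;</script><p>FORM <b>20-F</b></p></html>')
text = '\n'.join(parser.parts)
print(text)
```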
<h2 id="三csv文件">三、csv文件</h2>
<p><img loading="lazy" src="img/02-file.png" alt=""  />
</p>
<h3 id="31-读取">3.1 读取</h3>
<p>csv是对所有html的汇总文件。如果电脑内存充足，可直接读取 <code>美股年报_10-K和20-F.csv.gz(14.27G，解压后大概50+G)</code>。</p>
<p>我使用的电脑内存为256G，读取时间大概17min。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;美股年报_10-K和20-F.csv&#39;</span><span class="p">,</span> <span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;cik&#39;</span><span class="p">:</span> <span class="nb">str</span><span class="p">})</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<p>常见电脑内存一般8~16G， 可以借鉴这篇推文 <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>代码 | 如何处理远超电脑内存的csv文件</strong></a>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#只读取5行</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;美股年报_10-K和20-F.csv.gz&#39;</span><span class="p">,</span> 
                  <span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;cik&#39;</span><span class="p">:</span> <span class="nb">str</span><span class="p">},</span> <span class="c1">#防止股票代码被识别为数字</span>
                  <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> 
                  <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df2</span>
</code></pre></div><p><img loading="lazy" src="img/06-nrows5.png" alt=""  />
</p>
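<p>除 nrows 外，还可以用 chunksize 分块迭代，逐块处理整份文件而不必一次载入内存。下面用内存中的小样本演示（示意代码，实际使用时把 io.StringIO(...) 换成 csv 文件路径并加 compression='gzip' 即可）：</p>

```python
import io
import pandas as pd

# 小样本演示 chunksize 分块读取；真实数据量大时逐块统计可控制内存占用
csv_text = 'cik,content\n0000320193,apple 10-K\n0001973368,srivaru 20-F\n'
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text),
                         converters={'cik': str},  # 防止cik前导零丢失
                         chunksize=1):
    total += len(chunk)
print(total)  # 2
```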
<br>
<h3 id="32-公司数量">3.2 公司数量</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;cik&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">33619
</code></pre></div><br>
<h3 id="33-查看content">3.3 查看content</h3>
<p>使用df.loc方式查看content字段的内容</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#第一行，content字段</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;10-K\n1\nw46943e10-k.txt\nANNUAL REPORT FOR FISCAL YEAR ENDED 12/30/2000\n1 SECURITIES AND EXCHANGE COMMISSION WASHINGTON, D.C. 20549 FORM 10-K (Mark One) [X] Annual report pursuant to section 13 or 15(d) of the Securities Exchange Act of 1934 [NO FEE REQUIRED] for the fiscal year ended December 30, 2000 or [ ] Transition report pursuant to section 13 or 15(d) of the Securities Exchange Act of 1934 [NO FEE REQUIRED] for the transition period from ________ to ________ COMMISSION FILE NUMBER 0-9576 ------ K-TRON INTERNATIONAL, INC. (EXACT NAME OF REGISTRANT AS SPECIFIED IN ITS CHARTER)\nNew Jersey 22-1759452 ------------ ------------\n(State or other jurisdiction of (I.R.S. Employer Identification No.) incorporation or organization)\nRoutes 55 and 553 P.O. Box 888 Pitman, New Jersey 08071-0888 -------------------- ---------- (Address of principal executive offices) (Zip Code) Registrant\&#39;s telephone number, including area code: (856) 589-0500 -------------- Securities registered pursuant to Section 12(b) of the Act:\nTitle of each class Name of each exchange on which registered\nNone None ------------------- -----------------------------------------\nSecurities registered pursuant to Section 12(g) of the Act: Common Stock, par value $.01 per share -------------------------------------- (Title of class) Indicate by check mark whether the Registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the Registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes X No --- ---\n2 Indicate by check mark if disclosure of delinquent filers pursuant to Item 405 of Regulation S-K is not contained herein, and will not be contained, to the best of Registrant\&#39;s knowledge, in the definitive proxy statement incorporated by reference in Part III of this annual report on Form 10-K or any amendment to this annual report on Form 10-K. |X| As of February 28, 2001, the aggregate market value of the Common Stock held by non-affiliates of the Registrant was $35,606,718. Such aggregate market value was computed by reference to the closing sale price of the Common Stock as quoted on the Nasdaq National Market on such date. For purposes of making this calculation only, the Registrant has defined affiliates as including all directors and executive ......此处略去无数字
......此处略去无数字
......此处略去无数字

Amendment No. 1 to Employment Agreement dated October 5, 1998 by and between K-Tron International, Inc. and Edward B. Cloues, II (Filed as Exhibit 10.1 to our report on Form 10-Q for the quarterly period ended October 3, 1998 and incorporated herein by reference)** 10.10 Form of Employment Agreement with certain of our employees, which are identical in all material respects except for the employee, amount of salary to be paid and date of execution (Filed as Exhibit 10.12 to our annual report on Form 10-K for the year ended January 3, 1998 and incorporated herein by reference)** 10.11 Form of Indemnification Agreement with certain of our directors and officers listed on Schedule 10.11, which are identical in all material respects except for the director or officer who is a party thereto and the date of execution (Filed as Exhibit 10.11 to the 1999 Form 10-K and incorporated herein by reference)** 10.12 Leasing Agreement dated October 30, 1990 between CS Immobilien Leasing AG, Zurich and Hasler Freres SA, with limited guaranty of K-Tron Soder AG (Filed as Exhibit 10.1(b) to our report on Form 8-K dated October 30, 1990 and incorporated herein by reference) 10.13 Amendment, dated January 25, 1991, to Leasing Agreement, dated October 30, 1990, between CS Immobilien Leasing AG, Zurich and Hasler Freres SA and to the related limited guaranty of K-Tron Soder AG (Filed as Exhibit 10.3.3 to our annual report on Form 10-K for the year ended December 29, 1990 and incorporated herein by reference) 10.14 Note dated February 4, 2000 from K-Tron America, Inc. in favor of The Bank of Gloucester County (Filed as Exhibit (b)(1) on Amendment No.1 to our Tender Offer Statement on Schedule TO dated February 16, 2000 and incorporated herein by reference)\n55 10.15 Mortgage Note dated June 11, 1996 from K-Tron America, Inc. 
in favor of The Bank of Gloucester County (Filed as Exhibit 10.15 to the 1999 Form 10-K and incorporated herein by reference) 10.16 Loan Modification Agreement dated June 24, 1998 between K-Tron America, Inc. and The Bank of Gloucester County (Filed as Exhibit 10.16 to the 1999 Form 10-K and incorporated herein by reference) 10.17 Note dated June 24, 1998 from K-Tron America, Inc. in favor of The Bank of Gloucester County (Filed as Exhibit 10.17 to the 1999 Form 10-K and incorporated herein by reference) 10.18 Loan Modification Agreement dated as of July 22, 1999 between K-Tron America, Inc. and The Bank of Gloucester County (Filed as Exhibit 10.18 to the 1999 Form 10-K and incorporated herein by reference) 10.19 Loan Modification Agreement dated June 21, 2000 between K-Tron America, Inc. and The Bank of Gloucester County* 21.1 Subsidiaries* 23.1 Consent of Arthur Andersen LLP* 24.1 Power of Attorney (Included on Signature Page)* -------------------- * Filed herewith ** Management contract or compensatory plan or arrangement required to be filed or incorporated as an exhibit&#39;
</code></pre></div><br>
<h3 id="34-日期">3.4 日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;account_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;account_date&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">])</span>

<span class="c1">#会计期末account_date</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;account_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;account_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2000-01-31 00:00:00
2023-10-31 00:00:00
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#报告发布日期</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pub_date&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2000-07-05 00:00:00
2024-01-05 00:00:00
</code></pre></div><br>
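<p>有了两个日期字段，还可以顺手算出披露时滞（发布日期与会计期末的间隔天数）。以下为示意代码，数据为演示用：</p>

```python
import pandas as pd

# 由会计期末 account_date 与发布日期 pub_date 计算披露时滞（天）
df = pd.DataFrame({'account_date': ['2023-03-31'],
                   'pub_date': ['2023-12-28']})
df['account_date'] = pd.to_datetime(df['account_date'])
df['pub_date'] = pd.to_datetime(df['pub_date'])
df['lag_days'] = (df['pub_date'] - df['account_date']).dt.days
print(df['lag_days'].iloc[0])  # 272
```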
<br>
<h2 id="四相关内容">四、相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2024-01-21-hk-stock-market-anual-report/"><strong>数据集 | 港股年报文本数据集(2007 ~ 2025)</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-01-18-neeq-china-listed-on-nation-equities-exchange-and-quotation-system-anunal-year-report/"><strong>数据集 | 三板上市公司年报2002-2025</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-03-23-china-a-share-market-dataset-mda-from-01-to-21/"><strong>数据集 | 2001-2024年A股上市公司年报&amp;管理层讨论与分析</strong></a></li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>如何设计好 lambda 函数 ？</title>
      <link>https://textdata.cn/blog/2024-01-03-how-to-design-lambda-function/</link>
      <pubDate>Wed, 03 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-03-how-to-design-lambda-function/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;本文来源：掘金社区。仅用于传递和分享更多信息，并不代表本平台赞同其观点和对其真实性负责，版权归原作者所有，如有侵权请联系我们删除。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;p&gt;当你需要就地完成一件小任务时，有一种函数能让工作得心应手，它就是 lambda 函数。&lt;/p&gt;
&lt;p&gt;lambda 函数是 Python 中的匿名函数，有些人将其简称为 lambdas。它的语法如下：&lt;code&gt;lambda arguments: expression&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;lambda 关键字可以用来创建一个 lambda 函数，紧跟其后的是参数列表和用冒号分割开的单个表达式。例如，lambda x: 2 * x 是将任何输入的数乘2，而 lambda x, y: x+y 是计算两个数字的和。语法十分直截了当，对吧？
假设您知道什么是 lambda 函数，本文旨在提供有关如何正确使用 lambda 函数的一些常规准则。
&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-不要返回任何值&#34;&gt;1. 不要返回任何值&lt;/h2&gt;
&lt;p&gt;看看语法，您可能会注意到我们在 lambda 函数中并没有返回任何内容。这都是因为 lambda 函数只能包含一个表达式。然而，使用 return 关键字会构成不符合规定语法的语句，如下所示：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; integers = [(3, -3), (2, 3), (5, 1), (-4, 4)]
&amp;gt;&amp;gt;&amp;gt; sorted(integers, key=lambda x: x[-1])
[(3, -3), (5, 1), (2, 3), (-4, 4)]
&amp;gt;&amp;gt;&amp;gt; sorted(integers, key=lambda x: return x[-1])
... 
  File &amp;#34;&amp;#34;, line 1
    sorted(integers, key=lambda x: return x[-1])
                                   ^
SyntaxError: invalid syntax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;该错误可能是由于无法区分表达式和语句而引起的。像是包含 return、try、 with 以及 if 的语句会执行特殊动作。然而，表达式指的是那些可以被计算出一个值的表达，例如数值或其他 Python 对象。
通过使用 lambda 函数，单个表达式会被计算为一个值并且参与后续的计算，例如由 sorted 函数排序。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-不要忘记更好的选择&#34;&gt;2. 不要忘记更好的选择&lt;/h2&gt;
&lt;p&gt;lambda 函数最常见的使用场景是将它作为一些内置工具函数中 key 的实参，比如上面展示的 sorted() 和 max()。根据情况，我们可以使用其他替代方法。思考下面的例子：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; integers = [-4, 3, 7, -5, -2, 6]
&amp;gt;&amp;gt;&amp;gt; sorted(integers, key=lambda x: abs(x))
[-2, 3, -4, -5, 6, 7]
&amp;gt;&amp;gt;&amp;gt; sorted(integers, key=abs)
[-2, 3, -4, -5, 6, 7]
&amp;gt;&amp;gt;&amp;gt; scores = [(93, 100), (92, 99), (95, 94)]
&amp;gt;&amp;gt;&amp;gt; max(scores, key=lambda x: x[0] + x[1])
(93, 100)
&amp;gt;&amp;gt;&amp;gt; max(scores, key=sum)
(93, 100)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;在数据科学领域，很多人使用 pandas 库来处理数据。如下所示，我们可以使用 lambda 函数通过 map() 函数从现有数据中创建新数据。除了使用 lambda 函数外，我们还可以直接使用算术函数，因为 pandas 是支持的：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; import pandas as pd
&amp;gt;&amp;gt;&amp;gt; data = pd.Series([1, 2, 3, 4])
&amp;gt;&amp;gt;&amp;gt; data.map(lambda x: x + 5)
0    6
1    7
2    8
3    9
dtype: int64
&amp;gt;&amp;gt;&amp;gt; data + 5
0    6
1    7
2    8
3    9
dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;3-不要将它赋值给变量&#34;&gt;3. 不要将它赋值给变量&lt;/h2&gt;
&lt;p&gt;我曾见过一些人将 lambda 函数误认为是简单函数的另一种声明方式，您可能也见过有人像下面这么做：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; doubler = lambda x: 2 * x
&amp;gt;&amp;gt;&amp;gt; doubler(5)
10
&amp;gt;&amp;gt;&amp;gt; doubler(7)
14
&amp;gt;&amp;gt;&amp;gt; type(doubler)
&amp;lt;class &amp;#39;function&amp;#39;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;对 lambda 函数命名的唯一作用可能是出于教学目的，以表明 lambda 函数的确是和其他函数一样的函数——可以被调用并且具有某种功能。除此之外，我们不应该将 lambda 函数赋值给变量。&lt;/p&gt;
&lt;p&gt;为 lambda 函数命名的问题在于这使得调试不那么直观。与其他的使用常规 def 关键字创建的函数不同，lambda 函数没有名字，这也是为什么有时它们被称为匿名函数的原因。思考下面简单的例子，找出细微的区别：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; inversive0 = lambda x: 1 / x
&amp;gt;&amp;gt;&amp;gt; inversive0(2)
0.5
&amp;gt;&amp;gt;&amp;gt; inversive0(0)
Traceback (most recent call last):
  File &amp;#34;&amp;lt;stdin&amp;gt;&amp;#34;, line 1, in &amp;lt;module&amp;gt;
  File &amp;#34;&amp;lt;stdin&amp;gt;&amp;#34;, line 1, in &amp;lt;lambda&amp;gt;
ZeroDivisionError: division by zero
&amp;gt;&amp;gt;&amp;gt; def inversive1(x):
...     return 1 / x
... 
&amp;gt;&amp;gt;&amp;gt; inversive1(2)
0.5
&amp;gt;&amp;gt;&amp;gt; inversive1(0)
Traceback (most recent call last):
  File &amp;#34;&amp;lt;stdin&amp;gt;&amp;#34;, line 1, in &amp;lt;module&amp;gt;
  File &amp;#34;&amp;lt;stdin&amp;gt;&amp;#34;, line 2, in inversive1
ZeroDivisionError: division by zero
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;当代码中的 lambda 函数出了问题（如 inversive0），Traceback 只会提示问题出在某个 lambda 函数（显示为 &amp;lt;lambda&amp;gt;），却无法指明具体是哪一个；相比之下，使用常规方式定义的函数，Traceback 会清晰地给出出问题的函数名（如 inversive1）。因此，如果您想多次使用某个函数，最佳实践是用 def 定义常规函数，它还支持编写文档字符串。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;4-不要忘记列表推导式&#34;&gt;4. 不要忘记列表推导式&lt;/h2&gt;
&lt;p&gt;有些人喜欢将 lambda 函数和高阶函数一起使用，比如 map 或 filter。思考下面用法示例：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; # 创建一个数字列表
&amp;gt;&amp;gt;&amp;gt; numbers = [2, 1, 3, -3]
&amp;gt;&amp;gt;&amp;gt; # 使用带有 lambda 函数的 map 函数
&amp;gt;&amp;gt;&amp;gt; list(map(lambda x: x * x, numbers))
[4, 1, 9, 9]
&amp;gt;&amp;gt;&amp;gt; # 使用带有 lambda 函数的 filter 函数
&amp;gt;&amp;gt;&amp;gt; list(filter(lambda x: x % 2, numbers))
[1, 3, -3]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;我们可以使用可读性更强的列表推导式代替 lambda 函数。如下所示，我们使用列表推导式来创建相同的列表对象。如您所见，与列表推导式相比，之前将 map 或 filter 函数与 lambda 函数一起使用更麻烦。因此，在创建涉及高阶函数的列表时，应考虑使用列表推导式。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;gt;&amp;gt;&amp;gt; # Use list comprehensions
&amp;gt;&amp;gt;&amp;gt; [x * x for x in numbers]
[4, 1, 9, 9]
&amp;gt;&amp;gt;&amp;gt; [x for x in numbers if x % 2]
[1, 3, -3]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;结论&#34;&gt;结论&lt;/h2&gt;
&lt;p&gt;在本文中，我们回顾了使用 lambda 函数可能会犯的四个常见错误。通过避免这些错误，您应该能在代码中正确使用 lambda 函数。
使用 lambda 函数的经验准则是保持简单以及只在本地使用一次。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<blockquote>
<p>本文来源：掘金社区。仅用于传递和分享更多信息，并不代表本平台赞同其观点和对其真实性负责，版权归原作者所有，如有侵权请联系我们删除。</p>
</blockquote>
<br>
<p>当你需要就地完成一件小任务时，有一种函数能让工作得心应手，它就是 lambda 函数。</p>
<p>lambda 函数是 Python 中的匿名函数，有些人将其简称为 lambdas。它的语法如下：<code>lambda arguments: expression</code></p>
<p>lambda 关键字用来创建一个 lambda 函数，紧随其后的是参数列表和用冒号分隔开的单个表达式。例如，<code>lambda x: 2 * x</code> 将任何输入的数乘以2，而 <code>lambda x, y: x + y</code> 计算两个数字的和。语法十分直截了当，对吧？
本文假设您已经知道什么是 lambda 函数，旨在提供正确使用 lambda 函数的一些常规准则。
<br></p>
<h2 id="1-不要返回任何值">1. 不要返回任何值</h2>
<p>看看语法，您可能会注意到我们在 lambda 函数中并没有返回任何内容。这都是因为 lambda 函数只能包含一个表达式。然而，使用 return 关键字会构成不符合规定语法的语句，如下所示：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; integers = [(3, -3), (2, 3), (5, 1), (-4, 4)]
&gt;&gt;&gt; sorted(integers, key=lambda x: x[-1])
[(3, -3), (5, 1), (2, 3), (-4, 4)]
&gt;&gt;&gt; sorted(integers, key=lambda x: return x[-1])
... 
  File &#34;&lt;stdin&gt;&#34;, line 1
    sorted(integers, key=lambda x: return x[-1])
                                   ^
SyntaxError: invalid syntax
</code></pre></div><br>
<p>该错误可能是由于无法区分表达式和语句而引起的。像是包含 return、try、 with 以及 if 的语句会执行特殊动作。然而，表达式指的是那些可以被计算出一个值的表达，例如数值或其他 Python 对象。
通过使用 lambda 函数，单个表达式会被计算为一个值并且参与后续的计算，例如由 sorted 函数排序。</p>
<p><br><br></p>
<h2 id="2-不要忘记更好的选择">2. 不要忘记更好的选择</h2>
<p>lambda 函数最常见的使用场景是将它作为一些内置工具函数中 key 的实参，比如上面展示的 sorted() 和 max()。根据情况，我们可以使用其他替代方法。思考下面的例子：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; integers = [-4, 3, 7, -5, -2, 6]
&gt;&gt;&gt; sorted(integers, key=lambda x: abs(x))
[-2, 3, -4, -5, 6, 7]
&gt;&gt;&gt; sorted(integers, key=abs)
[-2, 3, -4, -5, 6, 7]
&gt;&gt;&gt; scores = [(93, 100), (92, 99), (95, 94)]
&gt;&gt;&gt; max(scores, key=lambda x: x[0] + x[1])
(93, 100)
&gt;&gt;&gt; max(scores, key=sum)
(93, 100)
</code></pre></div><br>
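<p>当排序键只是取某个下标时，标准库 operator 模块的 itemgetter 也可以替代 lambda（示意）：</p>

```python
from operator import itemgetter

# 按每个元组的最后一个元素排序，等价于 key=lambda x: x[-1]
integers = [(3, -3), (2, 3), (5, 1), (-4, 4)]
result = sorted(integers, key=itemgetter(-1))
print(result)  # [(3, -3), (5, 1), (2, 3), (-4, 4)]
```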
<p>在数据科学领域，很多人使用 pandas 库来处理数据。如下所示，我们可以使用 lambda 函数通过 map() 函数从现有数据中创建新数据。除了使用 lambda 函数外，我们还可以直接使用算术函数，因为 pandas 是支持的：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; import pandas as pd
&gt;&gt;&gt; data = pd.Series([1, 2, 3, 4])
&gt;&gt;&gt; data.map(lambda x: x + 5)
0    6
1    7
2    8
3    9
dtype: int64
&gt;&gt;&gt; data + 5
0    6
1    7
2    8
3    9
dtype: int64
</code></pre></div><br>
<h2 id="3-不要将它赋值给变量">3. 不要将它赋值给变量</h2>
<p>我曾见过一些人将 lambda 函数误认为是简单函数的另一种声明方式，您可能也见过有人像下面这么做：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; doubler = lambda x: 2 * x
&gt;&gt;&gt; doubler(5)
10
&gt;&gt;&gt; doubler(7)
14
&gt;&gt;&gt; type(doubler)
&lt;class &#39;function&#39;&gt;
</code></pre></div><br>
<p>对 lambda 函数命名的唯一作用可能是出于教学目的，以表明 lambda 函数的确是和其他函数一样的函数——可以被调用并且具有某种功能。除此之外，我们不应该将 lambda 函数赋值给变量。</p>
<p>为 lambda 函数命名的问题在于这使得调试不那么直观。与其他的使用常规 def 关键字创建的函数不同，lambda 函数没有名字，这也是为什么有时它们被称为匿名函数的原因。思考下面简单的例子，找出细微的区别：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; inversive0 = lambda x: 1 / x
&gt;&gt;&gt; inversive0(2)
0.5
&gt;&gt;&gt; inversive0(0)
Traceback (most recent call last):
  File &#34;&lt;stdin&gt;&#34;, line 1, in &lt;module&gt;
  File &#34;&lt;stdin&gt;&#34;, line 1, in &lt;lambda&gt;
ZeroDivisionError: division by zero
&gt;&gt;&gt; def inversive1(x):
...     return 1 / x
... 
&gt;&gt;&gt; inversive1(2)
0.5
&gt;&gt;&gt; inversive1(0)
Traceback (most recent call last):
  File &#34;&lt;stdin&gt;&#34;, line 1, in &lt;module&gt;
  File &#34;&lt;stdin&gt;&#34;, line 2, in inversive1
ZeroDivisionError: division by zero
</code></pre></div><br>
<p>当代码中的 lambda 函数出了问题（如 inversive0），Traceback 只会提示问题出在某个 lambda 函数（显示为 &lt;lambda&gt;），却无法指明具体是哪一个；相比之下，使用常规方式定义的函数，Traceback 会清晰地给出出问题的函数名（如 inversive1）。因此，如果您想多次使用某个函数，最佳实践是用 def 定义常规函数，它还支持编写文档字符串。</p>
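<p>按上文建议，需要复用的逻辑可以改写为带文档字符串的常规函数（示意）：</p>

```python
# 常规函数有名字和文档字符串，出错时 Traceback 会显示函数名
def double(x):
    """返回 x 的两倍"""
    return 2 * x

print(double(5))          # 10
print(double.__name__)    # double
```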
<p><br><br></p>
<h2 id="4-不要忘记列表推导式">4. 不要忘记列表推导式</h2>
<p>有些人喜欢将 lambda 函数和高阶函数一起使用，比如 map 或 filter。思考下面用法示例：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; # 创建一个数字列表
&gt;&gt;&gt; numbers = [2, 1, 3, -3]
&gt;&gt;&gt; # 使用带有 lambda 函数的 map 函数
&gt;&gt;&gt; list(map(lambda x: x * x, numbers))
[4, 1, 9, 9]
&gt;&gt;&gt; # 使用带有 lambda 函数的 filter 函数
&gt;&gt;&gt; list(filter(lambda x: x % 2, numbers))
[1, 3, -3]
</code></pre></div><br>
<p>我们可以使用可读性更强的列表推导式代替 lambda 函数。如下所示，我们使用列表推导式来创建相同的列表对象。如您所见，与列表推导式相比，之前将 map 或 filter 函数与 lambda 函数一起使用更麻烦。因此，在创建涉及高阶函数的列表时，应考虑使用列表推导式。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&gt;&gt;&gt; # Use list comprehensions
&gt;&gt;&gt; [x * x for x in numbers]
[4, 1, 9, 9]
&gt;&gt;&gt; [x for x in numbers if x % 2]
[1, 3, -3]
</code></pre></div><p><br><br></p>
<h2 id="结论">结论</h2>
<p>在本文中，我们回顾了使用 lambda 函数可能会犯的四个常见错误。通过避免这些错误，您应该能在代码中正确使用 lambda 函数。
使用 lambda 函数的经验准则是保持简单以及只在本地使用一次。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 36330条上市公司仲裁数据(2000-2021)</title>
      <link>https://textdata.cn/blog/2024-01-03-listed-company-arbitration-dataset/</link>
      <pubDate>Wed, 03 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>/blog/2024-01-03-listed-company-arbitration-dataset/</guid>
      <description>&lt;h2 id=&#34;一数据介绍&#34;&gt;一、数据介绍&lt;/h2&gt;
&lt;h3 id=&#34;11-数据集概况&#34;&gt;1.1 数据集概况&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名: 上市公司仲裁数据
时间跨度: 2000-01-26 ~ 2021-09-28
案件数量: 36330
数据来源: 裁判文书网
下载链接: https://pan.baidu.com/s/16fBSpfJSididpPT43ew6fg?pwd=mm8r
本文声明: 科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;数据整理自&lt;a href=&#34;https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/&#34;&gt;数据集 | 中国裁判文书网(2010-2021)&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-声明&#34;&gt;1.2 声明&lt;/h3&gt;
&lt;p&gt;科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-相关文献&#34;&gt;1.3 相关文献&lt;/h3&gt;
&lt;p&gt;上市公司仲裁数据可用于衡量上市公司法律风险等，相关文献如下：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[1]冯延超,梁莱歆.上市公司法律风险、审计收费及非标准审计意见——来自中国上市公司的经验证据[J].审计研究,2010(03):75-81.
[2]祝继高.会计稳健性与债权人利益保护——基于银行与上市公司关于贷款的法律诉讼的研究[J].会计研究,2011(05):50-57+96.
[3]辛宇,黄欣怡,纪蓓蓓.投资者保护公益组织与股东诉讼在中国的实践——基于中证投服证券支持诉讼的多案例研究[J].管理世界,2020,36(01):69-87+235.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h3 id=&#34;14-字段&#34;&gt;1.4 字段&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt; -  公告日期
 -  股票代码
 -  股票简称
 -  涉案类型
 -  原告被告
 -  案件案由
 -  涉案金额
 -  判决情况
 -  执行情况
 -  货币种类
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;上市公司仲裁数据2000-2021.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-记录数&#34;&gt;2.2 记录数&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;36330
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-公司数&#34;&gt;2.3 公司数&lt;/h3&gt;
&lt;p&gt;涉案的上市公司数量&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nunique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2251
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;2-4-覆盖日期&#34;&gt;2.4 覆盖日期&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2000-01-26 00:00:00
2021-09-28 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;25-字段缺失率&#34;&gt;2.5 字段&amp;amp;缺失率&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;col&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ratio&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;isna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ratio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;公告日期 0.0
股票代码 0.0
股票简称 2.7525461051472613e-05
涉案类型 0.0002202036884117809
原告被告 0.001568951279933939
案件案由 0.00013762730525736306
涉案金额 0.00016515276630883568
判决情况 0.8911643270024773
执行情况 0.740765207817231
货币种类 0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据介绍">一、数据介绍</h2>
<h3 id="11-数据集概况">1.1 数据集概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名: 上市公司仲裁数据
时间跨度: 2000-01-26 ~ 2021-09-28
案件数量: 36330
数据来源: 裁判文书网
下载链接: https://pan.baidu.com/s/16fBSpfJSididpPT43ew6fg?pwd=mm8r
本文声明: 科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p>数据整理自<a href="https://textdata.cn/blog/2023-05-07-china-law-judgment-documents-datasets/">数据集 | 中国裁判文书网(2010-2021)</a></p>
<br>
<h3 id="12-声明">1.2 声明</h3>
<p>科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<br>
<h3 id="13-相关文献">1.3 相关文献</h3>
<p>上市公司仲裁数据可用于衡量上市公司法律风险等，相关文献如下：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]冯延超,梁莱歆.上市公司法律风险、审计收费及非标准审计意见——来自中国上市公司的经验证据[J].审计研究,2010(03):75-81.
[2]祝继高.会计稳健性与债权人利益保护——基于银行与上市公司关于贷款的法律诉讼的研究[J].会计研究,2011(05):50-57+96.
[3]辛宇,黄欣怡,纪蓓蓓.投资者保护公益组织与股东诉讼在中国的实践——基于中证投服证券支持诉讼的多案例研究[J].管理世界,2020,36(01):69-87+235.
</code></pre></div><br>
<br>
<h3 id="14-字段">1.4 字段</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 公告日期
- 股票代码
- 股票简称
- 涉案类型
- 原告被告
- 案件案由
- 涉案金额
- 判决情况
- 执行情况
- 货币种类
</code></pre></div><p><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;上市公司仲裁数据2000-2021.xlsx&#39;</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
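<p>补充说明：用 pandas 读取 Excel 时，股票代码常被解析为整数，深市代码（如 000001）会丢失前导零。下面是一个补零处理的示意写法（示例数据为虚构，假设代码列名为「股票代码」）：</p>

```python
import pandas as pd

# 示例数据(虚构)：股票代码被读成了整数，前导零丢失
df = pd.DataFrame({'股票代码': [1, 600519, 300750]})

# 转成字符串并补齐到 6 位，恢复 '000001' 这类深市代码
df['股票代码'] = df['股票代码'].astype(str).str.zfill(6)
print(df['股票代码'].tolist())  # ['000001', '600519', '300750']
```

<p>也可以在读取时直接用 <code>read_excel(..., dtype={'股票代码': str})</code> 保留原始字符串，再按需补零。</p>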
<br>
<h3 id="22-记录数">2.2 记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">36330
</code></pre></div><br>
<h3 id="23-公司数">2.3 公司数</h3>
<p>涉案的上市公司数量</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2251
</code></pre></div><br>
<h3 id="2-4-覆盖日期">2.4 覆盖日期</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2000-01-26 00:00:00
2021-09-28 00:00:00
</code></pre></div><br>
<h3 id="25-字段缺失率">2.5 字段&amp;缺失率</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">isna</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">,</span> <span class="n">ratio</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">公告日期 0.0
股票代码 0.0
股票简称 2.7525461051472613e-05
涉案类型 0.0002202036884117809
原告被告 0.001568951279933939
案件案由 0.00013762730525736306
涉案金额 0.00016515276630883568
判决情况 0.8911643270024773
执行情况 0.740765207817231
货币种类 0.0
</code></pre></div><br>
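<p>上面的循环也可以用 <code>df.isna().mean()</code> 一步得到各列缺失率，再按缺失率排序并格式化为百分比，读起来更直观（示意代码，使用虚构的小样本数据）：</p>

```python
import pandas as pd

# 虚构小样本，模拟'判决情况'大量缺失的情形
df = pd.DataFrame({
    '货币种类': ['CNY', 'CNY', 'CNY', 'CNY'],
    '判决情况': [None, '胜诉', None, None],
})

# isna().mean() 直接给出各列缺失率，无需手写 sum()/len()
na_ratio = df.isna().mean().sort_values(ascending=False)
for col, r in na_ratio.items():
    print(f'{col}: {r:.2%}')
# 判决情况: 75.00%
# 货币种类: 0.00%
```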
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）</title>
      <link>https://textdata.cn/blog/2023-12-29-china-area-dataset/</link>
      <pubDate>Fri, 29 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-29-china-area-dataset/</guid>
      <description>&lt;p&gt;最近分享的数据集一般都含有地址信息，因此很有必要寻找中国区划数据集，来帮助我们更好地清洗地址数据。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一数据集概况&#34;&gt;一、数据集概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据来源:  中华人民共和国国家统计局 
          https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/

下载地址: https://github.com/adyliu/china_area

数据量(2023年): 665552 

数据格式: csv.gz 或 sql.gz

级别:
   1级：省、直辖市、自治区
   2级：地级市
   3级：市辖区、县（旗）、县级市、自治县（自治旗）、特区、林区
   4级：镇、乡、民族乡、县辖区、街道
   5级：村、居委会
   
城乡分类 (1开头是城镇，2开头是乡村)
   111表示主城区；
   112表示城乡接合区；
   121表示镇中心区；
   122表示镇乡接合区；
   123表示特殊区域；
   210表示乡中心区；
   220表示村庄
   
   
code: 共12位(省2位，市2位，县2位，镇3位，村3位)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;按截图操作即可获取数据集&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;分省份2010-2024数据变化&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2010-2024.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;说明&#34;&gt;说明&lt;/h3&gt;
&lt;p&gt;科研用途展示， 如有问题， 加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;二读取数据&#34;&gt;二、读取数据&lt;/h2&gt;
&lt;p&gt;以 &lt;em&gt;&lt;strong&gt;area_code_2024.csv.gz&lt;/strong&gt;&lt;/em&gt; 为例，解压后得到 &lt;em&gt;&lt;strong&gt;area_code_2024.csv&lt;/strong&gt;&lt;/em&gt;：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;area_code_2024.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;header&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;pcode&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;665552
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;三查看区划等级&#34;&gt;三、查看区划等级&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;区划级别&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt; 1级：省、直辖市、自治区
 2级：地级市
 3级：市辖区、县（旗）、县级市、自治县（自治旗）、特区、林区
 4级：镇、乡、民族乡、县辖区、街道
 5级：村、居委会
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;br&gt;
&lt;h3 id=&#34;31-省&#34;&gt;3.1 省&lt;/h3&gt;
&lt;p&gt;查看所有省名字&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([&amp;#39;北京市&amp;#39;, &amp;#39;天津市&amp;#39;, &amp;#39;河北省&amp;#39;, &amp;#39;山西省&amp;#39;, &amp;#39;内蒙古自治区&amp;#39;, &amp;#39;辽宁省&amp;#39;, &amp;#39;吉林省&amp;#39;, &amp;#39;黑龙江省&amp;#39;, &amp;#39;上海市&amp;#39;,
       &amp;#39;江苏省&amp;#39;, &amp;#39;浙江省&amp;#39;, &amp;#39;安徽省&amp;#39;, &amp;#39;福建省&amp;#39;, &amp;#39;江西省&amp;#39;, &amp;#39;山东省&amp;#39;, &amp;#39;河南省&amp;#39;, &amp;#39;湖北省&amp;#39;, &amp;#39;湖南省&amp;#39;,
       &amp;#39;广东省&amp;#39;, &amp;#39;广西壮族自治区&amp;#39;, &amp;#39;海南省&amp;#39;, &amp;#39;重庆市&amp;#39;, &amp;#39;四川省&amp;#39;, &amp;#39;贵州省&amp;#39;, &amp;#39;云南省&amp;#39;, &amp;#39;西藏自治区&amp;#39;,
       &amp;#39;陕西省&amp;#39;, &amp;#39;甘肃省&amp;#39;, &amp;#39;青海省&amp;#39;, &amp;#39;宁夏回族自治区&amp;#39;, &amp;#39;新疆维吾尔自治区&amp;#39;], dtype=object)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有省的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([&amp;#39;11&amp;#39;, &amp;#39;12&amp;#39;, &amp;#39;13&amp;#39;, &amp;#39;14&amp;#39;, &amp;#39;15&amp;#39;, &amp;#39;21&amp;#39;, &amp;#39;22&amp;#39;, &amp;#39;23&amp;#39;, &amp;#39;31&amp;#39;, &amp;#39;32&amp;#39;, &amp;#39;33&amp;#39;,
       &amp;#39;34&amp;#39;, &amp;#39;35&amp;#39;, &amp;#39;36&amp;#39;, &amp;#39;37&amp;#39;, &amp;#39;41&amp;#39;, &amp;#39;42&amp;#39;, &amp;#39;43&amp;#39;, &amp;#39;44&amp;#39;, &amp;#39;45&amp;#39;, &amp;#39;46&amp;#39;, &amp;#39;50&amp;#39;,
       &amp;#39;51&amp;#39;, &amp;#39;52&amp;#39;, &amp;#39;53&amp;#39;, &amp;#39;54&amp;#39;, &amp;#39;61&amp;#39;, &amp;#39;62&amp;#39;, &amp;#39;63&amp;#39;, &amp;#39;64&amp;#39;, &amp;#39;65&amp;#39;], dtype=object)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;省份名和区划代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;province_code_df = pd.DataFrame(
    {&amp;#39;province&amp;#39;: df[df[&amp;#39;level&amp;#39;]==1][&amp;#39;name&amp;#39;].values,
    &amp;#39;code&amp;#39;:df[df[&amp;#39;level&amp;#39;]==1][&amp;#39;code&amp;#39;].astype(str).str[:2].values}
)

province_code_df
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-市&#34;&gt;3.2 市&lt;/h3&gt;
&lt;p&gt;code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有市的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;city_code_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;city&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
     &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;city_code_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-县&#34;&gt;3.3 县&lt;/h3&gt;
&lt;p&gt;code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有县的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;county_code_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;county&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
     &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;county_code_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;34-镇&#34;&gt;3.4 镇&lt;/h3&gt;
&lt;p&gt;code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有镇的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;zhen_code_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;zhen&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
     &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;9&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;zhen_code_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;35-村&#34;&gt;3.5 村&lt;/h3&gt;
&lt;p&gt;code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有村的代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;village_code_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;village&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
     &lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;level&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;code&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;village_code_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四-城乡分类&#34;&gt;四、 城乡分类&lt;/h2&gt;
&lt;p&gt;城乡分类 (1开头是城镇，2开头是乡村)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;111表示主城区；&lt;/li&gt;
&lt;li&gt;112表示城乡接合区；&lt;/li&gt;
&lt;li&gt;121表示镇中心区；&lt;/li&gt;
&lt;li&gt;122表示镇乡接合区；&lt;/li&gt;
&lt;li&gt;123表示特殊区域；&lt;/li&gt;
&lt;li&gt;210表示乡中心区；&lt;/li&gt;
&lt;li&gt;220表示村庄&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;查看所有的城镇&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#category以1为开头，即城镇&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;startswith&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;1&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/08-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;查看所有的镇中心区&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;121&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/09-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;相关内容&#34;&gt;相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-29-china-area-division-change/&#34;&gt;&lt;strong&gt;中国行政区划代码历史沿革数据库&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/10-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>最近分享的数据集大多含有地址信息，因此很有必要找一份中国行政区划数据集，来帮助我们更好地清洗地址数据。</p>
<p><br><br></p>
<h2 id="一数据集概况">一、数据集概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源:  中华人民共和国国家统计局 
          https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/

下载地址: https://github.com/adyliu/china_area

数据量(2023年): 665552 

数据格式: csv.gz 或 sql.gz

级别:
   1级：省、直辖市、自治区
   2级：地级市
   3级：市辖区、县（旗）、县级市、自治县（自治旗）、特区、林区
   4级：镇、乡、民族乡、县辖区、街道
   5级：村、居委会
   
城乡分类 (1开头是城镇，2开头是乡村)
   111表示主城区；
   112表示城乡接合区；
   121表示镇中心区；
   122表示镇乡接合区；
   123表示特殊区域；
   210表示乡中心区；
   220表示村庄
   
   
code: 共12位(省2位，市2位，县2位，镇3位，村3位)
</code></pre></div><br>
<p>按截图操作即可获取数据集</p>
<p><img loading="lazy" src="img/01-cover.png" alt=""  />
</p>
<p><strong>分省份2010-2024数据变化</strong></p>
<p><img loading="lazy" src="img/2010-2024.png" alt=""  />
</p>
<br>
<h3 id="说明">说明</h3>
<p>科研用途展示， 如有问题， 加微信372335839，备注「姓名-学校-专业」</p>
<br>
<h2 id="二读取数据">二、读取数据</h2>
<p>以 <em><strong>area_code_2024.csv.gz</strong></em> 为例， 解压后得到 <em><strong>area_code_2024.csv</strong></em>，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;area_code_2024.csv&#39;</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span><span class="c1">#, names=[&#39;name&#39;, &#39;level&#39;, &#39;code&#39;, &#39;class&#39;]</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;code&#39;</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;level&#39;</span><span class="p">,</span> <span class="s1">&#39;pcode&#39;</span><span class="p">,</span> <span class="s1">&#39;category&#39;</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">665552
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
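<p>顺带一提：pandas 也可以直接读取 .csv.gz 压缩文件（compression 参数默认按扩展名推断），无需手动解压。下面是一个可独立运行的小示例，先写出一个与原数据同结构的演示压缩文件（文件名 area_demo.csv.gz 为演示用假设）：</p>

```python
import gzip

import pandas as pd

# 演示用：写一个与原数据同结构的小样例压缩文件（code,name,level,pcode,category）
sample = "110000000000,北京市,1,0,\n120000000000,天津市,1,0,\n"
with gzip.open('area_demo.csv.gz', 'wt', encoding='utf-8') as f:
    f.write(sample)

# pandas 可直接读取 gzip 压缩的 csv，无需手动解压
df = pd.read_csv('area_demo.csv.gz', header=None)
df.columns = ['code', 'name', 'level', 'pcode', 'category']
print(df['name'].tolist())  # ['北京市', '天津市']
```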
<br>
<h2 id="三查看区划等级">三、查看区划等级</h2>
<p><strong>区划级别</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> 1级：省、直辖市、自治区
 2级：地级市
 3级：市辖区、县（旗）、县级市、自治县（自治旗）、特区、林区
 4级：镇、乡、民族乡、县辖区、街道
 5级：村、居委会
</code></pre></div>  <br>
<h3 id="31-省">3.1 省</h3>
<p>查看所有省名字</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">1</span><span class="p">][</span><span class="s1">&#39;name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([&#39;北京市&#39;, &#39;天津市&#39;, &#39;河北省&#39;, &#39;山西省&#39;, &#39;内蒙古自治区&#39;, &#39;辽宁省&#39;, &#39;吉林省&#39;, &#39;黑龙江省&#39;, &#39;上海市&#39;,
       &#39;江苏省&#39;, &#39;浙江省&#39;, &#39;安徽省&#39;, &#39;福建省&#39;, &#39;江西省&#39;, &#39;山东省&#39;, &#39;河南省&#39;, &#39;湖北省&#39;, &#39;湖南省&#39;,
       &#39;广东省&#39;, &#39;广西壮族自治区&#39;, &#39;海南省&#39;, &#39;重庆市&#39;, &#39;四川省&#39;, &#39;贵州省&#39;, &#39;云南省&#39;, &#39;西藏自治区&#39;,
       &#39;陕西省&#39;, &#39;甘肃省&#39;, &#39;青海省&#39;, &#39;宁夏回族自治区&#39;, &#39;新疆维吾尔自治区&#39;], dtype=object)
</code></pre></div><p>code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有省的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">1</span><span class="p">][</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([&#39;11&#39;, &#39;12&#39;, &#39;13&#39;, &#39;14&#39;, &#39;15&#39;, &#39;21&#39;, &#39;22&#39;, &#39;23&#39;, &#39;31&#39;, &#39;32&#39;, &#39;33&#39;,
       &#39;34&#39;, &#39;35&#39;, &#39;36&#39;, &#39;37&#39;, &#39;41&#39;, &#39;42&#39;, &#39;43&#39;, &#39;44&#39;, &#39;45&#39;, &#39;46&#39;, &#39;50&#39;,
       &#39;51&#39;, &#39;52&#39;, &#39;53&#39;, &#39;54&#39;, &#39;61&#39;, &#39;62&#39;, &#39;63&#39;, &#39;64&#39;, &#39;65&#39;], dtype=object)
</code></pre></div><br>
<p>省份名和区划代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">province_code_df = pd.DataFrame(
    {&#39;province&#39;: df[df[&#39;level&#39;]==1][&#39;name&#39;].values,
    &#39;code&#39;:df[df[&#39;level&#39;]==1][&#39;code&#39;].astype(str).str[:2].values}
)

province_code_df
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
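<p>得到 province_code_df 之后，常见的后续操作是把「代码前两位 → 省份名」做成字典，便于后续匹配地址数据。下面用内存中的小样例演示（样例数据为演示假设，字段含义与上文一致）：</p>

```python
import pandas as pd

# 演示用小样例：字段与上文 df 一致（code 共 12 位，level=1 为省级）
df = pd.DataFrame({
    'code': [110000000000, 120000000000, 110101000000],
    'name': ['北京市', '天津市', '东城区'],
    'level': [1, 1, 3],
})

# 省级代码前两位 -> 省份名
code2province = dict(zip(
    df[df['level'] == 1]['code'].astype(str).str[:2],
    df[df['level'] == 1]['name'],
))
print(code2province)  # {'11': '北京市', '12': '天津市'}
```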
<br>
<h3 id="32-市">3.2 市</h3>
<p>code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有市的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">city_code_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">{</span><span class="s1">&#39;city&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">2</span><span class="p">][</span><span class="s1">&#39;name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
     <span class="s1">&#39;code&#39;</span><span class="p">:</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">2</span><span class="p">][</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">}</span>
<span class="p">)</span>

<span class="n">city_code_df</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<h3 id="33-县">3.3 县</h3>
<p>code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有县的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">county_code_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">{</span><span class="s1">&#39;county&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">3</span><span class="p">][</span><span class="s1">&#39;name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
     <span class="s1">&#39;code&#39;</span><span class="p">:</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">3</span><span class="p">][</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="p">[:</span><span class="mi">6</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">}</span>
<span class="p">)</span>

<span class="n">county_code_df</span>
</code></pre></div><p><img loading="lazy" src="img/05-df.png" alt=""  />
</p>
<br>
<h3 id="34-镇">3.4 镇</h3>
<p>code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有镇的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">zhen_code_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">{</span><span class="s1">&#39;zhen&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">4</span><span class="p">][</span><span class="s1">&#39;name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
     <span class="s1">&#39;code&#39;</span><span class="p">:</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">4</span><span class="p">][</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="p">[:</span><span class="mi">9</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">}</span>
<span class="p">)</span>

<span class="n">zhen_code_df</span>
</code></pre></div><p><img loading="lazy" src="img/06-df.png" alt=""  />
</p>
<br>
<h3 id="35-村">3.5 村</h3>
<p>code: 共12位(省2位，市2位，县2位，镇3位，村3位), 查看所有村的代码</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">village_code_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">{</span><span class="s1">&#39;village&#39;</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">5</span><span class="p">][</span><span class="s1">&#39;name&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
     <span class="s1">&#39;code&#39;</span><span class="p">:</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;level&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">5</span><span class="p">][</span><span class="s1">&#39;code&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="p">[:</span><span class="mi">12</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">}</span>
<span class="p">)</span>

<span class="n">village_code_df</span>
</code></pre></div><p><img loading="lazy" src="img/07-df.png" alt=""  />
</p>
<p><br><br></p>
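<p>上文各级代码都是对 12 位 code 取不同长度的前缀（省2、市2、县2、镇3、村3），可以写成一个小函数，由村级代码一次性得到各级代码（示例函数为演示写法）：</p>

```python
def split_code(code: str) -> dict:
    """按 省2+市2+县2+镇3+村3 的规则，取 12 位区划代码的各级前缀"""
    assert len(code) == 12, '区划代码应为 12 位'
    return {
        'province': code[:2],
        'city': code[:4],
        'county': code[:6],
        'town': code[:9],
        'village': code,
    }

print(split_code('110101001001'))
```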
<h2 id="四-城乡分类">四、 城乡分类</h2>
<p>城乡分类 (1开头是城镇，2开头是乡村)</p>
<ul>
<li>111表示主城区；</li>
<li>112表示城乡接合区；</li>
<li>121表示镇中心区；</li>
<li>122表示镇乡接合区；</li>
<li>123表示特殊区域；</li>
<li>210表示乡中心区；</li>
<li>220表示村庄</li>
</ul>
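<p>城乡分类代码也可以整理成对照表，配合「1 开头为城镇、2 开头为乡村」的规则做判断（以下代码为演示写法）：</p>

```python
# 城乡分类代码 -> 含义（来自上文列表）
category_names = {
    111: '主城区', 112: '城乡接合区', 121: '镇中心区',
    122: '镇乡接合区', 123: '特殊区域', 210: '乡中心区', 220: '村庄',
}

def is_urban(category) -> bool:
    """城乡分类代码以 1 开头为城镇，以 2 开头为乡村"""
    return str(category).startswith('1')

print(category_names[121], is_urban(121))  # 镇中心区 True
print(category_names[220], is_urban(220))  # 村庄 False
```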
<p>查看所有的城镇</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#category以1为开头，即城镇</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;category&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;1&#39;</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/08-df.png" alt=""  />
</p>
<br>
<p>查看所有的镇中心区</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;category&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">121</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/09-df.png" alt=""  />
</p>
<br>
<br>
<h2 id="相关内容">相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-29-china-area-division-change/"><strong>中国行政区划代码历史沿革数据库</strong></a></li>
</ul>
<p><br><br></p>
<p><img loading="lazy" src="img/10-df.png" alt=""  />
</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 行政区划代码历史沿革数据集</title>
      <link>https://textdata.cn/blog/2023-12-29-china-area-division-change/</link>
      <pubDate>Fri, 29 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-29-china-area-division-change/</guid>
      <description>&lt;p&gt;前一期分享了 &lt;a href=&#34;https://textdata.cn/blog/2023-12-29-china-area-dataset/&#34;&gt;数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）&lt;/a&gt;，今天再分享一个行政区划数据库。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一概况&#34;&gt;一、概况&lt;/h2&gt;
&lt;p&gt;整理行政区划的历史沿革，包括拆分合并、名称变化、隶属变化、级别变化等变更情况。&lt;/p&gt;
&lt;p&gt;可根据身份证号前 6 位查询持证人所在地：出生或初次申领时的所在地，以及与之对应的当今的所在地。因我国1984年开始制发居民身份证、身份证号中的行政区划代码精确到县，故目前只整理到县级及以上、1984 年及以后。&lt;/p&gt;
&lt;p&gt;数据现已更新到 2022 年底。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;按截图操作即可获取数据集&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-cover.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二实验代码&#34;&gt;二、实验代码&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#代码文件放在 division-changes文件夹内&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;translate&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;translate&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 正向查询（起始年份 &amp;lt; 目标年份）&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;translate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;512323&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1984&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2018&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 返回 [&amp;#34;500119&amp;#34;]&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 1984年的四川省涪陵地区南川县&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 对应于2018年的重庆市南川区&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;translate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;430404&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2018&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 返回 [&amp;#34;430407&amp;#34;, &amp;#34;430408&amp;#34;]&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 2000年的湖南省衡阳市城北区&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 对应于2018年的湖南省衡阳市石鼓区、蒸湘区&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 反向查询（起始年份 &amp;gt; 目标年份）&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;translate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;110102&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2010&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# 返回 [&amp;#39;110102&amp;#39;, &amp;#39;110104&amp;#39;]&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 2010年的北京市西城区&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 对应于2000年的北京市西城区、宣武区&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三相关内容&#34;&gt;三、相关内容&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-29-china-area-dataset/&#34;&gt;数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <content:encoded><![CDATA[<p>前一期分享了 <a href="https://textdata.cn/blog/2023-12-29-china-area-dataset/">数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）</a>，今天再分享一个行政区划数据库。</p>
<p><br><br></p>
<h2 id="一概况">一、概况</h2>
<p>整理行政区划的历史沿革，包括拆分合并、名称变化、隶属变化、级别变化等变更情况。</p>
<p>可根据身份证号前 6 位查询持证人所在地：出生或初次申领时的所在地，以及与之对应的当今的所在地。因我国1984年开始制发居民身份证、身份证号中的行政区划代码精确到县，故目前只整理到县级及以上、1984 年及以后。</p>
<p>数据现已更新到 2022 年底。</p>
<br>
<p>按截图操作即可获取数据集</p>
<p><img loading="lazy" src="img/01-cover.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二实验代码">二、实验代码</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#代码文件放在 division-changes文件夹内</span>

<span class="kn">from</span> <span class="nn">translate</span> <span class="kn">import</span> <span class="n">translate</span>


<span class="c1"># 正向查询（起始年份 &lt; 目标年份）</span>

<span class="n">translate</span><span class="p">(</span><span class="s2">&#34;512323&#34;</span><span class="p">,</span> <span class="mi">1984</span><span class="p">,</span> <span class="mi">2018</span><span class="p">)</span> <span class="c1"># 返回 [&#34;500119&#34;]</span>
    <span class="c1"># 1984年的四川省涪陵地区南川县</span>
    <span class="c1"># 对应于2018年的重庆市南川区</span>

<span class="n">translate</span><span class="p">(</span><span class="s2">&#34;430404&#34;</span><span class="p">,</span> <span class="mi">2000</span><span class="p">,</span> <span class="mi">2018</span><span class="p">)</span> <span class="c1"># 返回 [&#34;430407&#34;, &#34;430408&#34;]</span>
    <span class="c1"># 2000年的湖南省衡阳市城北区</span>
    <span class="c1"># 对应于2018年的湖南省衡阳市石鼓区、蒸湘区</span>

<span class="c1"># 反向查询（起始年份 &gt; 目标年份）</span>

<span class="n">translate</span><span class="p">(</span><span class="s2">&#34;110102&#34;</span><span class="p">,</span> <span class="mi">2010</span><span class="p">,</span> <span class="mi">2000</span><span class="p">)</span> <span class="c1"># 返回 [&#39;110102&#39;, &#39;110104&#39;]</span>
    <span class="c1"># 2010年的北京市西城区</span>
    <span class="c1"># 对应于2000年的北京市西城区、宣武区</span>
</code></pre></div><br>
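<p>配合身份证号使用时，思路是：先取身份证号前 6 位作为（申领时的）县级区划代码，再交给上面的 translate 做跨年份换算。取前 6 位这一步可以这样写（示例号码为虚构）：</p>

```python
def area_code_of(id_number: str) -> str:
    """身份证号前 6 位即申领时所在的县级行政区划代码"""
    return id_number[:6]

# 虚构的 18 位号码，前 6 位对应 1984 年的四川省涪陵地区南川县
print(area_code_of('512323198401010010'))  # 512323
```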
<br>
<h2 id="三相关内容">三、相关内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-29-china-area-dataset/">数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>Polars库 | 最强 Pandas 平替来了</title>
      <link>https://textdata.cn/blog/2023-12-27-polars-tutorial-an-altertaive-of-pandas/</link>
      <pubDate>Wed, 27 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-27-polars-tutorial-an-altertaive-of-pandas/</guid>
      <description>&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;Polars 是一个用于操作结构化数据的高性能 DataFrame 库。它从零开始用 Rust 编写，贴近底层硬件；矢量化、列式的处理方式使其能在现代处理器上实现缓存友好的算法和高性能。如果您经常使用 pandas，上手 Polars 会感觉很轻松，可以说它是最有潜力平替 pandas 的包。&lt;/p&gt;
&lt;p&gt;Polars 在独立的 TPCH 基准测试中与其他几个解决方案进行了基准测试。该基准测试旨在复制实践中使用的数据整理操作。由于其并行执行引擎、高效算法以及 SIMD（单指令、多数据）矢量化的使用，Polars 轻松胜过其他解决方案。&lt;strong&gt;与pandas相比，它可以实现30倍以上的性能提升&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;Polars 的目标是提供一个闪电般快速的&lt;code&gt;DataFrame&lt;/code&gt;库：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;利用机器上所有可用的内核。&lt;/li&gt;
&lt;li&gt;优化查询以减少不必要的工作/内存分配。&lt;/li&gt;
&lt;li&gt;处理比可用 RAM 大得多的数据集。&lt;/li&gt;
&lt;li&gt;拥有一致且可预测的 API。&lt;/li&gt;
&lt;li&gt;具有严格的架构（在运行查询之前应该知道数据类型）。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href=&#34;https://pola-rs.github.io/polars/user-guide/&#34;&gt;User guide: https://pola-rs.github.io/polars/user-guide/&lt;/a&gt;
&lt;a href=&#34;https://pola-rs.github.io/polars/py-polars/html/reference/io.html&#34;&gt;API reference: https://pola-rs.github.io/polars/py-polars/html/reference/io.html&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;打开命令行， 执行  polars 安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install &amp;#39;polars[all]&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;二数据读写&#34;&gt;二、数据读写&lt;/h2&gt;
&lt;p&gt;Polars 读写数据支持&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;常见的数据文件，如 csv、xlsx、json、parquet ；&lt;/li&gt;
&lt;li&gt;云存储，如 S3、Azure Blob, BigQuery；&lt;/li&gt;
&lt;li&gt;数据库，如postgres、mysql&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;咱们主要分享常见的代码操作&lt;/p&gt;
&lt;h3 id=&#34;21-dataframe&#34;&gt;2.1 DataFrame&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;polars&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pl&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;polars.selectors&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cs&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;datetime&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;idx&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;张三&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;李四&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;王五&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;赵六&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;birthday&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2009&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2005&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;31&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1995&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
        &lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;gender&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;男&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;男&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;男&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;女&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;bio&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;好好学习，天天向上&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;s2&#34;&gt;&amp;#34;泰难了&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;s2&#34;&gt;&amp;#34;学习有毛用&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                &lt;span class=&#34;s2&#34;&gt;&amp;#34;躺平ing&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#存入csv、excel、json、parquet&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.csv&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.xlsx&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.json&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_parquet&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.parquet&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-csvexcel&#34;&gt;2.2 csv、excel&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;df.write_csv  write to csv&lt;/li&gt;
&lt;li&gt;pl.read_csv  read csv&lt;/li&gt;
&lt;li&gt;df.write_excel  write to an xlsx file&lt;/li&gt;
&lt;li&gt;pl.read_excel  read xlsx&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;df_csv = pl.read_csv(&amp;#39;data.csv&amp;#39;)
df_xlsx = pl.read_excel(&amp;#39;data.xlsx&amp;#39;)

df_csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ str                 ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ &amp;#34;2009-05-01T00:…    ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ &amp;#34;2005-10-15T00:…    ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ &amp;#34;2000-12-31T00:…    ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ &amp;#34;1995-06-15T00:…    ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that after the round trip through CSV, the birthday field comes back with dtype str&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-jsonparquet&#34;&gt;2.3 json/parquet&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;df.write_json&lt;/li&gt;
&lt;li&gt;pl.read_json&lt;/li&gt;
&lt;li&gt;df.write_parquet&lt;/li&gt;
&lt;li&gt;pl.read_parquet&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df_json&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.json&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_parquet&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_parquet&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data.parquet&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df_json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that writing with df.write_json or df.write_parquet preserves the datetime dtype of the birthday field, whereas csv and xlsx only store it as str.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三常用表达式&#34;&gt;三、常用表达式&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;Expressions&lt;/code&gt; are the core of Polars: &lt;code&gt;expressions&lt;/code&gt; handle simple queries and scale smoothly to complex ones. The basic building blocks are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pl.col&lt;/strong&gt; column selector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;df.select&lt;/strong&gt;  combined with pl.col, returns a dataframe&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;selector&lt;/strong&gt;  selector expressions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;df.filter&lt;/strong&gt; combined with pl.col, returns a dataframe&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;df.with_columns&lt;/strong&gt; combined with pl.col, returns a dataframe&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;df.groupby&lt;/strong&gt;  combined with pl.col, returns a dataframe&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;31-plcol&#34;&gt;3.1 pl.col&lt;/h3&gt;
&lt;p&gt;Select one or more fields (columns)&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;birthday&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;col(&amp;#34;birthday&amp;#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;birthday&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;col([&amp;#34;name&amp;#34;, &amp;#34;birthday&amp;#34;])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32--dfselect&#34;&gt;3.2  df.select&lt;/h3&gt;
&lt;p&gt;Select the &lt;em&gt;&lt;strong&gt;name&lt;/strong&gt;&lt;/em&gt; and &lt;em&gt;&lt;strong&gt;birthday&lt;/strong&gt;&lt;/em&gt; fields; there are several equivalent ways to write this&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#df[[&amp;#39;name&amp;#39;, &amp;#39;birthday&amp;#39;]]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#df.select(&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#    pl.col(&amp;#34;name&amp;#34;), &lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#    pl.col(&amp;#34;birthday&amp;#34;), &lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#df.select([&amp;#34;name&amp;#34;, &amp;#34;birthday&amp;#34;])&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;birthday&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 2)
┌────────┬─────────────────────┐
│ name   ┆ birthday            │
│ ---    ┆ ---                 │
│ str    ┆ datetime[μs]        │
╞════════╪═════════════════════╡
│ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 │
│ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 │
│ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 │
│ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 │
└────────┴─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Even when a single field is selected this way, polars returns a dataframe&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#df[[&amp;#39;name&amp;#39;]]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#df.select([&amp;#34;name&amp;#34;])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 1)
┌────────┐
│ name   │
│ ---    │
│ str    │
╞════════╡
│ &amp;#34;张三&amp;#34; │
│ &amp;#34;李四&amp;#34; │
│ &amp;#34;王五&amp;#34; │
│ &amp;#34;赵六&amp;#34; │
└────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-dfwith_columns&#34;&gt;3.3 df.with_columns&lt;/h3&gt;
&lt;p&gt;Similar to df.select, but df.with_columns keeps all existing fields while selecting or adding columns&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;with_columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;with_columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;alias&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;姓名&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 6)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┬────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  ┆ 姓名   │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  ┆ ---    │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  ┆ str    │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╪════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   ┆ &amp;#34;张三&amp;#34; │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             ┆ &amp;#34;李四&amp;#34; │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         ┆ &amp;#34;王五&amp;#34; │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            ┆ &amp;#34;赵六&amp;#34; │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34--dffilter&#34;&gt;3.4  df.filter&lt;/h3&gt;
&lt;p&gt;Filter the records whose birthday falls after Jan 1, 2000 (the post-00s cohort)&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
  &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;birthday&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (3, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;35-dfgroupby&#34;&gt;3.5 df.groupby&lt;/h3&gt;
&lt;p&gt;Group the records by the &lt;em&gt;&lt;strong&gt;gender&lt;/strong&gt;&lt;/em&gt; field&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#for gender, gender_df in df.groupby(&amp;#39;gender&amp;#39;):&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gender&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gender_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gender&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gender&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gender_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gender_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;男 3 &amp;lt;class &amp;#39;polars.dataframe.frame.DataFrame&amp;#39;&amp;gt;
女 1 &amp;lt;class &amp;#39;polars.dataframe.frame.DataFrame&amp;#39;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Compute the mean bio text length for male and female students separately&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gender&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gender_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gender&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gender&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;gender_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;男 5.666666666666667
女 5.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gender&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;agg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len_chars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;alias&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mean_len&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (2, 3)
┌────────┬───────┬──────────┐
│ gender ┆ count ┆ mean_len │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ u32   ┆ f64      │
╞════════╪═══════╪══════════╡
│ &amp;#34;女&amp;#34;   ┆ 1     ┆ 5.0      │
│ &amp;#34;男&amp;#34;   ┆ 3     ┆ 5.666667 │
└────────┴───────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四选择器&#34;&gt;四、选择器&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;cs.integer, cs.string, cs.numeric, cs.datetime(), cs.temporal() select fields by data type&lt;/li&gt;
&lt;li&gt;cs.contains, cs.matches select fields whose names match a (regex) pattern&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;41-按数据格式筛选&#34;&gt;4.1 按数据格式筛选&lt;/h3&gt;
&lt;p&gt;Select the fields whose dtype is integer or string; returns a dataframe&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;polars.selectors&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cs&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;integer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;string&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 4)
┌─────┬────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ str    ┆ str                  │
╞═════╪════════╪════════╪══════════════════════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            │
└─────┴────────┴────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Select the datetime-typed fields; returns a dataframe&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#df.select(cs.temporal())&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 1)
┌─────────────────────┐
│ birthday            │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2009-05-01 00:00:00 │
│ 2005-10-15 00:00:00 │
│ 2000-12-31 00:00:00 │
│ 1995-06-15 00:00:00 │
└─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;42-cscontains-csmatches&#34;&gt;4.2 cs.contains/ cs.matches&lt;/h3&gt;
&lt;p&gt;Select the fields whose names contain the letter r; returns a dataframe&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#select the fields whose names contain r&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 2)
┌─────────────────────┬────────┐
│ birthday            ┆ gender │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   │
│ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   │
│ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   │
│ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   │
└─────────────────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Select the fields whose names contain &lt;em&gt;&lt;strong&gt;na&lt;/strong&gt;&lt;/em&gt; or &lt;em&gt;&lt;strong&gt;io&lt;/strong&gt;&lt;/em&gt;; returns a dataframe&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;matches&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;na|io&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 2)
┌────────┬──────────────────────┐
│ name   ┆ bio                  │
│ ---    ┆ ---                  │
│ str    ┆ str                  │
╞════════╪══════════════════════╡
│ &amp;#34;张三&amp;#34; ┆ &amp;#34;好好学习，天天向上&amp;#34;   │
│ &amp;#34;李四&amp;#34; ┆ &amp;#34;泰难了&amp;#34;             │
│ &amp;#34;王五&amp;#34; ┆ &amp;#34;学习有毛用&amp;#34;         │
│ &amp;#34;赵六&amp;#34; ┆ &amp;#34;躺平ing&amp;#34;            │
└────────┴──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五逻辑条件&#34;&gt;五、逻辑条件&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;pl.when(condition).then(result1).otherwise(result2)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When condition holds, the value is result1; otherwise it is result2&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;df.with_columns(
    pl.when(pl.col(&amp;#39;birthday&amp;#39;)&amp;gt;datetime(2000, 1, 1))
    .then(True)
    .otherwise(False)
    .alias(&amp;#39;00后&amp;#39;)
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 6)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┬───────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  ┆ 00后  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  ┆ ---   │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  ┆ bool  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╪═══════╡
│ 1   ┆ &amp;#34;张三&amp;#34; ┆ 2009-05-01 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;好好学习，天天向上&amp;#34;   ┆ true  │
│ 2   ┆ &amp;#34;李四&amp;#34; ┆ 2005-10-15 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;泰难了&amp;#34;             ┆ true  │
│ 3   ┆ &amp;#34;王五&amp;#34; ┆ 2000-12-31 00:00:00 ┆ &amp;#34;男&amp;#34;   ┆ &amp;#34;学习有毛用&amp;#34;         ┆ true  │
│ 4   ┆ &amp;#34;赵六&amp;#34; ┆ 1995-06-15 00:00:00 ┆ &amp;#34;女&amp;#34;   ┆ &amp;#34;躺平ing&amp;#34;            ┆ false │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;六字符串操作&#34;&gt;6. String Operations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pl.col().str.len_chars()&lt;/strong&gt; character length&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pl.col().str.contains(pat)&lt;/strong&gt; whether the string matches the pattern pat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pl.col().str.extract(pat)&lt;/strong&gt; extract the text matching the pattern&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pl.col().str.replace(old_pat, new_pat)&lt;/strong&gt; replace old_pat with new_pat&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;61-strlen_chars&#34;&gt;6.1 str.len_chars()&lt;/h3&gt;
&lt;p&gt;Compute the character length of bio and store the result in the lenth field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len_chars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;alias&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;lenth&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 2)
┌────────────────────┬───────┐
│ bio                ┆ lenth │
│ ---                ┆ ---   │
│ str                ┆ u32   │
╞════════════════════╪═══════╡
│ 好好学习，天天向上 ┆ 9     │
│ 泰难了             ┆ 3     │
│ 学习有毛用         ┆ 5     │
│ 躺平ing            ┆ 5     │
└────────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;62-strcontains&#34;&gt;6.2 str.contains()&lt;/h3&gt;
&lt;p&gt;Filter for records whose bio contains &lt;strong&gt;学习&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
  &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;学习&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (2, 5)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         │
└─────┴──────┴─────────────────────┴────────┴────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;63-strextract&#34;&gt;6.3 str.extract_all()&lt;/h3&gt;
&lt;p&gt;Using the negative-word dictionary &lt;code&gt;&#39;躺平|难|毛&#39;&lt;/code&gt;, extract all matching negative words from bio and store the result in the neg field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;with_columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pl&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bio&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;extract_all&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;躺平|难|毛&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;alias&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;neg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;shape: (4, 6)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┬───────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                ┆ neg       │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                ┆ ---       │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                ┆ list[str] │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╪═══════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 ┆ []        │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             ┆ ["难"]    │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         ┆ ["毛"]    │
│ 4   ┆ 赵六 ┆ 1995-06-15 00:00:00 ┆ 女     ┆ 躺平ing            ┆ ["躺平"]  │
└─────┴──────┴─────────────────────┴────────┴────────────────────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<h2 id="一介绍">1. Introduction</h2>
<p>Polars is a high-performance DataFrame library for manipulating structured data. Written from scratch in Rust, it works close to the machine: its vectorized, columnar processing enables cache-coherent algorithms and high performance on modern processors. If you use pandas regularly, Polars will feel familiar, and it is arguably the most promising alternative to pandas.</p>
<p>Polars has been benchmarked against several other solutions on the independent TPC-H benchmark, which is designed to replicate the data-wrangling operations used in practice. Thanks to its parallel execution engine, efficient algorithms, and use of SIMD (Single Instruction, Multiple Data) vectorization, Polars easily outperforms the other solutions. <strong>Compared with pandas, it can deliver speedups of more than 30x.</strong></p>
<p>Polars aims to provide a lightning-fast <code>DataFrame</code> library that:</p>
<ul>
<li>Utilizes all available cores on your machine.</li>
<li>Optimizes queries to reduce unneeded work and memory allocations.</li>
<li>Handles datasets much larger than the available RAM.</li>
<li>Has a consistent and predictable API.</li>
<li>Has a strict schema (data types should be known before the query is run).</li>
</ul>
<br>
<p><a href="https://pola-rs.github.io/polars/user-guide/">User guide: https://pola-rs.github.io/polars/user-guide/</a>
<a href="https://pola-rs.github.io/polars/py-polars/html/reference/io.html">API reference: https://pola-rs.github.io/polars/py-polars/html/reference/io.html</a></p>
<br>
<p>Open a terminal and run the polars install command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install &#39;polars[all]&#39;
</code></pre></div><br>
<h2 id="二数据读写">2. Reading and Writing Data</h2>
<p>Polars can read and write</p>
<ul>
<li>common data files such as csv, xlsx, json, and parquet;</li>
<li>cloud storage such as S3, Azure Blob, and BigQuery;</li>
<li>databases such as Postgres and MySQL.</li>
</ul>
<p>Here we focus on the most common operations.</p>
<h3 id="21-dataframe">2.1 DataFrame</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
<span class="kn">import</span> <span class="nn">polars.selectors</span> <span class="k">as</span> <span class="nn">cs</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="s2">&#34;idx&#34;</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
        <span class="s2">&#34;name&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;张三&#34;</span><span class="p">,</span> <span class="s2">&#34;李四&#34;</span><span class="p">,</span> <span class="s2">&#34;王五&#34;</span><span class="p">,</span> <span class="s2">&#34;赵六&#34;</span><span class="p">],</span>
        <span class="s2">&#34;birthday&#34;</span><span class="p">:</span> <span class="p">[</span>
            <span class="n">datetime</span><span class="p">(</span><span class="mi">2009</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
            <span class="n">datetime</span><span class="p">(</span><span class="mi">2005</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span>
            <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">31</span><span class="p">),</span>
            <span class="n">datetime</span><span class="p">(</span><span class="mi">1995</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span>
        <span class="p">],</span>
        <span class="s2">&#34;gender&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;男&#34;</span><span class="p">,</span> <span class="s2">&#34;男&#34;</span><span class="p">,</span> <span class="s2">&#34;男&#34;</span><span class="p">,</span> <span class="s2">&#34;女&#34;</span><span class="p">],</span>
        <span class="s2">&#34;bio&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;好好学习，天天向上&#34;</span><span class="p">,</span> 
                <span class="s2">&#34;泰难了&#34;</span><span class="p">,</span> 
                <span class="s2">&#34;学习有毛用&#34;</span><span class="p">,</span> 
                <span class="s2">&#34;躺平ing&#34;</span><span class="p">],</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="c1">#save to csv, xlsx, json, parquet</span>
<span class="n">df</span><span class="o">.</span><span class="n">write_csv</span><span class="p">(</span><span class="s2">&#34;data.csv&#34;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">write_excel</span><span class="p">(</span><span class="s2">&#34;data.xlsx&#34;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">write_json</span><span class="p">(</span><span class="s2">&#34;data.json&#34;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">write_parquet</span><span class="p">(</span><span class="s2">&#34;data.parquet&#34;</span><span class="p">)</span>


<span class="n">df</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 5)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         │
│ 4   ┆ 赵六 ┆ 1995-06-15 00:00:00 ┆ 女     ┆ 躺平ing            │
└─────┴──────┴─────────────────────┴────────┴────────────────────┘
</code></pre></div><br>
<h3 id="22-csvexcel">2.2 csv and xlsx</h3>
<ul>
<li>df.write_csv: save to csv</li>
<li>pl.read_csv: read csv</li>
<li>df.write_excel: save to xlsx</li>
<li>pl.read_excel: read xlsx</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df_csv = pl.read_csv(&#39;data.csv&#39;)
df_xlsx = pl.read_excel(&#39;data.xlsx&#39;)

df_csv
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 5)
┌─────┬──────┬──────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday             ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                  ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ str                  ┆ str    ┆ str                │
╞═════╪══════╪══════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01T00:00:00… ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 2005-10-15T00:00:00… ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 2000-12-31T00:00:00… ┆ 男     ┆ 学习有毛用         │
│ 4   ┆ 赵六 ┆ 1995-06-15T00:00:00… ┆ 女     ┆ 躺平ing            │
└─────┴──────┴──────────────────────┴────────┴────────────────────┘
</code></pre></div><p>Note that the birthday field is now of type str.</p>
<br>
<h3 id="23-jsonparquet">2.3 json/parquet</h3>
<ul>
<li>df.write_json</li>
<li>pl.read_json</li>
<li>df.write_parquet</li>
<li>pl.read_parquet</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df_json</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s2">&#34;data.json&#34;</span><span class="p">)</span>
<span class="n">df_parquet</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;data.parquet&#34;</span><span class="p">)</span>

<span class="n">df_json</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 5)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         │
│ 4   ┆ 赵六 ┆ 1995-06-15 00:00:00 ┆ 女     ┆ 躺平ing            │
└─────┴──────┴─────────────────────┴────────┴────────────────────┘
</code></pre></div><p>Note: df.write_json and df.write_parquet both preserve the datetime type of the birthday field, whereas csv and xlsx store it as str.</p>
<p><br><br></p>
<h2 id="三常用表达式">3. Common Expressions</h2>
<p><code>Expressions</code> are the core of Polars: they handle simple queries and scale easily to complex ones. The basic expressions are</p>
<ul>
<li><strong>pl.col</strong>: column selector</li>
<li><strong>df.select</strong>: combined with pl.col, returns a DataFrame</li>
<li><strong>selector</strong>: column selectors from polars.selectors</li>
<li><strong>df.filter</strong>: combined with pl.col, returns a DataFrame</li>
<li><strong>df.with_columns</strong>: combined with pl.col, returns a DataFrame</li>
<li><strong>df.groupby</strong>: combined with pl.col, returns a DataFrame</li>
</ul>
<h3 id="31-plcol">3.1 pl.col</h3>
<p>Select one or more fields (columns):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;birthday&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">col(&#34;birthday&#34;)
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;birthday&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">col([&#34;name&#34;, &#34;birthday&#34;])
</code></pre></div><br>
<h3 id="32--dfselect">3.2 df.select</h3>
<p>Select the <em><strong>name</strong></em> and <em><strong>birthday</strong></em> fields; there are several equivalent ways to do this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#df[[&#39;name&#39;, &#39;birthday&#39;]]</span>

<span class="c1">#df.select(</span>
<span class="c1">#    pl.col(&#34;name&#34;), </span>
<span class="c1">#    pl.col(&#34;birthday&#34;), </span>
<span class="c1">#)</span>


<span class="c1">#df.select([&#34;name&#34;, &#34;birthday&#34;])</span>


<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;birthday&#34;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 2)
┌──────┬─────────────────────┐
│ name ┆ birthday            │
│ ---  ┆ ---                 │
│ str  ┆ datetime[μs]        │
╞══════╪═════════════════════╡
│ 张三 ┆ 2009-05-01 00:00:00 │
│ 李四 ┆ 2005-10-15 00:00:00 │
│ 王五 ┆ 2000-12-31 00:00:00 │
│ 赵六 ┆ 1995-06-15 00:00:00 │
└──────┴─────────────────────┘
</code></pre></div><br>
<p>Even when selecting a single field, Polars returns a DataFrame:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#df[[&#39;name&#39;]]</span>

<span class="c1">#df.select([&#34;name&#34;])</span>

<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&#34;name&#34;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ 张三 │
│ 李四 │
│ 王五 │
│ 赵六 │
└──────┘
</code></pre></div><br>
<h3 id="33-dfwith_columns">3.3 df.with_columns</h3>
<p>Similar to df.select, but df.with_columns keeps the existing columns while adding the selected (or derived) ones.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 5)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         │
│ 4   ┆ 赵六 ┆ 1995-06-15 00:00:00 ┆ 女     ┆ 躺平ing            │
└─────┴──────┴─────────────────────┴────────┴────────────────────┘
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;姓名&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 6)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┬──────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                ┆ 姓名 │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                ┆ ---  │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                ┆ str  │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╪══════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 ┆ 张三 │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             ┆ 李四 │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         ┆ 王五 │
│ 4   ┆ 赵六 ┆ 1995-06-15 00:00:00 ┆ 女     ┆ 躺平ing            ┆ 赵六 │
└─────┴──────┴─────────────────────┴────────┴────────────────────┴──────┘
</code></pre></div><br>
<h3 id="34--dffilter">3.4 df.filter</h3>
<p>Filter for records whose birthday falls after Jan 1, 2000 (the post-00s generation):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
  <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;birthday&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (3, 5)
┌─────┬──────┬─────────────────────┬────────┬────────────────────┐
│ idx ┆ name ┆ birthday            ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---                 ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ datetime[μs]        ┆ str    ┆ str                │
╞═════╪══════╪═════════════════════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 2009-05-01 00:00:00 ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 2005-10-15 00:00:00 ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 2000-12-31 00:00:00 ┆ 男     ┆ 学习有毛用         │
└─────┴──────┴─────────────────────┴────────┴────────────────────┘
</code></pre></div><br>
<h3 id="35-dfgroupby">3.5 df.groupby</h3>
<p>Group rows by the <em><strong>gender</strong></em> field:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#for gender, gender_df in df.groupby(&#39;gender&#39;):</span>
<span class="k">for</span> <span class="n">gender</span><span class="p">,</span> <span class="n">gender_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;gender&#39;</span><span class="p">)):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">gender</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">gender_df</span><span class="p">),</span> <span class="nb">type</span><span class="p">(</span><span class="n">gender_df</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">男 3 &lt;class &#39;polars.dataframe.frame.DataFrame&#39;&gt;
女 1 &lt;class &#39;polars.dataframe.frame.DataFrame&#39;&gt;
</code></pre></div><br>
<p>Compute the mean bio text length for each gender separately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">gender</span><span class="p">,</span> <span class="n">gender_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;gender&#39;</span><span class="p">)):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">gender</span><span class="p">,</span>  <span class="n">gender_df</span><span class="p">[</span><span class="s1">&#39;bio&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">男 5.666666666666667
女 5.0
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;gender&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">count</span><span class="p">(),</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;bio&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len_chars</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;mean_len&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (2, 3)
┌────────┬───────┬──────────┐
│ gender ┆ count ┆ mean_len │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ u32   ┆ f64      │
╞════════╪═══════╪══════════╡
│ 女     ┆ 1     ┆ 5.0      │
│ 男     ┆ 3     ┆ 5.666667 │
└────────┴───────┴──────────┘
</code></pre></div><p><br><br></p>
<h2 id="四选择器">4. Selectors</h2>
<ul>
<li>cs.integer, cs.string, cs.numeric, cs.datetime(), cs.temporal(): select fields by data type</li>
<li>cs.contains, cs.matches: select fields by regular expression</li>
</ul>
<h3 id="41-按数据格式筛选">4.1 Selecting by data type</h3>
<p>Select the integer- and string-typed fields; returns a DataFrame:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">polars.selectors</span> <span class="k">as</span> <span class="nn">cs</span>

<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">cs</span><span class="o">.</span><span class="n">integer</span><span class="p">(),</span> <span class="n">cs</span><span class="o">.</span><span class="n">string</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 4)
┌─────┬──────┬────────┬────────────────────┐
│ idx ┆ name ┆ gender ┆ bio                │
│ --- ┆ ---  ┆ ---    ┆ ---                │
│ i64 ┆ str  ┆ str    ┆ str                │
╞═════╪══════╪════════╪════════════════════╡
│ 1   ┆ 张三 ┆ 男     ┆ 好好学习，天天向上 │
│ 2   ┆ 李四 ┆ 男     ┆ 泰难了             │
│ 3   ┆ 王五 ┆ 男     ┆ 学习有毛用         │
│ 4   ┆ 赵六 ┆ 女     ┆ 躺平ing            │
└─────┴──────┴────────┴────────────────────┘
</code></pre></div><br>
<p>Select the datetime-typed fields; returns a DataFrame:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#df.select(cs.temporal())</span>

<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">cs</span><span class="o">.</span><span class="n">datetime</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 1)
┌─────────────────────┐
│ birthday            │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2009-05-01 00:00:00 │
│ 2005-10-15 00:00:00 │
│ 2000-12-31 00:00:00 │
│ 1995-06-15 00:00:00 │
└─────────────────────┘
</code></pre></div><br>
<h3 id="42-cscontains-csmatches">4.2 cs.contains / cs.matches</h3>
<p>Select fields whose names contain the letter r; returns a DataFrame:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#select fields whose names contain r</span>
<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">cs</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 2)
┌─────────────────────┬────────┐
│ birthday            ┆ gender │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2009-05-01 00:00:00 ┆ 男     │
│ 2005-10-15 00:00:00 ┆ 男     │
│ 2000-12-31 00:00:00 ┆ 男     │
│ 1995-06-15 00:00:00 ┆ 女     │
└─────────────────────┴────────┘
</code></pre></div><br>
<p>Select fields whose names match <em><strong>na</strong></em> or <em><strong>io</strong></em>; returns a DataFrame:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">cs</span><span class="o">.</span><span class="n">matches</span><span class="p">(</span><span class="s1">&#39;na|io&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 2)
┌────────┬──────────────────────┐
│ name   ┆ bio                  │
│ ---    ┆ ---                  │
│ str    ┆ str                  │
╞════════╪══════════════════════╡
│ &#34;张三&#34; ┆ &#34;好好学习，天天向上&#34; │
│ &#34;李四&#34; ┆ &#34;泰难了&#34;             │
│ &#34;王五&#34; ┆ &#34;学习有毛用&#34;         │
│ &#34;赵六&#34; ┆ &#34;躺平ing&#34;            │
└────────┴──────────────────────┘
</code></pre></div><p><br><br></p>
<h2 id="五逻辑条件">五、逻辑条件</h2>
<p><strong>pl.when(condition).then(result1).otherwise(result2)</strong></p>
<p>当满足 condition 时，值为 result1；反之，值为 result2。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df.with_columns(
    pl.when(pl.col(&#39;birthday&#39;)&gt;datetime(2000, 1, 1))
    .then(True)
    .otherwise(False)
    .alias(&#39;00后&#39;)
)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 6)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┬───────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  ┆ 00后  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  ┆ ---   │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  ┆ bool  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╪═══════╡
│ 1   ┆ &#34;张三&#34; ┆ 2009-05-01 00:00:00 ┆ &#34;男&#34;   ┆ &#34;好好学习，天天向上&#34; ┆ true  │
│ 2   ┆ &#34;李四&#34; ┆ 2005-10-15 00:00:00 ┆ &#34;男&#34;   ┆ &#34;泰难了&#34;             ┆ true  │
│ 3   ┆ &#34;王五&#34; ┆ 2000-12-31 00:00:00 ┆ &#34;男&#34;   ┆ &#34;学习有毛用&#34;         ┆ true  │
│ 4   ┆ &#34;赵六&#34; ┆ 1995-06-15 00:00:00 ┆ &#34;女&#34;   ┆ &#34;躺平ing&#34;            ┆ false │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┴───────┘
</code></pre></div><p><br><br></p>
<h2 id="六字符串操作">六、字符串操作</h2>
<ul>
<li><strong>pl.col().str.len_chars()</strong> 字符长度</li>
<li><strong>pl.col().str.contains(pat)</strong> 是否含某字符(符合pat模式)</li>
<li><strong>pl.col().str.extract(pat)</strong> 提取出符合模式的文本</li>
<li><strong>pl.col().str.replace(old_pat, new_pat)</strong>  把old_pat替换为new_pat</li>
</ul>
<h3 id="61-strlen_chars">6.1 str.len_chars()</h3>
<p>计算 bio 的文字长度，计算结果存储到 lenth 字段中</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;bio&#39;</span><span class="p">),</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;bio&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len_chars</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;lenth&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 2)
┌──────────────────────┬───────┐
│ bio                  ┆ lenth │
│ ---                  ┆ ---   │
│ str                  ┆ u32   │
╞══════════════════════╪═══════╡
│ &#34;好好学习，天天向上&#34; ┆ 9     │
│ &#34;泰难了&#34;             ┆ 3     │
│ &#34;学习有毛用&#34;         ┆ 5     │
│ &#34;躺平ing&#34;            ┆ 5     │
└──────────────────────┴───────┘
</code></pre></div><br>
<h3 id="62-strcontains">6.2 str.contains()</h3>
<p>从 bio 中筛选出含 <strong>学习</strong> 字眼的记录</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
  <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;bio&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;学习&#34;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (2, 5)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╡
│ 1   ┆ &#34;张三&#34; ┆ 2009-05-01 00:00:00 ┆ &#34;男&#34;   ┆ &#34;好好学习，天天向上&#34; │
│ 3   ┆ &#34;王五&#34; ┆ 2000-12-31 00:00:00 ┆ &#34;男&#34;   ┆ &#34;学习有毛用&#34;         │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┘
</code></pre></div><br>
<h3 id="63-strextract">6.3 str.extract_all()</h3>
<p>根据负面词典 <code>'躺平|难|毛'</code> 用 <strong>str.extract_all()</strong> 提取出所有负面词, 结果存储到字段 neg</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;bio&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">extract_all</span><span class="p">(</span><span class="s1">&#39;躺平|难|毛&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;neg&#39;</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">shape: (4, 6)
┌─────┬────────┬─────────────────────┬────────┬──────────────────────┬───────────┐
│ idx ┆ name   ┆ birthday            ┆ gender ┆ bio                  ┆ neg       │
│ --- ┆ ---    ┆ ---                 ┆ ---    ┆ ---                  ┆ ---       │
│ i64 ┆ str    ┆ datetime[μs]        ┆ str    ┆ str                  ┆ list[str] │
╞═════╪════════╪═════════════════════╪════════╪══════════════════════╪═══════════╡
│ 1   ┆ &#34;张三&#34; ┆ 2009-05-01 00:00:00 ┆ &#34;男&#34;   ┆ &#34;好好学习，天天向上&#34; ┆ []        │
│ 2   ┆ &#34;李四&#34; ┆ 2005-10-15 00:00:00 ┆ &#34;男&#34;   ┆ &#34;泰难了&#34;             ┆ [&#34;难&#34;]    │
│ 3   ┆ &#34;王五&#34; ┆ 2000-12-31 00:00:00 ┆ &#34;男&#34;   ┆ &#34;学习有毛用&#34;         ┆ [&#34;毛&#34;]    │
│ 4   ┆ &#34;赵六&#34; ┆ 1995-06-15 00:00:00 ┆ &#34;女&#34;   ┆ &#34;躺平ing&#34;            ┆ [&#34;躺平&#34;]  │
└─────┴────────┴─────────────────────┴────────┴──────────────────────┴───────────┘
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用gov工作报告生成数字化词频「面板数据」</title>
      <link>https://textdata.cn/blog/2023-12-27-measure-gov-digitalization/</link>
      <pubDate>Wed, 27 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-27-measure-gov-digitalization/</guid>
      <description>&lt;p&gt;使用 10 个城市的2003-2024年的政府工作报告，绘制出的「&lt;em&gt;&lt;strong&gt;数字化概念&lt;/strong&gt;&lt;/em&gt;」词频的趋势图。 直接上效果效果图&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;相关代码&#34;&gt;相关代码&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/&#34;&gt;代码 | 使用地方gov工作报告生成某类概念词词频「面板数据」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/&#34;&gt;数据集 | 国、省、市三级政府工作报告文本&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一直接上代码&#34;&gt;一、直接上代码&lt;/h2&gt;
&lt;h3 id=&#34;11-代码文件结构&#34;&gt;1.1 代码文件结构&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;项目文件夹
   |---代码.ipynb
   |---GovReportData        #数据集 | 国、省、市三级政府工作报告文本
           |---city.csv     #市政府工作报告（2002-2024）
           |---province.csv #省政府工作报告（2002-2024）
           |---nation.csv   #国务院政府工作报告（2002-2024）
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-读取数据&#34;&gt;1.2 读取数据&lt;/h3&gt;
&lt;p&gt;读取地级市报告数据文件 &lt;strong&gt;city.csv&lt;/strong&gt; ，&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/&#34;&gt;点击链接，获取政府工作报告数据集&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/city.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-设计面板数据生成函数&#34;&gt;1.3 设计面板数据生成函数&lt;/h3&gt;
&lt;p&gt;假设你使用的城市政府工作报告数据是大邓提供的，可以直接使用下面封装的函数，快速生成概念词典，指定城市指定年度区间的面板数据。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;generate_city_panel_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;concept_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_years&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    csvf: csv的文件路径
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    concept_words: 概念词词语列表
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    selected_citys: 筛选指定城市的数据进行计算，列表
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    selected_years: 筛选指定年度的数据进行计算，列表
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    结果返回dataframe， 每一行代表一个城市，每一列代表一年。
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
    
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;
    
    &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pivot_table&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                       &lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;#列-年份&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;city&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;#行-城市  &lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;doc&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;#单元格-文本&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;aggfunc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;c&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#让单元格填充文本&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;isin&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
    
    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;selected_years&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;selected_years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        

    &lt;span class=&#34;n&#34;&gt;word_count_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))))&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;concept_word_count_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;table_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;concept_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;concept_word_ratio_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;concept_word_count_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word_count_df&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;concept_word_ratio_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
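&lt;p&gt;以下是一个假设的最小示意，用内存中的虚构玩具数据演示上面函数的两个核心步骤：pivot_table 把"城市 × 年份"的长表拼成宽表、再用 str.count 统计概念词词频：&lt;/p&gt;

```python
import pandas as pd

# 虚构的玩具数据：长表，每行是某城市某年的报告全文（仅作演示）
df = pd.DataFrame({
    'city': ['深圳市', '深圳市', '杭州市', '杭州市'],
    'year': [2020, 2021, 2020, 2021],
    'doc':  ['发展数字化经济', '建设大数据中心', '推进数字化转型', '普及宽带网络'],
})

# 步骤一：拼成 城市 x 年份 的宽表，单元格为该年报告文本
table_df = pd.pivot_table(df, columns='year', index='city', values='doc',
                          aggfunc=lambda cs: ''.join(str(c) for c in cs))

# 步骤二：对每个单元格统计概念词出现次数
count_df = table_df.apply(lambda col: col.str.count('数字化|大数据|宽带'))
print(count_df.loc['深圳市'].tolist())  # [1, 1]
```

&lt;p&gt;真实数据上把玩具数据换成 city.csv 读入的 df，再除以分词总词数即得到正文函数返回的词频占比面板。&lt;/p&gt;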
&lt;h3 id=&#34;14-生成面板数据&#34;&gt;1.4 生成面板数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#数字化关键词仅供参考&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;大数据|移动化|云端化|互联网化|智能化|云化|服务化|数字化|数智化|信息化|信息技术|电子政务|智能化|数字平台|移动应用|app|智慧化|网络化|智慧型|信息平台|综合信息平台|管理软件|saas|数据赋能|云端|互联网应用|智慧互联|数据化|上云|互联化|移动办公|数据驱动|可视化|在线化|rfid技术|云架构|协同化|一体化平台|云办公|信息服务平台|综合信息服务|数据服务平台|软件应用|数字化转型|云上|融合媒体|智能管理系统|互联网平台|aiot|ai+|智能物联|宽带|全面云化&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#所有城市，所有年度(2003-2024) 数字化&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;panel_data_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;generate_city_panel_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/city.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                                         &lt;span class=&#34;n&#34;&gt;concept_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;panel_data_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;shape&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#如果需要保存&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;panel_data_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;282city-digitalization2003-2024.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#panel_data_df.to_excel(&amp;#39;282city-digitalization2003-2024.xlsx&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;panel_data_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;二可视化&#34;&gt;二、可视化&lt;/h2&gt;
&lt;h3 id=&#34;21-plot_line&#34;&gt;2.1 plot_line&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;plot_line&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;panel_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;warnings&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;warnings&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filterwarnings&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;
    
    
    
    &lt;span class=&#34;n&#34;&gt;panel_df_T&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;panel_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;T&lt;/span&gt;

    &lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;panel_df_T&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 添加图例，并指定位置和偏移&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;loc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;upper right&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bbox_to_anchor&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;


    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xticks&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;年份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;词频&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-十城数字化&#34;&gt;2.2 十城数字化&lt;/h3&gt;
&lt;p&gt;按照我自己对城市的感知， 1-5线城市&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;深圳市&lt;/li&gt;
&lt;li&gt;杭州市 成都市 合肥市&lt;/li&gt;
&lt;li&gt;青岛市 长沙市 西安市&lt;/li&gt;
&lt;li&gt;哈尔滨市 石家庄市&lt;/li&gt;
&lt;li&gt;衡水市&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;咱们看看不同级别城市的数字化词频是否有显著的差异&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;深圳市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                  &lt;span class=&#34;s1&#34;&gt;&amp;#39;杭州市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;成都市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;合肥市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;s1&#34;&gt;&amp;#39;青岛市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;长沙市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;西安市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;s1&#34;&gt;&amp;#39;哈尔滨市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;石家庄市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;s1&#34;&gt;&amp;#39;衡水市&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#数字化关键词仅供参考(已去重并剔除空串; 空分支 || 会让正则在每个位置都匹配)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;大数据|移动化|云端化|互联网化|智能化|云化|服务化|数字化|数智化|信息化|信息技术|电子政务|数字平台|移动应用|app|智慧化|网络化|智慧型|信息平台|综合信息平台|管理软件|saas|数据赋能|云端|互联网应用|智慧互联|数据化|上云|互联化|移动办公|数据驱动|可视化|在线化|rfid技术|云架构|协同化|一体化平台|云办公|信息服务平台|综合信息服务|数据服务平台|软件应用|数字化转型|云上|融合媒体|智能管理系统|互联网平台|aiot|ai+|智能物联|宽带|全面云化&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#生成面板数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;panel_data_df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;generate_city_panel_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;GovReportData/city.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                                          &lt;span class=&#34;n&#34;&gt;concept_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;digitalization_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                                          &lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected_citys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#绘图&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plot_line&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;panel_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;panel_data_df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
          &lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;城市数字化词频(程度)折线图(2003-2024)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;从图中可以看到&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;2012年之前，数字化词频变动较大。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;衡水市&lt;/strong&gt;的数字化词频在2004、2007、2010年是所有城市中最高的，但在这三个时间点之间又处于局部最低点。&lt;/li&gt;
&lt;li&gt;2012年之后各城市整体呈下降趋势。可能的原因并非政府不重视数字化建设，恰恰是数字化问题逐步得到解决，不再迫切，也就不常提及。&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;p&gt;从政务数字化的实际实现程度(从常识出发)来看，杭州无疑是第一。但若用数字化词频高低来衡量数字化重视程度，衡水在若干年份是十个城市中的最高点，俨然成了最重视数字化的城市；而杭州政府工作报告中的数字化词频并不比其他地市突出，这令我很失望。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三总结&#34;&gt;三、总结&lt;/h2&gt;
&lt;p&gt;之前看到一篇论文研究人民网留言板问答中的政府回复行为， 控制变量使用的是政府数字化程度。&lt;/p&gt;
&lt;p&gt;论文使用政府工作报告数字化词语提及次数， 用来测量政府的数字化程度。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;但从今天的实验看，用数字化词频测量政府数字化程度并不可靠，要慎重使用&lt;/strong&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;四获取资料&#34;&gt;四、获取资料&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/&#34;&gt;数据集| 国、省、市三级政府工作报告文本&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;点击下载本文计算结果 &lt;a href=&#34;282city-digitalization2003-2024.zip&#34;&gt;&lt;strong&gt;282city-digitalization2003-2024.csv&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
</description>
<content:encoded><![CDATA[<p>使用 10 个城市 2003-2024 年的政府工作报告，绘制「<em><strong>数字化概念</strong></em>」词频的趋势图。直接上效果图</p>
<p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<br>
<h2 id="相关代码">相关代码</h2>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/">代码 | 使用地方gov工作报告生成某类概念词词频「面板数据」</a></li>
<li><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">数据集 | 国、省、市三级政府工作报告文本</a></li>
</ul>
<p><br><br></p>
<h2 id="一直接上代码">一、直接上代码</h2>
<h3 id="11-代码文件结构">1.1 代码文件结构</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">项目文件夹
   |---代码.ipynb
   |---GovReportData        #数据集 | 国、省、市三级政府工作报告文本
           |---city.csv     #市政府工作报告（2002-2024）
           |---province.csv #省政府工作报告（2002-2024）
           |---nation.csv   #国务院政府工作报告（2002-2024）
</code></pre></div><br>
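正式读取之前，可以先用一小段代码确认三个数据文件是否齐全（目录与文件名以上文结构示意为准，以下只是一个检查草稿）：

```python
from pathlib import Path

# 按上文的文件结构, 检查 GovReportData 目录下三个 csv 是否齐全
def check_data_files(root='GovReportData'):
    expected = ['city.csv', 'province.csv', 'nation.csv']
    missing = [name for name in expected if not (Path(root) / name).exists()]
    return missing  # 返回缺失的文件名列表, 为空表示数据齐全

print(check_data_files())
```

若返回非空列表，请先到文末「获取资料」部分补齐对应数据文件。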
<h3 id="12-读取数据">1.2 读取数据</h3>
<p>读取地级市报告数据文件 <strong>city.csv</strong> ，<a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">点击链接，获取政府工作报告数据集</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;GovReportData/city.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="13-设计面板数据生成函数">1.3 设计面板数据生成函数</h3>
<p>假设你使用的是大邓提供的城市政府工作报告数据，可以直接使用下面封装的函数，快速生成指定概念词典、指定城市、指定年度区间的面板数据。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">generate_city_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">concept_words</span><span class="p">,</span> <span class="n">selected_citys</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">selected_years</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    csvf: csv的文件路径
</span><span class="s2">    concept_words: 概念词词语列表
</span><span class="s2">    selected_citys: 筛选指定城市的数据进行计算，列表
</span><span class="s2">    selected_years: 筛选指定年度的数据进行计算，列表
</span><span class="s2">    
</span><span class="s2">    结果返回dataframe， 每一行代表一个城市，每一列代表一年。
</span><span class="s2">    &#34;&#34;&#34;</span>
    
    <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
    <span class="kn">import</span> <span class="nn">jieba</span>
    
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">csvf</span><span class="p">)</span>
    <span class="n">table_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> 
                       <span class="n">columns</span><span class="o">=</span><span class="s1">&#39;year&#39;</span><span class="p">,</span>  <span class="c1">#列-年份</span>
                       <span class="n">index</span><span class="o">=</span><span class="s1">&#39;city&#39;</span><span class="p">,</span>    <span class="c1">#行-城市  </span>
                       <span class="n">values</span><span class="o">=</span><span class="s1">&#39;doc&#39;</span><span class="p">,</span>   <span class="c1">#单元格-文本</span>
                       <span class="n">aggfunc</span><span class="o">=</span><span class="k">lambda</span> <span class="n">cs</span><span class="p">:</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cs</span><span class="p">))</span> <span class="c1">#让单元格填充文本</span>

    <span class="k">if</span> <span class="n">selected_citys</span><span class="p">:</span>
        <span class="n">table_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="p">[</span><span class="n">table_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_citys</span><span class="p">)]</span>
    
    <span class="k">if</span> <span class="n">selected_years</span><span class="p">:</span>
        <span class="n">selected_years</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">selected_years</span><span class="p">]</span>
        <span class="n">table_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="p">[</span><span class="n">selected_years</span><span class="p">]</span>
        

    <span class="n">word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">t</span><span class="p">))))</span>
    <span class="n">concept_word_count_df</span> <span class="o">=</span> <span class="n">table_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;|&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">concept_words</span><span class="p">)))</span>
    <span class="n">concept_word_ratio_df</span> <span class="o">=</span> <span class="n">concept_word_count_df</span><span class="o">/</span><span class="n">word_count_df</span>
    <span class="k">return</span> <span class="n">concept_word_ratio_df</span>
</code></pre></div><br>
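上面函数的关键一步，是用 <code>pd.pivot_table</code> 把「一行一份报告」的长表变成「行为城市、列为年份、单元格为报告全文」的宽表。下面用一个虚构的小数据演示这一步的效果（数据为编造，仅作示意）：

```python
import pandas as pd

# 虚构小数据, 模拟 city.csv 的 city / year / doc 三个字段
df = pd.DataFrame({
    'city': ['杭州市', '杭州市', '衡水市', '衡水市'],
    'year': [2023, 2024, 2023, 2024],
    'doc':  ['数字化 转型', '云计算', '农业', '数字化'],
})

# 行-城市, 列-年份, 单元格-拼接后的报告全文
table_df = pd.pivot_table(df, index='city', columns='year', values='doc',
                          aggfunc=lambda cs: ''.join(str(c) for c in cs))

print(table_df)
```

之后对这个宽表逐格做分词计数、概念词计数，两者相除即得到词频占比的面板数据。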
<h3 id="14-生成面板数据">1.4 生成面板数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>


<span class="c1">#数字化关键词仅供参考(已去重并剔除空串; 空分支 || 会让正则在每个位置都匹配)</span>
<span class="n">digitalization_words</span> <span class="o">=</span> <span class="s1">&#39;大数据|移动化|云端化|互联网化|智能化|云化|服务化|数字化|数智化|信息化|信息技术|电子政务|数字平台|移动应用|app|智慧化|网络化|智慧型|信息平台|综合信息平台|管理软件|saas|数据赋能|云端|互联网应用|智慧互联|数据化|上云|互联化|移动办公|数据驱动|可视化|在线化|rfid技术|云架构|协同化|一体化平台|云办公|信息服务平台|综合信息服务|数据服务平台|软件应用|数字化转型|云上|融合媒体|智能管理系统|互联网平台|aiot|ai+|智能物联|宽带|全面云化&#39;</span>
<span class="n">digitalization_words</span> <span class="o">=</span> <span class="n">digitalization_words</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;|&#39;</span><span class="p">)</span>


<span class="c1">#所有城市，所有年度(2003-2024) 数字化</span>
<span class="n">panel_data_df</span> <span class="o">=</span> <span class="n">generate_city_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/city.csv&#39;</span><span class="p">,</span> 
                                         <span class="n">concept_words</span> <span class="o">=</span> <span class="n">digitalization_words</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="n">panel_data_df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1">#如果需要保存</span>
<span class="n">panel_data_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;282city-digitalization2003-2024.csv&#39;</span><span class="p">)</span>
<span class="c1">#panel_data_df.to_excel(&#39;282city-digitalization2003-2024.xlsx&#39;)</span>

<span class="n">panel_data_df</span>
</code></pre></div><p>Run</p>
<p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
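顺带提醒：<code>Series.str.count</code> 会把传入的模式按正则表达式解释。关键词串里若混入空分支（形如 <code>&#39;大数据||ai+&#39;</code> 中的空串），空串会在每个位置都匹配，词频被严重高估；<code>ai+</code> 这类含正则特殊字符的词也会被误读。下面是一个最小示意（例句为编造），演示先剔除空串、再用 <code>re.escape</code> 转义后计数：

```python
import re
import pandas as pd

s = pd.Series(['发展ai+与大数据'])

# 空分支 || 会让正则在每个位置都命中一次, 计数虚高
bad = s.str.count('大数据||ai+')

# 先剔除空串, 再对每个词做 re.escape, 计数才是真实词频
words = [w for w in '大数据||ai+'.split('|') if w]
good = s.str.count('|'.join(re.escape(w) for w in words))

print(int(bad.iloc[0]), int(good.iloc[0]))
```

整理关键词表时做一次这样的清洗，可以避免词频占比被系统性抬高。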
<br>
<br>
<h2 id="二可视化">二、可视化</h2>
<h3 id="21-plot_line">2.1 plot_line</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
    <span class="kn">import</span> <span class="nn">matplotlib</span>
    <span class="kn">import</span> <span class="nn">scienceplots</span>
    <span class="kn">import</span> <span class="nn">platform</span>
    <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
    <span class="kn">import</span> <span class="nn">matplotlib_inline</span>
    <span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
    <span class="kn">import</span> <span class="nn">jieba</span>
    <span class="kn">import</span> <span class="nn">warnings</span>
    <span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>

    <span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
    <span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

    <span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
    <span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
    <span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
    
    
    <span class="n">panel_df_T</span> <span class="o">=</span> <span class="n">panel_df</span><span class="o">.</span><span class="n">T</span>

    <span class="n">ax</span> <span class="o">=</span> <span class="n">panel_df_T</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
    <span class="c1"># 添加图例，并指定位置和偏移</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper right&#39;</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.15</span><span class="p">,</span> <span class="mf">1.05</span><span class="p">))</span>


    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;词频&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>

    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><br>
<h3 id="22-十城数字化">2.2 十城数字化</h3>
<p>按照我自己对城市的感知，将这十个城市分为 1-5 线：</p>
<ol>
<li>深圳市</li>
<li>杭州市 成都市 合肥市</li>
<li>青岛市 长沙市 西安市</li>
<li>哈尔滨市 石家庄市</li>
<li>衡水市</li>
</ol>
<p>咱们看看不同级别城市的数字化词频是否有显著的差异</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">selected_citys</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;深圳市&#39;</span><span class="p">,</span>
                  <span class="s1">&#39;杭州市&#39;</span><span class="p">,</span> <span class="s1">&#39;成都市&#39;</span><span class="p">,</span> <span class="s1">&#39;合肥市&#39;</span><span class="p">,</span> 
                  <span class="s1">&#39;青岛市&#39;</span><span class="p">,</span> <span class="s1">&#39;长沙市&#39;</span><span class="p">,</span> <span class="s1">&#39;西安市&#39;</span><span class="p">,</span> 
                  <span class="s1">&#39;哈尔滨市&#39;</span><span class="p">,</span> <span class="s1">&#39;石家庄市&#39;</span><span class="p">,</span> 
                  <span class="s1">&#39;衡水市&#39;</span><span class="p">]</span>

<span class="c1">#数字化关键词仅供参考(已去重并剔除空串; 空分支 || 会让正则在每个位置都匹配)</span>
<span class="n">digitalization_words</span> <span class="o">=</span> <span class="s1">&#39;大数据|移动化|云端化|互联网化|智能化|云化|服务化|数字化|数智化|信息化|信息技术|电子政务|数字平台|移动应用|app|智慧化|网络化|智慧型|信息平台|综合信息平台|管理软件|saas|数据赋能|云端|互联网应用|智慧互联|数据化|上云|互联化|移动办公|数据驱动|可视化|在线化|rfid技术|云架构|协同化|一体化平台|云办公|信息服务平台|综合信息服务|数据服务平台|软件应用|数字化转型|云上|融合媒体|智能管理系统|互联网平台|aiot|ai+|智能物联|宽带|全面云化&#39;</span>
<span class="n">digitalization_words</span> <span class="o">=</span> <span class="n">digitalization_words</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;|&#39;</span><span class="p">)</span>


<span class="c1">#生成面板数据</span>
<span class="n">panel_data_df2</span> <span class="o">=</span> <span class="n">generate_city_panel_data</span><span class="p">(</span><span class="n">csvf</span><span class="o">=</span><span class="s1">&#39;GovReportData/city.csv&#39;</span><span class="p">,</span> 
                                          <span class="n">concept_words</span> <span class="o">=</span> <span class="n">digitalization_words</span><span class="p">,</span> 
                                          <span class="n">selected_citys</span> <span class="o">=</span> <span class="n">selected_citys</span><span class="p">)</span>

<span class="c1">#绘图</span>
<span class="n">plot_line</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">panel_data_df2</span><span class="p">,</span> 
          <span class="n">title</span><span class="o">=</span><span class="s1">&#39;城市数字化词频(程度)折线图(2003-2024)&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<br>
<p>从图中可以看到</p>
<ol>
<li>2012年之前，数字化词频变动较大。</li>
<li><strong>衡水市</strong>的数字化词频在2004、2007、2010年是所有城市中最高的，但在这三个时间点之间又处于局部最低点。</li>
<li>2012年之后各城市整体呈下降趋势。可能的原因并非政府不重视数字化建设，恰恰是数字化问题逐步得到解决，不再迫切，也就不常提及。</li>
</ol>
<br>
<p>从政务数字化的实际实现程度(从常识出发)来看，杭州无疑是第一。但若用数字化词频高低来衡量数字化重视程度，衡水在若干年份是十个城市中的最高点，俨然成了最重视数字化的城市；而杭州政府工作报告中的数字化词频并不比其他地市突出，这令我很失望。</p>
<p><br><br></p>
<h2 id="三总结">三、总结</h2>
<p>之前看到一篇论文研究人民网留言板问答中的政府回复行为， 控制变量使用的是政府数字化程度。</p>
<p>论文使用政府工作报告数字化词语提及次数， 用来测量政府的数字化程度。</p>
<p><strong>但从今天的实验看，用数字化词频测量政府数字化程度并不可靠，要慎重使用</strong>。</p>
<br>
<br>
<h2 id="四获取资料">四、获取资料</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-12-17-gov-anual-report-dataset/">数据集| 国、省、市三级政府工作报告文本</a></p>
</li>
<li>
<p>点击下载本文计算结果 <a href="282city-digitalization2003-2024.zip"><strong>282city-digitalization2003-2024.csv</strong></a></p>
</li>
</ul>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 人民网地方领导留言板原始文本(2011-2023.12)</title>
      <link>https://textdata.cn/blog/2023-12-22-renmin-gov-leader-comment-board/</link>
      <pubDate>Fri, 22 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-22-renmin-gov-leader-comment-board/</guid>
      <description>&lt;img src=&#34;img/04-dataset.png&#34; style=&#34;zoom:80%;&#34; /&gt;
&lt;br&gt;
&lt;h2 id=&#34;一数据集&#34;&gt;一、数据集&lt;/h2&gt;
&lt;h3 id=&#34;11-概况&#34;&gt;1.1 概况&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据来源: 人民网地方领导留言板

覆盖时间: 2011-01-01 ~ 2023.12.06

记录条数: 3914385

文件格式: xlsx、csv
    
所含字段:
 -  留言领导
 -  留言标题
 -  省份
 -  市
 -  状态
 -  主题类别
 -  投诉种类
 -  留言人
 -  留言时间
 -  留言内容
 -  回复内容
 -  回复时间
 -  回复机构
 -  办理速度评分(该字段出现在2019之后)
 -  办理态度评分(该字段出现在2019之后)
 -  解决程度评分(该字段出现在2019之后)
 -  用户评价(该字段出现在2019之后)
 -  评价标签(该字段出现在2019之后)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;12-说明&#34;&gt;1.2 说明&lt;/h3&gt;
&lt;p&gt;科研用途展示； 如有问题， 加微信 372335839， 备注「姓名-学校-专业-留言板」。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-相关研究&#34;&gt;1.3 相关研究&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;
[1] 郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18(03): 24-37+169.
[2] 王磊, 易扬. 公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J]. 公共管理学报, 2022, 19(04): 65-78+169.
[3] Lu, Liangdong, Jia Xu, and Jiuchang Wei. &amp;#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&amp;#34; Telematics and Informatics 83 (2023): 102028.
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;img src=&#34;img/2023a.png&#34; style=&#34;zoom:80%;&#34; /&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/2023b.png&#34; style=&#34;zoom:80%;&#34; /&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二查看数据&#34;&gt;二、查看数据&lt;/h2&gt;
&lt;h3 id=&#34;21-读取数据&#34;&gt;2.1 读取数据&lt;/h3&gt;
&lt;p&gt;依次读取 &lt;em&gt;&lt;strong&gt;2011-2019.csv.gz&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;2020-2023.csv.gz&lt;/strong&gt;&lt;/em&gt; 两个文件：先把 &lt;em&gt;&lt;strong&gt;.csv.gz&lt;/strong&gt;&lt;/em&gt; 解压得到 &lt;em&gt;&lt;strong&gt;.csv&lt;/strong&gt;&lt;/em&gt; 再读取，或用 pandas 直接读取压缩文件。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2011-2019.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df11_19 = pd.read_csv(&amp;#39;2011-2019.csv.gz&amp;#39;, compression=&amp;#39;gzip&amp;#39;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2020-2023.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df20_23 = pd.read_csv(&amp;#39;2020-2023.csv.gz&amp;#39;, compression=&amp;#39;gzip&amp;#39;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-字段&#34;&gt;2.2 字段&lt;/h3&gt;
&lt;p&gt;10多年的时间里，网站会变动，写爬虫、运行爬虫的人也会变动。为了让大家更顺畅地使用数据，大邓对所有年份的数据做了字段校正和统一，最终字段只有两大类：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2011-2019&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2020-2023&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2011-2019
Index([&amp;#39;留言领导&amp;#39;, &amp;#39;留言标题&amp;#39;, &amp;#39;省份&amp;#39;, &amp;#39;市&amp;#39;, &amp;#39;状态&amp;#39;, &amp;#39;主题类别&amp;#39;, &amp;#39;投诉种类&amp;#39;, &amp;#39;留言人&amp;#39;, &amp;#39;留言时间&amp;#39;, &amp;#39;留言内容&amp;#39;, &amp;#39;回复机构&amp;#39;, 
       &amp;#39;回复内容&amp;#39;, &amp;#39;回复时间&amp;#39;, &amp;#39;留言评价&amp;#39;, &amp;#39;评价时间&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)


2020-2023
Index([&amp;#39;留言领导&amp;#39;, &amp;#39;留言标题&amp;#39;, &amp;#39;省份&amp;#39;, &amp;#39;市&amp;#39;, &amp;#39;状态&amp;#39;, &amp;#39;主题类别&amp;#39;, &amp;#39;投诉种类&amp;#39;, &amp;#39;留言人&amp;#39;, &amp;#39;留言时间&amp;#39;, &amp;#39;留言内容&amp;#39;,
       &amp;#39;回复内容&amp;#39;, &amp;#39;回复时间&amp;#39;, &amp;#39;回复机构&amp;#39;, &amp;#39;办理速度评分&amp;#39;, &amp;#39;办理态度评分&amp;#39;, &amp;#39;解决程度评分&amp;#39;, &amp;#39;用户评价&amp;#39;, &amp;#39;评价标签&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-记录数&#34;&gt;2.3 记录数&lt;/h3&gt;
&lt;p&gt;数据集总记录数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;总记录数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;总记录数: 3914385
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-每年是否包含年末数据&#34;&gt;2.4 每年是否包含年末数据&lt;/h3&gt;
&lt;p&gt;由于人民网只支持“&lt;strong&gt;可查询留言为上一年1月1日至今的所有留言&lt;/strong&gt;”，有同学没看懂这句话的含义，担心每年12月末或次年1月初的数据是否缺失。这里我们检查数据集中每年最早的留言是否为1月1日、最晚的是否为12月31日。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
    
    
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2011 2011-01-01 2011-12-31
2012 2012-01-01 2012-12-31
2013 2013-01-01 2013-12-31
2014 2014-01-01 2014-12-31
2015 2015-01-01 2015-12-31
2016 2016-01-01 2016-12-31
2017 2017-01-01 2017-12-31
2018 2018-01-01 2018-12-31
2019 2019-01-01 2019-12-31
2020 2020-01-01 2020-12-31
2021 2021-01-01 2021-12-31
2022 2022-01-01 2022-12-31
2023 2023-01-01 2023-12-06
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The crawl was run on 2023.12.6, so the 2023 data stops at 2023.12.6. No need to worry: the next data update will extend coverage through 2023.12.31.&lt;/p&gt;
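The per-year coverage check above can also be written as a single `groupby(...).agg(...)` call instead of a print loop. A minimal sketch on toy data (the column name `留言时间` matches the dataset; the frame here is a stand-in, not the real data):

```python
import pandas as pd

# Toy frame standing in for df11_19 / df20_23
df = pd.DataFrame({'留言时间': pd.to_datetime(
    ['2011-01-01', '2011-12-31', '2012-01-01', '2012-06-15'])})

# One row per year with the earliest and latest message timestamps
coverage = df.groupby(df['留言时间'].dt.year)['留言时间'].agg(['min', 'max'])
print(coverage)
```

`agg(['min', 'max'])` returns a small DataFrame that can be inspected at a glance, rather than printing year by year.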
&lt;br&gt;
&lt;h3 id=&#34;25-年度记录数&#34;&gt;2.5 Records per year&lt;/h3&gt;
&lt;p&gt;Both dataframes contain the field &lt;em&gt;&lt;strong&gt;留言时间&lt;/strong&gt;&lt;/em&gt;, which we use to count records per year. First, the field must be converted to the datetime type.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)})&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df20_23&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;volume&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)})&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2011 23307
2012 20178
2013 42950
2014 97640
2015 131930
2016 201525
2017 202793
2018 243648
2019 464622
2020 517167
2021 783139
2022 648055
2023 537422
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
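The same yearly tallies can be obtained without an explicit loop by concatenating the two date columns and calling `value_counts`. A sketch with toy stand-ins for the two real dataframes:

```python
import pandas as pd

# Toy stand-ins for the two real dataframes
df11_19 = pd.DataFrame({'留言时间': pd.to_datetime(['2011-03-01', '2011-05-02'])})
df20_23 = pd.DataFrame({'留言时间': pd.to_datetime(['2020-07-04'])})

# Count records per year across both frames in one expression
counts = (pd.concat([df11_19['留言时间'], df20_23['留言时间']])
            .dt.year
            .value_counts()
            .sort_index())
print(counts)
```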
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;warnings&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;warnings&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filterwarnings&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;
    
&lt;span class=&#34;n&#34;&gt;year_volume_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#year_volume_df[&amp;#39;year&amp;#39;] = pd.to_datetime(year_volume_df[&amp;#39;year&amp;#39;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;year_volume_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;year_volume_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kind&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bar&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;人民网留言板留言数量(2011 ~ 2023)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xticks&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rotation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;年份&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言数量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/plot.png&#34; alt=&#34;Bar chart of yearly message volume, 2011-2023&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;Note that the collected volume will inevitably differ from the true volume: the time at which the crawler ran, IP bans, failed requests, and file-encoding (format) issues all cause some records to be lost.&lt;/p&gt;
&lt;p&gt;For quantitative text analysis in Python, however, this is not a real concern. Large-scale data mining of the kind Python enables is feasible and meaningful as long as &lt;strong&gt;Earnings (information gained from scale) &amp;raquo; Loss (cost of imperfect data quality)&lt;/strong&gt;. Our dataset holds nearly 4 million records, and its quality is solid.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;26-value_counts&#34;&gt;2.6 value_counts&lt;/h3&gt;
&lt;p&gt;Count the 2011-2019 records by message &lt;em&gt;&lt;strong&gt;主题类别&lt;/strong&gt;&lt;/em&gt; (topic category).&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#2011-2019&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;主题类别&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;主题类别
城建    474413
交通    180195
其他    177262
三农    116151
环保     94344
教育     90603
政务     69910
治安     63752
就业     47854
医疗     37215
企业     36826
旅游     18675
文娱      9866
金融      6778
征集      4741
求助         3
咨询         2
建言         2
投诉         1
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
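Passing `normalize=True` to `value_counts` returns shares instead of raw counts, which makes categories easier to compare directly. A small sketch on a toy topic-category column:

```python
import pandas as pd

# Toy topic-category column
s = pd.Series(['城建', '城建', '交通', '环保'])

share = s.value_counts(normalize=True)  # fractions summing to 1
print(share)
```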
&lt;h3 id=&#34;27-查看是否含某词&#34;&gt;2.7 Check whether a word appears&lt;/h3&gt;
&lt;p&gt;Check whether the field &lt;em&gt;&lt;strong&gt;留言内容&lt;/strong&gt;&lt;/em&gt; contains words such as &lt;em&gt;&lt;strong&gt;扰民|噪音&lt;/strong&gt;&lt;/em&gt; (disturbance or noise).&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;扰民|噪音&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;0          False
1          False
2          False
3          False
4          False
           ...  
1428614    False
1428615    False
1428616    False
1428617    False
1428618    False
Name: 留言内容, Length: 1428619, dtype: bool
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Number of noise-related messages&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;扰民|噪音&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;57845
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Share of noise-related messages among all messages&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;留言内容&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;扰民|噪音&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df11_19&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;0.04049063350044309
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;About 4% of the messages relate to disturbance or noise.&lt;/p&gt;
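The same boolean mask can also be grouped by year to see how the noise-related share evolves over time, since the mean of a boolean mask is the fraction of `True` values. A sketch on toy data (the messages here are invented examples):

```python
import pandas as pd

# Toy stand-in with invented messages
df = pd.DataFrame({
    '留言时间': pd.to_datetime(['2011-02-01', '2011-03-01', '2012-04-01']),
    '留言内容': ['楼下施工噪音太大', '道路维修问题', '广场舞扰民'],
})

# True where the message mentions disturbance or noise
mask = df['留言内容'].fillna('').str.contains('扰民|噪音')

# Mean of a boolean mask per year = yearly share of matching messages
yearly_share = mask.groupby(df['留言时间'].dt.year).mean()
print(yearly_share)
```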
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三-相关研究&#34;&gt;3. Related research&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
王磊,易扬.公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J].公共管理学报,2022,19(04):65-78+169.
Lu, Liangdong, Jia Xu, and Jiuchang Wei. &amp;#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&amp;#34; Telematics and Informatics 83 (2023): 102028.
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;四相关代码&#34;&gt;四、相关代码&lt;/h2&gt;
&lt;p&gt;想用 python 对 csv、xlsx 进行分析， 要学会尽量用 pandas 写代码。 以下是近期 pandas 的一些处理推文免费教程， 感兴趣的可以进去浏览浏览。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2024-06-05-wenzheng-hunan-dataset/&#34;&gt;数据集(付费) | 30w条「问政湖南」领导留言回复记录(2010-2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-29-china-area-dataset/&#34;&gt;数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/&#34;&gt;词向量  | 使用&lt;strong&gt;人民网领导留言板&lt;/strong&gt;语料训练Word2Vec模型&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/&#34;&gt;&lt;strong&gt;代码 | 使用地方gov工作报告生成某类概念词频「面板数据」&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/&#34;&gt;&lt;strong&gt;代码 | 使用「新闻数据」构造概念词提及量「面板数据」&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/&#34;&gt;&lt;strong&gt;数据代码| 使用cctv新闻联播文稿构造「面板数据」&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2028-12-18-how-to-extract-data-from-patent-application-dataset/&#34;&gt;&lt;strong&gt;代码 | 使用3571w专利申请数据集构造「面板数据」&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/&#34;&gt;&lt;strong&gt;代码 | 使用「新闻数据」计算 「经济政策不确定性」指数&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<img src="img/04-dataset.png" style="zoom:80%;" />
<br>
<h2 id="一数据集">一、数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源: 人民网地方领导留言板

覆盖时间: 2011-01-01 ~ 2023-12-06

记录条数: 3914385

文件格式: xlsx、csv
    
所含字段:
 -  留言领导
 -  留言标题
 -  省份
 -  市
 -  状态
 -  主题类别
 -  投诉种类
 -  留言人
 -  留言时间
 -  留言内容
 -  回复内容
 -  回复时间
 -  回复机构
 -  办理速度评分(该字段出现在2019之后)
 -  办理态度评分(该字段出现在2019之后)
 -  解决程度评分(该字段出现在2019之后)
 -  用户评价(该字段出现在2019之后)
 -  评价标签(该字段出现在2019之后)
</code></pre></div><br>
<h3 id="12-说明">1.2 说明</h3>
<p>数据仅作科研用途展示； 如有问题， 加微信 372335839， 备注「姓名-学校-专业-留言板」。</p>
<br>
<h3 id="13-相关研究">1.3 相关研究</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">
[1] 郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
[2] 王磊, 易扬. 公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J]. 公共管理学报, 2022, 19 (04): 65-78+169.
[3] Lu, Liangdong, Jia Xu, and Jiuchang Wei. &#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&#34; Telematics and Informatics 83 (2023): 102028.
...
</code></pre></div><p><br><img src="img/2023a.png" style="zoom:80%;" /><br></p>
<p><img src="img/2023b.png" style="zoom:80%;" /><br><br></p>
<h2 id="二查看数据">二、查看数据</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<p>依次读取<em><strong>2011-2019.csv.gz</strong></em> 和 <em><strong>2020-2023.csv.gz</strong></em> 两份数据：既可以先把 <em><strong>.csv.gz</strong></em> 解压为 <em><strong>.csv</strong></em> 再读取，也可以让 pandas 直接读取压缩文件（见下方注释行）。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="n">df11_19</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2011-2019.csv&#39;</span><span class="p">)</span>
<span class="c1">#df11_19 = pd.read_csv(&#39;2011-2019.csv.gz&#39;, compression=&#39;gzip&#39;)</span>

<span class="n">df11_19</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df20_23</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;2020-2023.csv&#39;</span><span class="p">)</span>
<span class="c1">#df20_23 = pd.read_csv(&#39;2020-2023.csv.gz&#39;, compression=&#39;gzip&#39;)</span>
<span class="n">df20_23</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<br>
<h3 id="22-字段">2.2 字段</h3>
<p>十多年间，网站会改版，写爬虫、运行爬虫的人也会变动。为了让大家更顺畅地使用数据，大邓对所有年份的字段做了矫正和统一，最终只剩两套字段（2011-2019 与 2020-2023 各一套）：</p>
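<p>这类字段矫正与统一的做法，大致可以用 pandas 的 <code>rename</code> 与 <code>reindex</code> 来示意（示意代码，其中的旧字段名映射只是假设的例子，并非该数据集的真实字段差异）：</p>

```python
import pandas as pd

# 假设某一年的原始表用了旧字段名(此处映射仅为示例, 非真实字段)
raw = pd.DataFrame({'标题': ['某小区噪音扰民'], '时间': ['2015-03-01']})

# 统一后的目标字段顺序(节选)
target_cols = ['留言标题', '留言时间', '留言内容']

df = (raw.rename(columns={'标题': '留言标题', '时间': '留言时间'})  # 旧名 -> 新名
         .reindex(columns=target_cols))                            # 缺失字段自动补 NaN

print(df.columns.tolist())
```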
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;2011-2019&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df11_19</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;2020-2023&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df20_23</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2011-2019
Index([&#39;留言领导&#39;, &#39;留言标题&#39;, &#39;省份&#39;, &#39;市&#39;, &#39;状态&#39;, &#39;主题类别&#39;, &#39;投诉种类&#39;, &#39;留言人&#39;, &#39;留言时间&#39;, &#39;留言内容&#39;, &#39;回复机构&#39;, 
       &#39;回复内容&#39;, &#39;回复时间&#39;, &#39;留言评价&#39;, &#39;评价时间&#39;],
      dtype=&#39;object&#39;)


2020-2023
Index([&#39;留言领导&#39;, &#39;留言标题&#39;, &#39;省份&#39;, &#39;市&#39;, &#39;状态&#39;, &#39;主题类别&#39;, &#39;投诉种类&#39;, &#39;留言人&#39;, &#39;留言时间&#39;, &#39;留言内容&#39;,
       &#39;回复内容&#39;, &#39;回复时间&#39;, &#39;回复机构&#39;, &#39;办理速度评分&#39;, &#39;办理态度评分&#39;, &#39;解决程度评分&#39;, &#39;用户评价&#39;, &#39;评价标签&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><br>
<h3 id="23-记录数">2.3 记录数</h3>
<p>数据集总记录数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;总记录数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df11_19</span><span class="p">)</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">df20_23</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">总记录数: 3914385
</code></pre></div><br>
<h3 id="24-每年是否包含年末数据">2.4 每年是否包含年末数据</h3>
<p>由于人民网提示“<strong>可查询留言为上一年1月1日至今的所有留言</strong>”，有同学没看懂这句话的含义，担心每年12月末或1月初的数据会缺失。这里我们检查数据集中每年最早的留言是否为 1月1日、最晚的是否为 12月31日。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 先将 留言时间 转换为 datetime 类型, 才能使用 .dt 访问器</span>
<span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">])</span>
<span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">])</span>

<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">df11_19</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
    
    
<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">df20_23</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span> <span class="n">year_df</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="mi">2011</span> <span class="mi">2011</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2011</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2012</span> <span class="mi">2012</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2012</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2013</span> <span class="mi">2013</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2013</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2014</span> <span class="mi">2014</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2014</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2015</span> <span class="mi">2015</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2015</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2016</span> <span class="mi">2016</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2016</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2017</span> <span class="mi">2017</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2017</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2018</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2019</span> <span class="mi">2019</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2019</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2020</span> <span class="mi">2020</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2020</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2021</span> <span class="mi">2021</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2021</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2022</span> <span class="mi">2022</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2022</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span>
<span class="mi">2023</span> <span class="mi">2023</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">01</span> <span class="mi">2023</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">06</span>
</code></pre></div><p>因为爬虫是在 2023-12-06 运行的，数据日期也截止到 2023-12-06。不用担心，下次更新数据时会覆盖到 2023-12-31。</p>
<br>
<h3 id="25-年度记录数">2.5 年度记录数</h3>
<p>两个 dataframe 中都有 <em><strong>留言时间</strong></em> 字段，我们根据该字段查看每个年份的记录数。首先，要将该字段转化为 datetime 日期类型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">])</span>
<span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">])</span>

<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">df11_19</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="p">):</span>
    <span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">year</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">year_df</span><span class="p">)})</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="s1">&#39; &#39;</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">year_df</span><span class="p">))</span>

<span class="k">for</span> <span class="n">year</span><span class="p">,</span> <span class="n">year_df</span> <span class="ow">in</span> <span class="n">df20_23</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">df20_23</span><span class="p">[</span><span class="s1">&#39;留言时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="p">):</span>
    <span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">&#39;year&#39;</span><span class="p">:</span> <span class="n">year</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">year_df</span><span class="p">)})</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="s1">&#39; &#39;</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">year_df</span><span class="p">))</span>
    

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2011 23307
2012 20178
2013 42950
2014 97640
2015 131930
2016 201525
2017 202793
2018 243648
2019 464622
2020 517167
2021 783139
2022 648055
2023 537422
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>
<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">year_volume_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1">#year_volume_df[&#39;year&#39;] = pd.to_datetime(year_volume_df[&#39;year&#39;])</span>
<span class="n">year_volume_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">year_volume_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;人民网留言板留言数量(2011 ~ 2023)&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;留言数量&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p>需要说明，采集到的数据量与真实数据量肯定会有出入：爬虫运行的时间点、IP 被封、请求失败、文件编码(格式)问题等，都会造成一定量的记录缺失。</p>
<p>但做 Python 定量文本分析时不必担心这个问题。以 Python 为代表的大规模数据挖掘，只要满足 <strong>Earnings(规模带来的信息增益) &raquo; Loss(数据质量产生的损失)</strong>，文本分析就是可行且有意义的。而咱们这份数据规模近 400 万条，数据质量也有保证。</p>
<p><br><br></p>
<h3 id="26-value_counts">2.6 value_counts</h3>
<p>查看 2011-2019 年不同 <em><strong>主题类别</strong></em> 的记录数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#2011-2019</span>
<span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;主题类别&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">主题类别
城建    474413
交通    180195
其他    177262
三农    116151
环保     94344
教育     90603
政务     69910
治安     63752
就业     47854
医疗     37215
企业     36826
旅游     18675
文娱      9866
金融      6778
征集      4741
求助         3
咨询         2
建言         2
投诉         1
Name: count, dtype: int64
</code></pre></div><br>
<h3 id="27-查看是否含某词">2.7 查看是否含某词</h3>
<p>查看字段 <em><strong>留言内容</strong></em> 中是否出现 <em><strong>扰民|噪音</strong></em> 等词语（<code>str.contains</code> 默认按正则表达式解释，<code>|</code> 表示“或”）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;扰民|噪音&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0          False
1          False
2          False
3          False
4          False
           ...  
1428614    False
1428615    False
1428616    False
1428617    False
1428618    False
Name: 留言内容, Length: 1428619, dtype: bool
</code></pre></div><br>
<p>涉及扰民、噪音的留言记录数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;扰民|噪音&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">57845
</code></pre></div><br>
<p>涉及扰民、噪音的留言占总留言数的比例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df11_19</span><span class="p">[</span><span class="s1">&#39;留言内容&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;扰民|噪音&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df11_19</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0.04049063350044309
</code></pre></div><p>约有 4% 的留言与扰民、噪音相关。</p>
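<p>沿着同样的思路，还可以按年份统计扰民、噪音类留言的占比，得到一条时间序列（示意代码，下面用几条构造的样例留言演示，真实使用时把 <code>df</code> 换成 <code>df11_19</code> 即可）：</p>

```python
import pandas as pd

# 构造几条样例留言(真实使用时替换为 df11_19)
df = pd.DataFrame({
    '留言时间': pd.to_datetime(['2018-05-01', '2018-07-02', '2019-03-03', '2019-08-04']),
    '留言内容': ['楼下广场舞噪音扰民', '咨询社保问题', '工地夜间施工噪音', None],
})

# 是否提及扰民/噪音的布尔列; 按年份求均值即为该年占比
hit = df['留言内容'].fillna('').str.contains('扰民|噪音')
ratio_by_year = hit.groupby(df['留言时间'].dt.year).mean()
print(ratio_by_year)
```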
<p><br><br></p>
<h2 id="三-相关研究">三、 相关研究</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">郑石明, 兰雨潇, 黎枫. 网络公共舆论与政府回应的互动逻辑——基于新冠肺炎疫情期间“领导留言板”的数据分析[J]. 公共管理学报, 2021, 18 (03): 24-37+169.
王磊,易扬.公共卫生危机中的数字政府回应如何纾解网络负面舆情——基于人民网“领导留言板”回复情况的调查[J].公共管理学报,2022,19(04):65-78+169.
Lu, Liangdong, Jia Xu, and Jiuchang Wei. &#34;Understanding the effects of the textual complexity on government communication: Insights from China’s online public service platform.&#34; Telematics and Informatics 83 (2023): 102028.
...
</code></pre></div><h2 id="四相关代码">四、相关代码</h2>
<p>想用 Python 分析 csv、xlsx 等数据，建议尽量用 pandas 来写代码。以下是近期一些与 pandas 数据处理相关的免费推文教程和数据资源，感兴趣的可以浏览。</p>
<ul>
<li><a href="https://textdata.cn/blog/2024-06-05-wenzheng-hunan-dataset/">数据集(付费) | 30w条「问政湖南」领导留言回复记录(2010-2024)</a></li>
<li><a href="https://textdata.cn/blog/2023-12-29-china-area-dataset/">数据集 | 2024年中国全国5级行政区划（省、市、县、镇、村）</a></li>
<li><a href="https://textdata.cn/blog/2023-12-28-train-word2vec-using-renmin-gov-leader-board-dataset/">词向量  | 使用<strong>人民网领导留言板</strong>语料训练Word2Vec模型</a></li>
<li><a href="https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/"><strong>代码 | 使用地方gov工作报告生成某类概念词频「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/"><strong>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/"><strong>数据代码| 使用cctv新闻联播文稿构造「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2028-12-18-how-to-extract-data-from-patent-application-dataset/"><strong>代码 | 使用3571w专利申请数据集构造「面板数据」</strong></a></li>
<li><a href="https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/"><strong>代码 | 使用「新闻数据」计算 「经济政策不确定性」指数</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用「新闻数据」构造概念词提及量「面板数据」</title>
      <link>https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/</link>
      <pubDate>Sun, 17 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/</guid>
      <description>使用新闻联播、经济日报、人民日报，生成某「个体」关于「某概念词频」的「面板数据」</description>
      <content:encoded><![CDATA[<h2 id="一任务">一、任务</h2>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">新闻数据集 | 含 人民日报/经济日报/光明日报 等120家媒体(2024.05)</a></p>
<br>
<p>利用 经济日报和人民日报 这两套数据集，可以生成面板数据，字段有</p>
<ul>
<li><strong>Object</strong> 提及的概念词(Object)，可以是某类概念词(创新/三农) 或 行为主体(省、市、公司法人）。</li>
<li><strong>Date</strong> 日期， 粒度可以是年(月、周、日)</li>
<li><strong>MentionTimes</strong>  在Date期间，提及概念词(Object)的新闻条数</li>
<li><strong>MentionRatio</strong>  在Date期间，提及概念词(Object)的新闻条数/总新闻条数</li>
</ul>
<p>今天利用该数据集， 生成 <code>省份、日期(周/天)、提及该省新闻次数、提及该省新闻占比</code> 面板数据。</p>
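<p>上面四个字段的构造思路，可以先用 <code>str.contains</code> 得到“是否提及”的布尔列，再按时间粒度分组汇总（示意代码，用几条构造的新闻样例演示 <code>MentionTimes</code> 与 <code>MentionRatio</code> 的计算，真实使用时替换为 人民日报/经济日报 的 dataframe）：</p>

```python
import pandas as pd

# 构造几条新闻样例(真实使用时替换为 人民日报/经济日报 dataframe)
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02']),
    'content': ['浙江省出台新政策', '全国经济形势分析', '浙江省企业调研', '浙江省港口吞吐量增长'],
})

obj = '浙江省'
hit = df['content'].fillna('').str.contains(obj)  # 每条新闻是否提及该省
g = hit.groupby(df['date'].dt.year)               # 这里以“年”为粒度

panel = (pd.DataFrame({'Object': obj,
                       'MentionTimes': g.sum(),    # 提及该省的新闻条数
                       'MentionRatio': g.mean()})  # 提及该省的新闻占比
           .rename_axis('Date')
           .reset_index())
print(panel)
```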
<p><img loading="lazy" src="img/06-panel.png" alt=""  />
</p>
<p><img loading="lazy" src="img/07-plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二数据操作">二、数据操作</h2>
<h3 id="21-原始数据格式">2.1 原始数据格式</h3>
<p>今天更新这两个数据集，新增 <strong>经济日报.csv.gz</strong> 和 <strong>人民日报.csv.gz</strong>。已购买该数据集的同学，可以加微信 37233539 获取这两个文件。</p>
<br>
<h3 id="22-读取数据">2.2 读取数据</h3>
<p>pandas 可以直接读取 <em><strong>经济日报.csv.gz</strong></em> 和 <em><strong>人民日报.csv.gz</strong></em> 压缩文件；当磁盘读写是瓶颈时，直接读取压缩文件往往比读取解压后的 .csv 更快。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">jjrb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;经济日报.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">rmrb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;人民日报.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>

<span class="n">jjrb_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">jjrb_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>
<span class="n">rmrb_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">rmrb_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>

<span class="n">rmrb_df</span>
</code></pre></div><p><img loading="lazy" src="img/01-rmrb_df.png" alt=""  />
</p>
<br>
<h3 id="23-记录存储形式">2.3 How Records Are Stored</h3>
<p>In both news datasets, any given day generally has multiple news records, and each record is stored as its own row.</p>
<p>Taking <strong>rmrb_df</strong> as an example, filtering for <strong>2013-06-08</strong> shows multiple records for that date.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># pick any date</span>
<span class="n">rmrb_df</span><span class="p">[</span><span class="n">rmrb_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;2013-06-08&#39;</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/03-filter-date.png" alt=""  />
</p>
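<p>To double-check this storage layout, you can count how many rows each day has. A minimal self-contained sketch (toy rows; the real dataset uses the same <code>date</code>/<code>content</code> columns):</p>

```python
import pandas as pd

# Toy frame mimicking the newspaper layout: one news item per row,
# and several rows may share the same date.
df = pd.DataFrame({
    'date': pd.to_datetime(['2013-06-08', '2013-06-08', '2013-06-09']),
    'content': ['news a', 'news b', 'news c'],
})

# Rows published on a single day
one_day = df[df['date'] == '2013-06-08']

# Number of records per day
daily_counts = df.groupby(df['date'].dt.date).size()
print(daily_counts)
```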
<p><br><br></p>
<h2 id="三生成面板数据">3. Building the Panel Data</h2>
<p>Since the 人民日报 and 经济日报 files share essentially the same format, the following uses 人民日报 as the example and builds, step by step, a panel of <code>province, date (year), number of news items mentioning the province, share of news items mentioning the province</code>, with the fields named <code>Object, Date, MentionTimes, MentionRatio</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">provs</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;浙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;新疆维吾尔自治区&#39;</span><span class="p">,</span> <span class="s1">&#39;上海市&#39;</span><span class="p">,</span> <span class="s1">&#39;四川省&#39;</span><span class="p">,</span> <span class="s1">&#39;重庆市&#39;</span><span class="p">,</span> <span class="s1">&#39;海南省&#39;</span><span class="p">,</span> <span class="s1">&#39;河北省&#39;</span><span class="p">,</span>
       <span class="s1">&#39;广西壮族自治区&#39;</span><span class="p">,</span> <span class="s1">&#39;云南省&#39;</span><span class="p">,</span> <span class="s1">&#39;黑龙江省&#39;</span><span class="p">,</span> <span class="s1">&#39;河南省&#39;</span><span class="p">,</span> <span class="s1">&#39;内蒙古自治区&#39;</span><span class="p">,</span> <span class="s1">&#39;北京市&#39;</span><span class="p">,</span> <span class="s1">&#39;宁夏回族自治区&#39;</span><span class="p">,</span> <span class="s1">&#39;甘肃省&#39;</span><span class="p">,</span>
       <span class="s1">&#39;安徽省&#39;</span><span class="p">,</span> <span class="s1">&#39;吉林省&#39;</span><span class="p">,</span> <span class="s1">&#39;陕西省&#39;</span><span class="p">,</span> <span class="s1">&#39;湖北省&#39;</span><span class="p">,</span> <span class="s1">&#39;青海省&#39;</span><span class="p">,</span> <span class="s1">&#39;江西省&#39;</span><span class="p">,</span> <span class="s1">&#39;天津市&#39;</span><span class="p">,</span> <span class="s1">&#39;山西省&#39;</span><span class="p">,</span> <span class="s1">&#39;广东省&#39;</span><span class="p">,</span>
       <span class="s1">&#39;贵州省&#39;</span><span class="p">,</span> <span class="s1">&#39;福建省&#39;</span><span class="p">,</span> <span class="s1">&#39;西藏自治区&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南省&#39;</span><span class="p">,</span> <span class="s1">&#39;江苏省&#39;</span><span class="p">,</span> <span class="s1">&#39;辽宁省&#39;</span><span class="p">]</span>


<span class="n">prov_date_counts</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">yearly_df</span> <span class="ow">in</span> <span class="n">rmrb_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;Y&#39;</span><span class="p">)):</span>
    <span class="k">for</span> <span class="n">prov</span> <span class="ow">in</span> <span class="n">provs</span><span class="p">:</span>
        <span class="n">mention_times</span> <span class="o">=</span> <span class="n">yearly_df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">prov</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
        <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;Date&#39;</span><span class="p">:</span> <span class="n">date</span><span class="p">,</span> 
                <span class="s1">&#39;Object&#39;</span><span class="p">:</span> <span class="n">prov</span><span class="p">,</span> 
                <span class="s1">&#39;MentionTimes&#39;</span><span class="p">:</span> <span class="n">mention_times</span><span class="p">,</span>
                <span class="s1">&#39;MentionRatio&#39;</span><span class="p">:</span> <span class="n">mention_times</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">yearly_df</span><span class="p">)</span>
               <span class="p">}</span>
        <span class="n">prov_date_counts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        
<span class="n">panel_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">prov_date_counts</span><span class="p">)</span>
<span class="n">panel_df</span>
</code></pre></div><p><img loading="lazy" src="img/04-year-panel.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">panel_df</span><span class="p">[</span><span class="n">panel_df</span><span class="p">[</span><span class="s1">&#39;Object&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;浙江省&#39;</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/05-filter-prov.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四封装代码">4. Packaged Code</h2>
<p>I have wrapped the code above into reusable functions. They support csv/xls/xlsx news files, the field names are configurable, and so is the period (year Y, month M, week W, day D, hour H).</p>
<br>
<h3 id="41-generate_panel_data">4.1 generate_panel_data</h3>
<p>generate_panel_data(file, objects, text_field=&#39;content&#39;, date_field=&#39;date&#39;, freq=&#39;W&#39;, encoding=&#39;utf-8&#39;)</p>
<ul>
<li><strong>file</strong> path to the data file: .csv, .csv.gz, .xlsx or .xls</li>
<li><strong>objects</strong> a list or a dict</li>
<li><strong>text_field</strong> name of the text column in the data file, default &#39;content&#39;</li>
<li><strong>date_field</strong> name of the date column in the data file, default &#39;date&#39;</li>
<li><strong>freq</strong> period of the panel dates: year Y, month M, week W, day D, hour H</li>
<li><strong>encoding</strong> file encoding, default utf-8; some csv files may need a different value</li>
</ul>
<p>Returns a <strong>DataFrame</strong> with the columns <strong>Date</strong>, <strong>Object</strong>, <strong>MentionTimes</strong> and <strong>MentionRatio</strong>.</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">generate_panel_data</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">objects</span><span class="p">,</span> <span class="n">text_field</span><span class="o">=</span><span class="s1">&#39;content&#39;</span><span class="p">,</span> <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;W&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    - file 数据文件路径， .csv 或 .csv.gzip、xlsx、xls
</span><span class="s2">    - objects 支持list和dict
</span><span class="s2">    - text_field 指定数据文件中「文本」字段名，默认为&#39;content&#39;
</span><span class="s2">    - date_field 指定数据文件中「日期」字段名，默认为&#39;date&#39;
</span><span class="s2">    - freq 生成面板数据日期的周期， 年Y、月M、周W、日D、时H
</span><span class="s2">    - encoding 数据文件编码格式， 默认utf-8编码， 可能有的csv文件需要调整该参数
</span><span class="s2">
</span><span class="s2">    返回DataFrame，DataFrame字段含Date、Object、MentionTimes、MentionRatio
</span><span class="s2">    &#34;&#34;&#34;</span>
    
    <span class="c1"># read the data file</span>
    <span class="k">if</span> <span class="s1">&#39;csv&#39;</span> <span class="ow">in</span> <span class="n">file</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="n">encoding</span><span class="p">)</span>
        <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
            <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="n">encoding</span><span class="p">)</span>
    <span class="k">elif</span> <span class="s1">&#39;.xlsx&#39;</span> <span class="ow">in</span> <span class="n">file</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="k">elif</span> <span class="s1">&#39;.xls&#39;</span> <span class="ow">in</span> <span class="n">file</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&#34;only csv, xlsx and xls files are supported&#34;</span><span class="p">)</span>

        
    <span class="c1"># convert the date column to datetime</span>
    <span class="n">df</span><span class="p">[</span><span class="n">date_field</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">date_field</span><span class="p">])</span>
    <span class="n">prov_date_counts</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># build the panel data</span>
    <span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">freq_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="n">date_field</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="n">freq</span><span class="p">)):</span>
        
        <span class="c1"># objects given as a list</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">objects</span><span class="p">,</span> <span class="nb">list</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">obj</span> <span class="ow">in</span> <span class="n">objects</span><span class="p">:</span>
                <span class="c1"># count news items mentioning obj</span>
                <span class="n">mention_times</span> <span class="o">=</span> <span class="n">freq_df</span><span class="p">[</span><span class="n">text_field</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
                <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;Date&#39;</span><span class="p">:</span> <span class="n">date</span><span class="p">,</span> 
                        <span class="s1">&#39;Object&#39;</span><span class="p">:</span> <span class="n">obj</span><span class="p">,</span> 
                        <span class="s1">&#39;MentionTimes&#39;</span><span class="p">:</span> <span class="n">mention_times</span><span class="p">,</span>
                        <span class="s1">&#39;MentionRatio&#39;</span><span class="p">:</span> <span class="n">mention_times</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">freq_df</span><span class="p">)}</span>
                <span class="n">prov_date_counts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
                
        <span class="c1"># objects given as a dict</span>
        <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">objects</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">words</span> <span class="ow">in</span> <span class="n">objects</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
                <span class="c1"># count news items containing any of the concept words; this equals the mention count of the object</span>
                <span class="n">mention_words_times</span> <span class="o">=</span> <span class="n">freq_df</span><span class="p">[</span><span class="n">text_field</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;|&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">))</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
                <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;Date&#39;</span><span class="p">:</span> <span class="n">date</span><span class="p">,</span> 
                        <span class="s1">&#39;Object&#39;</span><span class="p">:</span> <span class="n">key</span><span class="p">,</span> 
                        <span class="s1">&#39;MentionTimes&#39;</span><span class="p">:</span> <span class="n">mention_words_times</span><span class="p">,</span>
                        <span class="s1">&#39;MentionRatio&#39;</span><span class="p">:</span> <span class="n">mention_words_times</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">freq_df</span><span class="p">)}</span>
                <span class="n">prov_date_counts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
                
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">&#39;objects must be a list or a dict&#39;</span><span class="p">)</span>
    <span class="n">panel_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">prov_date_counts</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">panel_df</span>
</code></pre></div><br>
<h3 id="42-plot_figure">4.2 plot_figure</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plot_figure</span><span class="p">(</span><span class="n">panel_df</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">object_field</span><span class="o">=</span><span class="s1">&#39;Object&#39;</span><span class="p">,</span> <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;Date&#39;</span><span class="p">,</span> <span class="n">value_filed</span><span class="o">=</span><span class="s1">&#39;MentionRatio&#39;</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    panel_df:  面板数据
</span><span class="s2">    title:  折线图标题
</span><span class="s2">    date_field: panel_df中的日期字段
</span><span class="s2">    value_filed: panel_df中的要绘图的值的字段名
</span><span class="s2">    &#34;&#34;&#34;</span>
    <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
    <span class="kn">import</span> <span class="nn">matplotlib</span>
    <span class="kn">import</span> <span class="nn">scienceplots</span>
    <span class="kn">import</span> <span class="nn">platform</span>
    <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
    <span class="kn">import</span> <span class="nn">matplotlib_inline</span>
    <span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
    <span class="kn">import</span> <span class="nn">warnings</span>
    <span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
    <span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># detect the operating system</span>
    <span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
    <span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
    <span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># set the global font</span>
    <span class="n">panel_df</span><span class="p">[</span><span class="n">date_field</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">panel_df</span><span class="p">[</span><span class="n">date_field</span><span class="p">])</span>
    
    <span class="n">new_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">panel_df</span><span class="p">,</span> 
                            <span class="n">index</span><span class="o">=</span><span class="n">date_field</span><span class="p">,</span>
                            <span class="n">columns</span><span class="o">=</span><span class="n">object_field</span><span class="p">,</span>
                            <span class="n">values</span><span class="o">=</span><span class="n">value_filed</span><span class="p">)</span>
    <span class="n">ax</span> <span class="o">=</span> <span class="n">new_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
    <span class="c1"># add the legend at a custom anchor position</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper right&#39;</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.15</span><span class="p">,</span> <span class="mf">1.05</span><span class="p">))</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;新闻提及次数&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><br>
<h3 id="43-objects为列表">4.3 objects as a List</h3>
<p>Suppose we have a csv file with <code>date</code> and <code>content</code> columns, and we want a panel of how many news items mention each of four provinces, with a yearly period.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">provs2</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;山东省&#39;</span><span class="p">,</span> <span class="s1">&#39;河北省&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南省&#39;</span><span class="p">,</span> <span class="s1">&#39;黑龙江省&#39;</span><span class="p">]</span>
<span class="n">panel_df2</span> <span class="o">=</span> <span class="n">generate_panel_data</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;人民日报.csv.gz&#39;</span><span class="p">,</span> 
                                <span class="n">objects</span><span class="o">=</span><span class="n">provs2</span><span class="p">,</span> 
                                <span class="c1"># the text column in this csv is &#39;content&#39;</span>
                                <span class="n">text_field</span><span class="o">=</span><span class="s1">&#39;content&#39;</span><span class="p">,</span>  
                                 <span class="c1"># the date column in this csv is &#39;date&#39;</span>
                                <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> 
                                <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;Y&#39;</span><span class="p">,</span>  <span class="c1"># yearly</span>
                                <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
<span class="c1">#panel_df2.to_csv(&#39;人民日报新闻鲁冀湘黑四省(objects为列表)年度被提及占比.csv&#39;, index=False)</span>

<span class="n">panel_df2</span>
</code></pre></div><p><img loading="lazy" src="img/06-panel.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">plot_figure</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">panel_df2</span><span class="p">,</span> 
            <span class="n">title</span><span class="o">=</span><span class="s1">&#39;人民日报新闻鲁、冀、湘、黑四省年度被提及占比(1946-2023)&#39;</span><span class="p">,</span> 
            <span class="n">object_field</span><span class="o">=</span><span class="s1">&#39;Object&#39;</span><span class="p">,</span> 
            <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;Date&#39;</span><span class="p">,</span> 
            <span class="n">value_filed</span><span class="o">=</span><span class="s1">&#39;MentionRatio&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/07-plot.png" alt=""  />
</p>
<br>
<h3 id="44-objects为字典">4.4 objects as a Dict</h3>
<p>Suppose again a csv file with <code>date</code> and <code>content</code> columns; now we want a panel of how many news items mention each of three concept-word categories, with a yearly period.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># rough word lists; they are only meant to illustrate the idea</span>
<span class="n">provs3</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;经济发展&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;经济&#39;</span><span class="p">,</span> <span class="s1">&#39;发展&#39;</span><span class="p">,</span> <span class="s1">&#39;建设&#39;</span><span class="p">,</span> <span class="s1">&#39;经济发展&#39;</span><span class="p">],</span> 
          <span class="s1">&#39;环境保护&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;环境保护&#39;</span><span class="p">,</span> <span class="s1">&#39;保护环境&#39;</span><span class="p">,</span> <span class="s1">&#39;绿水青山&#39;</span><span class="p">],</span>
          <span class="s1">&#39;司法建设&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;法律&#39;</span><span class="p">,</span> <span class="s1">&#39;司法&#39;</span><span class="p">,</span> <span class="s1">&#39;司法建设&#39;</span><span class="p">],</span>
        <span class="p">}</span>


<span class="n">panel_df3</span> <span class="o">=</span> <span class="n">generate_panel_data</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s1">&#39;人民日报.csv.gz&#39;</span><span class="p">,</span> 
                                <span class="n">objects</span><span class="o">=</span><span class="n">provs3</span><span class="p">,</span> 
                                <span class="c1"># the text column in this csv is &#39;content&#39;</span>
                                <span class="n">text_field</span><span class="o">=</span><span class="s1">&#39;content&#39;</span><span class="p">,</span>  
                                 <span class="c1"># the date column in this csv is &#39;date&#39;</span>
                                <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> 
                                <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;Y&#39;</span><span class="p">,</span>  <span class="c1"># yearly</span>
                                <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
<span class="c1">#panel_df3.to_csv(&#39;人民日报新闻三概念词(objects为字典)年度被提及占比.csv&#39;, index=False)</span>
<span class="n">panel_df3</span>
</code></pre></div><p><img loading="lazy" src="img/08-panel.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">plot_figure</span><span class="p">(</span><span class="n">panel_df</span><span class="o">=</span><span class="n">panel_df3</span><span class="p">,</span> 
            <span class="n">title</span><span class="o">=</span><span class="s1">&#39;人民日报新闻经济、环境、司法三类概念词年度被提及占比(1946-2023)&#39;</span><span class="p">,</span> 
            <span class="n">object_field</span><span class="o">=</span><span class="s1">&#39;Object&#39;</span><span class="p">,</span> 
            <span class="n">date_field</span><span class="o">=</span><span class="s1">&#39;Date&#39;</span><span class="p">,</span> 
            <span class="n">value_filed</span><span class="o">=</span><span class="s1">&#39;MentionRatio&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/09-plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四获取数据集">5. Getting the Dataset</h2>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">News dataset | 120 media outlets including 人民日报/经济日报/光明日报 (updated through 2024.06)</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Add WeChat 372335839, with a note of your name, school and major
</code></pre></div><br>
<p>For more datasets, see <a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | Datasets available for social science (economics &amp; management) research</a></p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>Code | Measuring the Economic Policy Uncertainty (EPU) Index from News Data</title>
      <link>https://textdata.cn/blog/2023-12-20-measure-china-economic-policy-uncertainty/</link>
      <pubDate>Sun, 17 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-20-measure-china-economic-policy-uncertainty/</guid>
      <description>Compute an economic policy uncertainty index from 新闻联播, 经济日报 and 人民日报</description>
      <content:encoded><![CDATA[<h2 id="一经济政策不确定性指标">1. The Economic Policy Uncertainty Index</h2>
<p>Economic Policy Uncertainty (EPU) is a measure of the level of policy-related uncertainty in an economy. This post follows</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
</code></pre></div><br>
<h3 id="11-新闻数据库">1.1 News Database</h3>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">News dataset | 60+ media outlets including 人民日报/经济日报/光明日报 (2024.05.24)</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">人民日报rmrb:       1946-05-15 ~ 2024-05-24
光明日报gmrb:       1985-01-01 ~ 2024-05-24
人民政协报rmzxb:     2008-01-02 ~ 2024-05-24
经济日报jjrb:       2008-01-27 ~ 2024-05-24
中国青年报zqb:     2005-01-01 ~ 2024-05-24
南方周末nfzm:       2008-01-02 ~ 2023-05-31
</code></pre></div><br>
<h3 id="12-算法">1.2 Algorithm</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Step-1. Select 10 mainland Chinese newspapers, including papers from major cities such as Beijing, Shanghai, Guangzhou and Tianjin.
Step-2. For each newspaper, search for articles containing terms from all three keyword categories: economy, policy and uncertainty. The Chinese and English keyword lists are given in Table 1 of the paper.
Step-3. Scale each month's article count by the number of articles that satisfy the first keyword category.
Step-4. Standardize each time series so that its standard deviation over January 2000 to December 2011 equals 1, which makes the EPU series computed from different newspapers comparable.
Step-5. Take a simple average of the ten newspapers' monthly series and normalize the index so that its mean over January 2000 to December 2011 equals 100.
</code></pre></div><p>文献中对算法的描述较长、结构化不足，理解起来需要花些脑力。大邓换一种方式来描述：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">EPU_t = m / n

- m:  时期 t 内同时含有 经济Economic、政策Policy、不确定Uncertainty 三类词的新闻条数
- n:  时期 t 内的新闻总条数
</code></pre></div><p>本推文是利用一个媒体进行 <em><strong>EPU</strong></em> 指标的构建， 只需用到算法中的前 3 个步骤。</p>
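<p>上面的公式可以用一个极小的示例直观感受一下（下面的词表和新闻数据均为演示用的假设值，并非 ct.epu 的真实实现，实际计算请使用 cntext 内置词典）：</p>

```python
import pandas as pd

# 演示用的假设词表（实际请使用 cntext 内置的 zh_common_EPU.yaml）
e_pattern = '经济|金融'
p_pattern = '政策|政府|改革'
u_pattern = '不确定|波动|难以预测'

# 演示用的假设新闻数据：date、text 两个字段
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-11']),
    'text': ['经济政策走向不确定', '今日天气晴朗',
             '金融改革稳步推进', '政府出台新政策，市场波动加剧'],
})

# 同时命中 经济、政策、不确定 三类词的新闻，才计入分子 m
hit = (df['text'].str.contains(e_pattern)
       & df['text'].str.contains(p_pattern)
       & df['text'].str.contains(u_pattern))

# 按月聚合：m 为当月同时含三类词的新闻条数，n 为当月新闻总条数
g = df.assign(hit=hit).groupby(df['date'].dt.to_period('M'))['hit']
monthly = pd.DataFrame({'m': g.sum(), 'n': g.count()})
monthly['epu'] = monthly['m'] / monthly['n']
print(monthly)
```

示例中 2023-01 有 2 条新闻、1 条同时命中三类词，故该月 EPU 为 0.5。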
<p><br><br></p>
<h2 id="二准备工作">二、准备工作</h2>
<p>EPU 算法代码已封装到 cntext 中， 计算这个指数， 就变得容易多了。</p>
<h3 id="21-安装cntext">2.1 安装cntext</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install pdfdocx
pip3 install distinctiveness
pip3 install pandarallel
pip3 install cntext --upgrade
</code></pre></div><br>
<h3 id="22-查看内置词典">2.2 查看内置词典</h3>
<p>EPU词典已内置于 cntext 中</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">ct</span><span class="o">.</span><span class="n">get_dict_list</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2.1.7

[&#39;zh_common_NTUSD.yaml&#39;,
 &#39;zh_common_DUTIR.yaml&#39;,
 &#39;enzh_common_StopWords.yaml&#39;,
 &#39;en_valence_Concreteness.yaml&#39;,
 &#39;en_common_LoughranMcDonald.yaml&#39;,
 &#39;zh_common_FinanceSenti.yaml&#39;,
 &#39;zh_common_TsinghuaPraiseDegrade.yaml&#39;,
 &#39;zh_common_FEPU.yaml&#39;,    
 &#39;en_common_ANEW.yaml&#39;,
 &#39;en_common_NRC.yaml&#39;,
 &#39;zh_valence_ChineseEmoBank.yaml&#39;,
 &#39;zh_valence_SixSemanticDimensionDatabase.yaml&#39;,
 &#39;zh_common_FinacialFormalUnformal.yaml&#39;,
 &#39;zh_common_LoughranMcDonald.yaml&#39;,
 &#39;enzh_common_AdvConj.yaml&#39;,
 &#39;en_common_SentiWS.yaml&#39;,
 &#39;zh_common_Digitalization.yaml&#39;,
 &#39;en_common_LSD2015.yaml&#39;,
 &#39;zh_common_HowNet.yaml&#39;,
 &#39;zh_common_EPU.yaml&#39;]      #Huang, Yun, and Paul Luk（2020）
</code></pre></div><br>
<h3 id="23-导入词典">2.3 导入词典</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="n">EPU_infos</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_EPU.yaml&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">EPU_infos</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;Name&#39;: &#39;中文经济政策不确定性词典EPU&#39;, 

&#39;Desc&#39;: &#39;中文经济政策不确定性词典EPU, 含经济Economic、政策Policy、不确定性Uncertainty三个词表&#39;, &#39;Refer&#39;: &#39;Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367&#39;, 

&#39;Category&#39;: [&#39;经济&#39;, &#39;政策&#39;, &#39;不确定&#39;], 

&#39;Dictionary&#39;: 
   {
      &#39;经济&#39;: [&#39;经济&#39;, &#39;金融&#39;], 
      &#39;政策&#39;: [&#39;政策&#39;, &#39;制度&#39;, &#39;体制&#39;, &#39;战略&#39;, &#39;措施&#39;, &#39;规章&#39;, &#39;规例&#39;, &#39;条例&#39;, &#39;政治&#39;, &#39;执政&#39;, &#39;政府&#39;, &#39;政委&#39;, &#39;国务院&#39;, &#39;人大&#39;, &#39;人民代表大会&#39;, &#39;中央&#39;, &#39;国家主席&#39;, &#39;总书记&#39;, &#39;国家领导人&#39;, &#39;总理&#39;, &#39;改革&#39;, &#39;整改&#39;, &#39;整治&#39;, &#39;规管&#39;, &#39;监管&#39;, &#39;财政&#39;, &#39;税&#39;, &#39;人民银行&#39;, &#39;央行&#39;, &#39;赤字&#39;, &#39;利率&#39;], 
      &#39;不确定&#39;: [&#39;不确定&#39;, &#39;不明确&#39;, &#39;波动&#39;, &#39;震荡&#39;, &#39;动荡&#39;, &#39;不稳&#39;, &#39;未明&#39;, &#39;不明朗&#39;, &#39;不清晰&#39;, &#39;未清晰&#39;, &#39;难料&#39;, &#39;难以预料&#39;, &#39;难以预测&#39;, &#39;难以预计&#39;, &#39;难以估计&#39;, &#39;无法预料&#39;, &#39;无法预测&#39;, &#39;无法预计&#39;, &#39;无法估计&#39;, &#39;不可预料&#39;, &#39;不可预测&#39;, &#39;不可预计&#39;, &#39;不可估计&#39;]
   }
}
</code></pre></div><br>
<h3 id="24-ctepu">2.4 ct.epu</h3>
<p>cntext 内置函数</p>
<p><em><strong>ct.epu(df, freq=&lsquo;Y&rsquo;, e_pattern='', p_pattern='', u_pattern='')</strong></em></p>
<ul>
<li><em><strong>df</strong></em>  新闻DataFrame；  DataFrame必须含date和text两个字段；每行一条记录，含所有时期所有的新闻。</li>
<li><em><strong>freq</strong></em> 字符串；决定EPU的时间粒度， 年Y、月M、天D， 默认freq=&lsquo;Y&rsquo;</li>
<li><em><strong>e_pattern</strong></em>  字符串；经济类词典，用<code>|</code>间隔词语，形如 <strong>e_pattern = &lsquo;经济|金融&rsquo;</strong></li>
<li><em><strong>p_pattern</strong></em>  字符串；政策词典，用<code>|</code>间隔词语，形如 <strong>p_pattern = &lsquo;政策|治理|行政&rsquo;</strong></li>
<li><em><strong>u_pattern</strong></em> 字符串；不确定性词典，用<code>|</code>间隔词语，形如 <strong>u_pattern = &lsquo;风险|危机|难以预测&rsquo;</strong></li>
</ul>
<p>返回epu时间序列数据，格式为DataFrame</p>
<br>
<br>
<h2 id="三测量epu">三、测量EPU</h2>
<h3 id="31-读取数据">3.1 读取数据</h3>
<p>大邓的 <a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/"><strong>新闻数据集 | 含 人民日报/经济日报/光明日报 等 60+ 家媒体(2024.05.24)</strong></a>中的所有媒体， 均有csv格式， 内含 date 和 text 两个字段， csv中的每行是一条新闻。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">rmrb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;人民日报.csv.gzip&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">rmrb_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;text&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">rmrb_df</span> <span class="o">=</span> <span class="n">rmrb_df</span><span class="p">[[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;text&#39;</span><span class="p">]]</span>
<span class="n">rmrb_df</span>
</code></pre></div><p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<br>
<h3 id="32-批量运算">3.2 批量运算</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="n">rmrb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;人民日报.csv.gzip&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">rmrb_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;text&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">rmrb_df</span> <span class="o">=</span> <span class="n">rmrb_df</span><span class="p">[[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;text&#39;</span><span class="p">]]</span>

<span class="c1">#默认使用内置的zh_common_EPU.yaml，所以不设置参数e_pattern、p_pattern、u_pattern</span>
<span class="c1">#EPU的时间粒度是月度M</span>
<span class="n">rmrb_EPU_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">epu</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">rmrb_df</span><span class="p">,</span>
                <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">,</span>
                <span class="p">)</span>

<span class="n">rmrb_EPU_df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>


<span class="n">gmrb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;gmrb.csv.gzip&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>
<span class="n">gmrb_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;text&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">gmrb_df</span> <span class="o">=</span> <span class="n">gmrb_df</span><span class="p">[[</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="s1">&#39;text&#39;</span><span class="p">]]</span>

<span class="c1">#默认使用内置的zh_common_EPU.yaml，所以不设置参数e_pattern、p_pattern、u_pattern</span>
<span class="c1">#EPU的时间粒度是月度M</span>
<span class="n">gmrb_EPU_df</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">epu</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">gmrb_df</span><span class="p">,</span>
                <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">,</span>
                <span class="p">)</span>

<span class="n">gmrb_EPU_df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="33-注意">3.3 注意</h3>
<p>需要注意， 以上结果都是对单一媒体计算的，没有进行标准化和归一化。</p>
<p>因此，媒体1、媒体2 各自算出的 <em><strong>epu1</strong></em>、<em><strong>epu2</strong></em> 直接比较数值大小是没有意义的。 如果你有多个媒体，得到多个 <em><strong>epu1</strong></em>、<em><strong>epu2</strong></em>、<em><strong>epu3</strong></em>， 想计算 <em><strong>mean_epu</strong></em>， 那么记得先实现论文算法里的 <em><strong>Step-4</strong></em> 标准化， 再执行 <em><strong>Step-5</strong></em> 求均值。</p>
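<p>Step-4、Step-5 的实现思路大致如下（示例中的两列序列为随机模拟数据，基期取论文中的 2000-01 ~ 2011-12，仅作演示）：</p>

```python
import numpy as np
import pandas as pd

# 用随机数模拟两家媒体的月度 EPU 序列（实际应替换为 ct.epu 的计算结果）
rng = np.random.default_rng(0)
idx = pd.date_range('1998-01', '2015-12', freq='MS')
epu_df = pd.DataFrame({'epu1': rng.random(len(idx)),
                       'epu2': rng.random(len(idx))}, index=idx)

# Step-4 标准化：各媒体序列分别除以其基期(2000-01~2011-12)标准差，使基期标准差为 1
base_std = epu_df.loc['2000-01':'2011-12'].std()
std_df = epu_df / base_std

# Step-5 求均值并归一化：跨媒体简单平均，再缩放使基期均值为 100
mean_epu = std_df.mean(axis=1)
mean_epu = mean_epu / mean_epu.loc['2000-01':'2011-12'].mean() * 100
print(mean_epu.loc['2000-01':'2011-12'].mean())   # 基期均值 ≈ 100
```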
<p><br><br></p>
<h2 id="四可视化">四、可视化</h2>
<h3 id="41-dfplot">4.1 df.plot</h3>
<p>使用 df.plot 的前提是先将日期字段设置为 index。 满足下面形态的数据即可直接用 .plot 绘图</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">rmrb_EPU_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">rmrb_EPU_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s1">&#39;EPU Index </span><span class="se">\n</span><span class="s1">source: China Renmin Daily News&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-fig.png" alt=""  />
</p>
<br>
<h3 id="42-支持中文">4.2 支持中文</h3>
<p>支持中文的代码，无脑copy</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>
<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>



<span class="n">rmrb_EPU_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;经济政策不确定性EPU </span><span class="se">\n</span><span class="s1">source: 人民日报&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;EPU值&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/06-fig.png" alt=""  />
</p>
<br>
<h3 id="43-比较两个媒体的走势">4.3 比较两个媒体的走势</h3>
<p>两个新闻媒体覆盖的时间段不同，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">人民日报rmrb:       1946-05-15 ~ 2023-12-18
光明日报gmrb:       1985-01-01 ~ 2023-12-18
</code></pre></div><p>截取1985-01-01之后的数据，进行比较。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">rmrb_EPU_df2</span> <span class="o">=</span> <span class="n">rmrb_EPU_df</span><span class="p">[</span><span class="n">rmrb_EPU_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">&gt;</span><span class="s1">&#39;1985-01-01&#39;</span><span class="p">]</span>
<span class="n">gmrb_EPU_df2</span> <span class="o">=</span> <span class="n">gmrb_EPU_df</span><span class="p">[</span><span class="n">gmrb_EPU_df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span><span class="o">&gt;</span><span class="s1">&#39;1985-01-01&#39;</span><span class="p">]</span>


<span class="n">rmrb_EPU_df2</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;月度经济政策不确定性EPU </span><span class="se">\n</span><span class="s1">source: 人民日报&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;EPU值&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/07-fig.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gmrb_EPU_df2</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;月度经济政策不确定性EPU </span><span class="se">\n</span><span class="s1">source: 光明日报&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;EPU值&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/08-fig.png" alt=""  />
</p>
<p>光明日报数据中缺失了1989年的数据，所以图中有空档。但从两幅图可以看到，EPU 的走势大致一致。</p>
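<p>走势是否一致，也可以在按日期对齐后用相关系数做个量化检查（下例的列名 epu 为假设值，请以 ct.epu 实际返回的 DataFrame 字段为准；此处用少量示例数据代替真实序列）：</p>

```python
import pandas as pd

# 此处用少量示例数据代替 ct.epu 的真实输出；假设结果含 date 与 epu 两列
rmrb_EPU_df2 = pd.DataFrame({'date': ['1985-01-31', '1985-02-28', '1985-03-31'],
                             'epu':  [0.010, 0.015, 0.012]})
gmrb_EPU_df2 = pd.DataFrame({'date': ['1985-01-31', '1985-02-28', '1985-03-31'],
                             'epu':  [0.008, 0.014, 0.011]})

# 按 date 内连接对齐两个序列（inner join 会自动跳过光明日报缺失的月份）
merged = rmrb_EPU_df2.merge(gmrb_EPU_df2, on='date', suffixes=('_rmrb', '_gmrb'))
corr = merged['epu_rmrb'].corr(merged['epu_gmrb'])
print(f'两媒体月度EPU相关系数: {corr:.3f}')
```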
<p>作为事后诸葛的大邓， 从人民日报和光明日报计算出的EPU可以看到， 23年不应该投资，应该保守点。</p>
<p>嗯嗯， 同时作为投资小白，人群中的反向指标人，今年本人收益率-20%，大家开心不~</p>
<p><br><br></p>
<h2 id="五相关内容">五、相关内容</h2>
<p>用到以上操作的代码，通过本文以及这4个推文，巩固 pandas 操作知识点。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-12-17-how-to-generate-panel-data-from-gov-report-dataset/">代码 | 使用 <strong>地方gov工作报告</strong> 生成某类概念词频「<strong>面板数据</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/">代码 | 使用「<strong>新闻数据</strong>」构造概念词提及量「<strong>面板数据</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/">数据代码| 使用 <strong>cctv新闻联播文稿</strong> 构造「<strong>面板数据</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2028-12-18-how-to-extract-data-from-patent-application-dataset/">代码 | 使用 <strong>3571w专利申请数据</strong> 构造「<strong>面板数据</strong>」</a></li>
<li><a href="https://textdata.cn/blog/2024-04-25-firm-economic-policy-uncertainty/">代码 | 使用 <strong>MD&amp;A文本</strong> 测量「<strong>企业不确定性感知FEPU指标</strong>」</a></li>
</ul>
<br>
<p>相关文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[2]Caldara, Dario, Matteo Iacoviello, Patrick Molligo, Andrea Prestipino, and Andrea Raffo. &#34;The economic effects of trade policy uncertainty.&#34; Journal of Monetary Economics 109 (2020): 38-59.
</code></pre></div><p><br><br></p>
<h2 id="六获取资料">六、获取资料</h2>
<ul>
<li>
<p>免费领取<a href="rmrb_epu.csv">rmrb_epu.csv</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset/">按需购买 <strong>新闻数据集 | 含 人民日报/经济日报/光明日报 等 120 家媒体(2025.06)</strong></a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/datasets_available_for_management_science/">LIST | 可供社科(经管)领域使用的数据集汇总</a></li>
<li><a href="https://textdata.cn/blog/the_text_analysis_list_about_ms/">LIST | 社科(经管)数据挖掘文献资料汇总</a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">推荐 | 文本分析库cntext2.x使用手册</a></li>
<li><a href="https://textdata.cn/blog/management_python_course/">付费视频课 | Python实证指标构建与文本分析</a>
<br>
<br></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据(付费) | 使用cctv新闻联播文稿构造面板数据</title>
      <link>https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/</link>
      <pubDate>Sat, 16 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-02-26-cctv1-xwlb-news-text-dataset/</guid>
<description>cctv新闻联播文稿数据集，可使用Python对其进行挖掘，借助文本挖掘技术研究宏观经济政策、社会学、传播学等领域。</description>
      <content:encoded><![CDATA[<h2 id="一新闻联播">一、新闻联播</h2>
<h3 id="11-数据集概况">1.1 数据集概况</h3>
<p>全网最全的数据集， 记录缺失率最低的<strong>xwlb数据集</strong>，  <strong>新</strong>(fan)<strong>闻</strong>(rong)<strong>联</strong>(chang)<strong>播</strong>(sheng) 。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源: 央视网https://tv.cctv.com/lm/xwlb/ 
覆盖日期: 2006-09-01 ~ 2023-12-15
日记录数: 6315天
字段: date、 text
</code></pre></div><br>
<h3 id="12-研究用途">1.2 研究用途</h3>
<p>可从中提取丰富的指标，包括但不限于经济政策不确定性指数EPU、媒体关注度、媒体情绪、文本相似度。此外，还可训练词向量、开发新的概念词典、构建新的指标指数。数据带有日期字段，参照前述指标，按主体、日期、指标进行计算即可构造面板数据，因此在经济学、管理学、新闻传播学、公共管理等领域均有较高的研究价值。</p>
<p>相关参考文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1]洪永淼,刘俸奇,薛涧坡.政府与市场心理因素的经济影响及其测度[J].管理世界,2023,39(03):30-51.
[2]刘景江,郑畅然,洪永淼.机器学习如何赋能管理学研究？——国内外前沿综述和未来展望[J].管理世界,2023,39(09):191-216.
[3]张一帆,林建浩,樊嘉诚.新闻文本大数据与消费增速实时预测——基于叙事经济学的视角[J].金融研究,2023,(05):152-169.
[4]Huang, Yun, and Paul Luk. &#34;Measuring economic policy uncertainty in China.&#34; China Economic Review 59 (2020): 101367
[5]欧阳资生,陈世丽,杨希特,刘凤根,周学伟.经济政策不确定性、网络舆情与金融机构系统性风险[J].管理科学学报,2023,26(04):62-86.
[6]逯东,宋昕倍.媒体报道、上市公司年报可读性与融资约束[J].管理科学学报,2021,24(12):45-61.
[7]彭涛,黄福广,孙凌霞.经济政策不确定性与风险承担:基于风险投资的证据[J].管理科学学报,2021,24(03):98-114.
[8]庞锐.采纳与内化：多重制度压力如何影响河长制创新扩散——基于省级政府的定向配对事件史分析[J].公共管理学报,2023,20(02):25-37+165-166.
</code></pre></div><br>
<h3 id="13-获取数据">1.3 获取数据</h3>
<p>【新闻联播xwlb】按年度，每年50元。 全量购买200元。</p>
<p><strong>加微信 372335839， 备注「姓名-学校-专业」</strong>。</p>
<p>更多新闻类数据  <a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset">数据集 | 人民日报/经济日报/光明日报 等 7 家新闻类文本数据集</a></p>
<p><br><br></p>
<h2 id="二数据检查">二、数据检查</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#6315天</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;cctv_xwlb.csv&#39;</span><span class="p">)</span>

<span class="c1">#变更日期格式，可进行日期计算</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">6315
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="22-日期涵盖">2.2 日期涵盖</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">#执行过 df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
#才能进行日期计算

print(df[&#39;date&#39;].min().date())
print(df[&#39;date&#39;].max().date())
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2006-09-01
2023-12-15
</code></pre></div><br>
<h3 id="33-缺失率">2.3 缺失率</h3>
<p>查看是否存在某些日期对应的文本为空</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isna</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0
</code></pre></div><br>
<p>生成 2006-09-01 ~ 2023-12-15 之间所有的日期 date_list， 查看 date_list 中哪些日期不在数据集中，以判断是否遗漏某些日期。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="nn">dt</span>  <span class="c1">#import datetime, timedelta  </span>
  
<span class="n">start_date</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2006</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>  
<span class="n">end_date</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">15</span><span class="p">)</span>  
<span class="n">delta</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  
  
<span class="n">date_list</span> <span class="o">=</span> <span class="p">[]</span>  
<span class="n">current_date</span> <span class="o">=</span> <span class="n">start_date</span>  
<span class="k">while</span> <span class="n">current_date</span> <span class="o">&lt;=</span> <span class="n">end_date</span><span class="p">:</span>  
    <span class="n">date_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">current_date</span><span class="p">)</span>  
    <span class="n">current_date</span> <span class="o">+=</span> <span class="n">delta</span>  
  
<span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">date_list</span><span class="p">)</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">date_list</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1.0
</code></pre></div><p>2006-09-01~2023-12-15之间所有的日期， 均存在于新闻数据集中，也就是说数据集没有遗漏这期间任何一天的新闻。</p>
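<p>上面手写的 while 循环, 也可以直接用 pandas 内置的 <code>pd.date_range</code> 一行生成完整日期序列。下面是一个自包含的小示例(日期与数据均为演示用假设, 并非原新闻数据集):</p>

```python
import pandas as pd

# 完整日期序列(本例用一周演示; 原文为 2006-09-01 ~ 2023-12-15)
full_dates = pd.date_range('2023-01-01', '2023-01-07', freq='D')

# 模拟新闻数据集中出现过的日期(故意缺少 2023-01-03)
df = pd.DataFrame({'date': pd.to_datetime(
    ['2023-01-01', '2023-01-02', '2023-01-04',
     '2023-01-05', '2023-01-06', '2023-01-07'])})

# 覆盖率: 完整序列中有多少天出现在数据集中
coverage = full_dates.isin(df['date']).mean()
print(coverage)  # 6/7, 即约 0.857
```

<p>覆盖率为 1.0 时, 说明这段期间没有遗漏任何一天。</p>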
<p><br><br></p>
<h2 id="三实验">三、实验</h2>
<p>按月份(也可调整为周、年)计算一下正负面情绪词在新闻中出现次数， 然后转化为情感分值， 绘制成折线图。</p>
<ol>
<li>导入词典</li>
<li>设计算法, 如统计新闻总词数、正面词数、负面词数。</li>
<li>转化为情感分值</li>
<li>按月份汇总</li>
<li>绘制折线图</li>
</ol>
<h3 id="31-导入词典">3.1 导入词典</h3>
<p>使用 cntext 内置的中文经济金融场景情感词典。该词典比较适合新闻联播(xwlb)这类题材, 先查看一下词典的基本信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext --upgrade</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">diction</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_FinanceSenti.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;pos词数&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">diction</span><span class="p">[</span><span class="s1">&#39;pos&#39;</span><span class="p">]))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;neg词数&#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">diction</span><span class="p">[</span><span class="s1">&#39;neg&#39;</span><span class="p">]))</span>


<span class="c1">#词典整理自论文， 大家也可自行整理</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_FinanceSenti.yaml&#39;</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pos词数 3338
neg词数 5890


{&#39;Refer&#39;: &#39;Fuwei Jiang, Joshua Lee, Xiumin Martin, and Guofu Zhou.“Manager Sentiment and Stock Returns” Journal of Financial Economics 132(1), 2019,126-149&#39;,
 
 &#39;Desc&#39;: &#39;Chinese Financial Sentiment Dictionary&#39;,
 &#39;Category&#39;: [&#39;pos&#39;, &#39;neg&#39;],
 
 &#39;Name&#39;: &#39;Chinese Financial Sentiment Dictionary&#39;,
 
 &#39;Dictionary&#39;: {&#39;pos&#39;: [&#39;安定&#39;, &#39;安康&#39;, &#39;帮助&#39;, &#39;榜样&#39;, &#39;饱满&#39;, ...  &#39;最合适&#39;, &#39;最小&#39;, &#39;最新进展&#39;, &#39;最早&#39;, &#39;遵法&#39;],
  
                &#39;neg&#39;: [&#39;败坏名声&#39;, &#39;被没收的&#39;, &#39;变节&#39;, &#39;不便&#39;, &#39;不适当&#39;, &#39;妨碍&#39;,  &#39;腐败&#39;,...&#39;唉声叹气&#39;, &#39;哀怨&#39;, &#39;哀叹&#39;, &#39;哀伤&#39;, &#39;哀悼&#39;]
}
</code></pre></div><br>
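<p>词典法的核心逻辑可以用一个极简示例说明: 数一数句子里正、负面词各出现几次。下面的词表与句子均为演示用假设, 并非 cntext 内置词典:</p>

```python
# 极简词典法示例: 统计一句话中正、负面词出现次数
pos_words = ['发展', '增长', '稳定']
neg_words = ['下降', '风险']

text = '经济稳定增长, 但仍需防范风险'

pos_num = sum(text.count(w) for w in pos_words)
neg_num = sum(text.count(w) for w in neg_words)
print(pos_num, neg_num)  # 2 1
```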
<h3 id="32-统计词频">3.2 统计词频</h3>
<p>这里先用 jieba 分词统计每篇新闻的总词数, 再把词典拼接成正则表达式, 统计正、负面词的出现次数。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>  
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>

<span class="n">diction</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">read_yaml_dict</span><span class="p">(</span><span class="s1">&#39;zh_common_FinanceSenti.yaml&#39;</span><span class="p">)[</span><span class="s1">&#39;Dictionary&#39;</span><span class="p">]</span>
<span class="n">pos_pattern</span> <span class="o">=</span> <span class="s1">&#39;|&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">diction</span><span class="p">[</span><span class="s1">&#39;pos&#39;</span><span class="p">])</span>
<span class="n">neg_pattern</span> <span class="o">=</span> <span class="s1">&#39;|&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">diction</span><span class="p">[</span><span class="s1">&#39;neg&#39;</span><span class="p">])</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;word_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">text</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>

<span class="c1">#正面词数</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;pos_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">pos_pattern</span><span class="p">)</span>

<span class="c1">#负面词数</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;neg_num&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">neg_pattern</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
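<p>需要注意, 用 <code>'|'</code> 直接拼接词表构造正则时, 若词语中含有正则元字符(如 <code>+</code>、<code>(</code> ), 模式会出错或误匹配; 稳妥的做法是先对每个词做 <code>re.escape</code>。下面是一个示意, 词表为演示用假设:</p>

```python
import re
import pandas as pd

# 故意包含正则元字符的演示词表
words = ['利好+', '增长', '稳(定)']

# 先转义每个词, 再用 | 拼成"任一词"的正则
pattern = '|'.join(re.escape(w) for w in words)

s = pd.Series(['利好+消息不断', '经济增长', '稳(定)向好'])
counts = s.str.count(pattern)
print(counts.tolist())  # [1, 1, 1]
```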
<br>
<h3 id="33-计算情感值">3.3 计算情感值</h3>
<p>使用 <code>score = (pos-neg)/(pos+neg)</code> 计算情感分值, 可以将数值范围调整到 <code>-1 ~ 1</code> 之间。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pos_num&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;neg_num&#39;</span><span class="p">])</span><span class="o">/</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;pos_num&#39;</span><span class="p">]</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;neg_num&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;最小值&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;均值&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;中位数&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">median</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;最大&#39;</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">最小值 -0.36633663366336633
均值 0.5448464974146746
中位数 0.5657256687535572
最大 1.0
</code></pre></div><br>
<h3 id="34-按月份">3.4 按月份汇总</h3>
<p>这里用到 df.groupby 方法, 配合 pd.Grouper(key='date', freq='M') 可以按月份把数据切分成一组子 dataframe。</p>
<p>这组子 dataframe 可以通过 for 循环逐个迭代, 分别计算对应月份的信息。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">month_datas</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">date</span><span class="p">,</span> <span class="n">month_df</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">&#39;date&#39;</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="n">data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;date&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">date</span>
    
    <span class="n">data</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">month_df</span><span class="p">[</span><span class="s1">&#39;senti_score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">month_datas</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    
<span class="n">month_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">month_datas</span><span class="p">)</span>
<span class="n">month_info_df</span>
</code></pre></div><p><img loading="lazy" src="img/df4.png" alt=""  />
</p>
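<p>上面的 groupby + 循环也可以写得更简洁: 先把 date 设为索引, 再用 resample 按月聚合取均值。下面用几条演示数据示意(数据为假设, 非原数据集):</p>

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10']),
    'senti_score': [0.2, 0.4, 0.6],
})

# 与 groupby(pd.Grouper(key='date', freq='M')) 取均值等价
month_info_df = (df.set_index('date')['senti_score']
                   .resample('M').mean()
                   .reset_index())
print(month_info_df)
```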
<br>
<h3 id="35-绘制月情感折线图">3.5 绘制月情感折线图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import matplotlib.pyplot as plt
import matplotlib
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats(&#39;png&#39;, &#39;svg&#39;)
import scienceplots
import platform
import pandas as pd
import numpy as np


plt.style.use([&#39;science&#39;, &#39;no-latex&#39;, &#39;cjk-sc-font&#39;])
system = platform.system()  # 获取操作系统类型

if system == &#39;Windows&#39;:
    font = {&#39;family&#39;: &#39;SimHei&#39;}
elif system == &#39;Darwin&#39;:
    font = {&#39;family&#39;: &#39;Arial Unicode MS&#39;}
else:
    font = {&#39;family&#39;: &#39;sans-serif&#39;}
matplotlib.rc(&#39;font&#39;, **font)  # 设置全局字体


plt.figure(figsize=(12, 5))
plt.plot(month_info_df[&#39;date&#39;], month_info_df[&#39;senti_score&#39;])
plt.title(&#39;XWLB月度情感值折线图(2006-2023)&#39;)
plt.show()
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四相关内容">四、相关内容</h2>
<ul>
<li>
<p><a href="https://textdata.cn/blog/2023-12-27-measure-gov-digitalization/">代码 | 使用gov工作报告生成数字化词频「面板数据」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-12-18-how-to-generate-panel-data-from-daily-news-dataset/">代码 | 使用「新闻数据」构造概念词提及量「面板数据」</a></p>
</li>
<li>
<p><a href="https://textdata.cn/blog/2023-02-26-cctv1-xwlb-news-text-dataset/">数据(付费) | 使用cctv新闻联播文稿构造面板数据</a></p>
</li>
</ul>
<p><br><br></p>
<h2 id="五获取数据">五、获取数据</h2>
<p>【新闻联播xwlb】按年度，每年50元。 全量购买100元。</p>
<p><strong>加微信 372335839， 备注「姓名-学校-专业」</strong>。</p>
<p>更多新闻类数据  <a href="https://textdata.cn/blog/2023-12-14-daily-news-dataset">数据集 | 人民日报/经济日报/光明日报 等 7 家新闻类文本数据集</a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 上市公司 208 万条专利数据集 (1991-2022)</title>
      <link>https://textdata.cn/blog/2023-12-07-patent-application-dataset-of-listed-company-in-china-a-market/</link>
      <pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-07-patent-application-dataset-of-listed-company-in-china-a-market/</guid>
      <description>上市公司专利申请数据集</description>
      <content:encoded><![CDATA[<h2 id="一上市公司专利数据集">一、上市公司专利数据集</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">上市公司数:  4393
专利记录数:  2083784
专利申请日:  1991-01-30 ~ 2022-12-31
原始来源:   国家知识产权局
</code></pre></div><br>
<h3 id="声明">声明</h3>
<p>科研用途；如有问题， 请加微信372335839，备注「姓名-学校-专业」</p>
<br>
<br>
<h2 id="二数据探索">二、数据探索</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#df = pd.read_csv(&#39;上市公司-专利明细数据1991-2022.csv&#39;)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;上市公司-专利明细数据1991-2022.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">)</span>

<span class="c1">#剔除重复的</span>
<span class="n">df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<h3 id="22-上市公司数--记录数">2.2 上市公司数 &amp; 记录数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;上市公司数: </span><span class="si">{</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;专利申请数: </span><span class="si">{</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> <span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">上市公司数: 4393
专利申请数: 2083784
</code></pre></div><br>
<h3 id="23-字段缺失率">2.3 字段缺失率</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;字段缺失率统计&#39;</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">isna</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
    <span class="c1">#print(f&#34;{col}: {ratio}%&#34;)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">col</span><span class="si">:</span><span class="s2">&lt;</span><span class="si">{</span><span class="mi">10</span><span class="si">}}</span><span class="s2">: </span><span class="si">{</span><span class="n">ratio</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">字段缺失率统计

股票代码      : 0.0%
原始企业名称    : 0.0%
专利申请主体    : 0.0%
专利名称      : 0.0%
发明人       : 0.0%
地址        : 0.04%
专利类型      : 0.04%
专利申请号     : 0.04%
申请公布号     : 58.61%
授权公布号     : 41.43%
专利申请日     : 0.0%
公开公告日     : 58.61%
授权公告日     : 41.43%
专利申请年份    : 0.0%
原始来源      : 0.0%
统计截至日期    : 0.0%
更新时间      : 0.0%
</code></pre></div><br>
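<p>除了逐列循环, pandas 还有一行式写法: <code>df.isna().mean()</code> 直接返回各列缺失比例。下面用一个演示用小 dataframe 示意:</p>

```python
import numpy as np
import pandas as pd

# 演示数据: 股票代码无缺失, 申请公布号缺失一半
df = pd.DataFrame({
    '股票代码': ['000001', '000002', '000003', '000004'],
    '申请公布号': ['CN1A', np.nan, np.nan, 'CN2A'],
})

# 每列缺失比例(0~1), 乘 100 即百分比
missing_ratio = (df.isna().mean() * 100).round(2)
print(missing_ratio)
```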
<h3 id="24-记录的日期范围">2.4 记录的日期范围</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请日&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请日&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;公开公告日&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公开公告日&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;授权公告日&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;授权公告日&#39;</span><span class="p">],</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;coerce&#39;</span><span class="p">)</span>


<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;专利申请日范围: </span><span class="si">{start}</span><span class="s2"> ~ </span><span class="si">{end}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())[:</span><span class="mi">10</span><span class="p">],</span>
                                           <span class="n">end</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())[:</span><span class="mi">10</span><span class="p">]))</span>
      
      
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;公开公告日范围: </span><span class="si">{start}</span><span class="s2"> ~ </span><span class="si">{end}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公开公告日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())[:</span><span class="mi">10</span><span class="p">],</span>
                                            <span class="n">end</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公开公告日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())[:</span><span class="mi">10</span><span class="p">]))</span>
      
      
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;授权公布日范围: </span><span class="si">{start}</span><span class="s2"> ~ </span><span class="si">{end}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;授权公告日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())[:</span><span class="mi">10</span><span class="p">],</span>
                                            <span class="n">end</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;授权公告日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())[:</span><span class="mi">10</span><span class="p">]))</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">专利申请日范围: 1991-01-30 ~ 2022-12-31
公开公告日范围: 1994-08-31 ~ 2023-08-25
授权公布日范围: 1993-12-01 ~ 2023-08-25
</code></pre></div><p>三个日期字段中, <em><strong>专利申请日</strong></em> 缺失率为 0, 而 <em><strong>公开公告日</strong></em>、<em><strong>授权公告日</strong></em> 缺失率分别高达 58.61%、41.43%。因此刻画数据集涵盖的日期范围时, 使用 <em><strong>专利申请日</strong></em> 更合适一些。</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>


<span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请日&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;上市公司专利数量(1991 ~ 2022)&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;专利数量&#39;</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/plot.png" alt=""  />
</p>
<br>
<h3 id="25-多个申请主体">2.5 多个申请主体</h3>
<p>申请主体可以是一个或多个(人或组织)。只要 <em><strong>专利申请主体</strong></em> 字段中出现了 <code>;</code> , 就表示该专利有多个申请主体。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1">#专利申请人主体可以是单个人(组织)，也可以是多人(组织)</span>
<span class="n">df</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请主体&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">),</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">)][</span><span class="s1">&#39;专利申请主体&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">4          浙江南都电源动力股份有限公司; 杭州南都能源科技有限公司; 杭州南都电池有限公司
8                         中国海洋石油总公司;  中海油能源发展股份有限公司
9                       格力电器(武汉)有限公司;  珠海格力电器股份有限公司
10                        广东美的制冷设备有限公司;  美的集团股份有限公司
13             中国石油化工股份有限公司;  中国石油化工股份有限公司石油化工科学研究院
                             ...                   
2085560                 新疆大全新能源股份有限公司; 内蒙古大全新能源有限公司
2085562         大族激光科技产业集团股份有限公司; 深圳市大族鼎盛智能装备科技有限公司
2085572     中国石油化工股份有限公司;  中国石油化工股份有限公司胜利油田分公司物探研究院
2085573                    广东工业大学;  中船海洋与防务装备股份有限公司
2085574            平高集团有限公司;  河南平高电气股份有限公司;  国家电网公司
Name: 专利申请主体, Length: 516473, dtype: object
</code></pre></div><br>
<p>申请主体超过10个的记录，为了展示方便，这里只显示 <code>['股票代码', '专利申请主体', '专利名称', '专利申请日']</code>这四个字段。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[df[&#39;专利申请主体&#39;].str.count(&#39;;&#39;)&gt;9][[&#39;股票代码&#39;, &#39;专利申请主体&#39;, &#39;专利名称&#39;, &#39;专利申请日&#39;]]
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<p><strong>申请主体数</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[&#39;专利申请主体&#39;].str.count(&#39;;&#39;)+1
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0          1.0
1          1.0
2          1.0
3          1.0
4          3.0
          ... 
2085572    2.0
2085573    2.0
2085574    3.0
2085575    1.0
2085576    1.0
Name: 专利申请主体, Length: 2083784, dtype: float64
</code></pre></div><br>
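<p>统计主体数也可以用 <code>str.split</code> 实现: 按分号切分后数元素个数, 与 <code>str.count(';')+1</code> 等价。示意(数据为演示用假设):</p>

```python
import pandas as pd

s = pd.Series(['甲公司', '甲公司; 乙公司', '甲公司; 乙公司; 丙公司'])

# 按分号切分后统计每行的主体个数
n_applicants = s.str.split(';').str.len()
print(n_applicants.tolist())  # [1, 2, 3]
```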
<p>申请主体数的汇总</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请主体&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">专利申请主体
1.0     1567311
2.0      428833
3.0       67820
4.0       13130
5.0        4364
6.0        1894
7.0         282
8.0          59
10.0         27
9.0          23
11.0         14
16.0          9
12.0          7
19.0          4
13.0          2
14.0          2
Name: count, dtype: int64
</code></pre></div><br>
<p>均值和方差</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mainbody_mean = (df[&#39;专利申请主体&#39;].str.count(&#39;;&#39;)+1).mean()
mainbody_std = (df[&#39;专利申请主体&#39;].str.count(&#39;;&#39;)+1).std()

print(&#39;申请主体数均值:&#39;, mainbody_mean)
print(&#39;申请主体数标准差:&#39;,mainbody_std)
</code></pre></div><br>
<p>中学学过的正态分布里, 约 68% 的数据会落在均值正负一个标准差范围内。申请主体数是右偏的计数数据, 并不服从正态分布, 但仍可以看看落在 <strong>均值加减一个标准差</strong> 范围内的记录占总体的比例。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mask1</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请主体&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">&gt;</span> <span class="p">(</span><span class="n">mainbody_mean</span><span class="o">-</span><span class="n">mainbody_std</span><span class="p">)</span>
<span class="n">mask2</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;专利申请主体&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">mainbody_mean</span><span class="o">+</span><span class="n">mainbody_std</span><span class="p">)</span>

<span class="c1">#落在 均值加减一个标准差范围内的数据占比75%</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">mask1</span> <span class="o">&amp;</span> <span class="n">mask2</span><span class="p">])</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0.7521465756527548
</code></pre></div><p><br><br></p>
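「均值加减一个标准差」的覆盖比例也可以在合成数据上演示（下面是一个虚构的右偏分布，仅作示意，与真实专利数据无关）：

```python
import numpy as np

# 虚构的右偏「主体数」分布：大多数记录只有 1~2 个主体，少量极端值
counts = np.array([1] * 80 + [2] * 15 + [3] * 4 + [8] * 1)

mean, std = counts.mean(), counts.std()
# 落在 均值±1个标准差 内的比例
share = ((counts > mean - std) & (counts < mean + std)).mean()
print(f'{share:.2f}')  # 0.95
```

分布右偏时，该比例通常会偏离正态分布的 68%，这与正文中 75% 的结果是一致的现象。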
<h2 id="三相关文献">三、相关文献</h2>
<p>使用专利数据的相关文献</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[1] Bellstam, Gustaf, Sanjai Bhagat, and J. Anthony Cookson. &#34;A text-based analysis of corporate innovation.&#34; Management Science 67, no. 7 (2021): 4004-4031.
[2] Arts, Sam, Bruno Cassiman, and Jianan Hou. &#34;Position and Differentiation of Firms in Technology Space.&#34; Management Science (2023).
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 2.49亿条中国工商注册企业信息(23.9更新)</title>
      <link>https://textdata.cn/blog/2023-12-03-china-mainland-corporate-registration-information/</link>
      <pubDate>Sun, 03 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-12-03-china-mainland-corporate-registration-information/</guid>
<description>341个地市， 2.49亿条工商注册信息， 网盘压缩文件体积17.6G</description>
      <content:encoded><![CDATA[<h2 id="一工商数据集">一、工商数据集</h2>
<h3 id="11-概况">1.1 概况</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据来源: 国家企业信用信息公示系统
记录条数: 2.49亿条
文件体积: 160G(解压后)
涵盖日期: 1949.10.1~2023.9.19

数据集已脱敏处理， 没有手机号、邮箱等联系信息，无商业营销价值。
科研用途，仅供展示。如有问题，加微信372335839，备注「姓名-学校-专业」
</code></pre></div><p><img loading="lazy" src="img/dataset-screen.png" alt=""  />
</p>
<br>
<h3 id="12-字段">1.2 字段</h3>
<p>每个csv文件均包含以下字段</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 企业名称
- 英文名称
- 统一社会信用代码
- 企业类型
- 经营状态
- 成立日期
- 核准日期
- 法定代表人
- 注册资本
- 实缴资本
- 参保人数
- 公司规模
- 经营范围
- 注册地址
- 营业期限
- 纳税人识别号
- 工商注册号
- 组织机构代码
- 纳税人资质
- 曾用名
- 所属省份
- 所属城市
- 所属区县
- 网站链接
- 所属行业
- 登记机关
- 经度
- 纬度
</code></pre></div><br>
<h3 id="13--查看文件">1.3  查看文件</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>

<span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"> [
 &#39;北京.csv.gz&#39;,
 &#39;上海.csv.gz&#39;,
 &#39;南京.csv.gz&#39;,
 ...
 &#39;重庆.csv.gz&#39;,
  ]
</code></pre></div><p><br><br></p>
<h2 id="二实验代码">二、实验代码</h2>
<h3 id="21-读取数据">2.1 读取数据</h3>
<p>在不考虑电脑内存容量限制的前提下，直接读取 石家庄、长沙、杭州 三市的数据。如果电脑内存很小，请先阅读  <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/">推荐 | 如何处理远超电脑内存的csv文件</a></p>
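内存有限时，也可以用 pandas 的 chunksize 参数分块读取 csv.gz。下面用临时生成的小文件演示这一用法（文件名与字段均为演示假设）：

```python
import gzip
import os
import pandas as pd

# 生成一个演示用的小 csv.gz（字段与内容均为虚构）
with gzip.open('demo.csv.gz', 'wt', encoding='utf-8') as f:
    f.write('企业名称,成立日期\n')
    for i in range(10):
        f.write(f'企业{i},2020-01-0{i % 9 + 1}\n')

# 分块读取：每次只载入 chunksize 行，避免一次性占满内存
total = 0
for chunk in pd.read_csv('demo.csv.gz', chunksize=4):
    total += len(chunk)

print(total)  # 10
os.remove('demo.csv.gz')
```

pandas 会根据 .gz 扩展名自动推断压缩格式，无需显式传入 compression 参数。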
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">sjz_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;石家庄.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">cs_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;长沙.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">hz_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;杭州.csv.gz&#39;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s1">&#39;gzip&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="c1">#随机显示2条记录</span>
<span class="n">sjz_df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="22-记录数">2.2 记录数</h3>
<p>石家庄.csv 企业记录数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">sjz_df</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2010163
</code></pre></div><br>
<h3 id="23-所含字段">2.3 所含字段</h3>
<p>含有的字段有</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">sjz_df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    Index([&#39;企业组织机构代码&#39;, &#39;企业名称&#39;, &#39;注册资本&#39;, &#39;实缴资本&#39;, &#39;纳税人识别号&#39;, &#39;法定代表人&#39;, &#39;企业状态&#39;, &#39;所属行业&#39;,
           &#39;企业名称&#39;, &#39;英文名称&#39;, &#39;统一社会信用代码&#39;, &#39;企业类型&#39;, &#39;经营状态&#39;, &#39;成立日期&#39;, &#39;核准日期&#39;, &#39;法定代表人&#39;,
           &#39;注册资本&#39;, &#39;实缴资本&#39;, &#39;参保人数&#39;, &#39;公司规模&#39;, &#39;经营范围&#39;, &#39;注册地址&#39;, &#39;营业期限&#39;, &#39;纳税人识别号&#39;, &#39;工商注册号&#39;, &#39;组织机构代码&#39;, &#39;联系电话&#39;, &#39;邮箱&#39;, &#39;纳税人资质&#39;, &#39;曾用名&#39;, &#39;所属省份&#39;, &#39;所属城市&#39;, &#39;所属区县&#39;, &#39;网站链接&#39;, &#39;所属行业&#39;, &#39;登记机关&#39;, &#39;经度&#39;, &#39;纬度&#39;],
          dtype=&#39;object&#39;)
</code></pre></div><br>
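上面的输出中出现了重复列名（如「企业名称」出现两次）。pandas 里常用的去重写法如下（示例数据为虚构）：

```python
import pandas as pd

# 虚构一个带重复列名的 DataFrame
df = pd.DataFrame([[1, 2, 3]], columns=['企业名称', '注册资本', '企业名称'])

# columns.duplicated() 标记重复出现的列，取反后只保留首次出现的列
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['企业名称', '注册资本']
```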
<h3 id="24-日期转换">2.4 日期转换</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">])</span>

<span class="c1">#石家庄数据集日期范围</span>
<span class="nb">print</span><span class="p">(</span><span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1917-01-30 00:00:00
2023-09-19 00:00:00
</code></pre></div><br>
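真实数据里成立日期字段常混有无法解析的脏值，可以用 errors=&#39;coerce&#39; 把它们转为 NaT 后再取极值（下面的脏值「暂无」为虚构示例）：

```python
import pandas as pd

s = pd.Series(['1917-01-30', '2023-09-19', '暂无'])  # '暂无' 是虚构的脏值
dates = pd.to_datetime(s, errors='coerce')           # 解析失败的值变为 NaT

# min/max 会自动跳过 NaT
print(dates.min(), dates.max(), int(dates.isna().sum()))
```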
<p>查看成立日期为1917-01-30的信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">datetime</span>

<span class="n">sjz_df</span><span class="p">[</span><span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="mi">1917</span><span class="p">,</span> <span class="n">month</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">day</span><span class="o">=</span><span class="mi">30</span><span class="p">)]</span><span class="o">.</span><span class="n">to_dict</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;企业组织机构代码&#39;: {913555: &#39;81130000MC0611518K&#39;},
 &#39;企业名称&#39;: {913555: &#39;中国铁路工会石家庄站委员会&#39;},
 &#39;注册资本&#39;: {913555: &#39;276.5万元人民币&#39;},
 &#39;实缴资本&#39;: {913555: &#39;-&#39;},
 &#39;纳税人识别号&#39;: {913555: &#39;81130000MC0611518K&#39;},
 &#39;法定代表人&#39;: {913555: &#39;韩海峰&#39;},
 &#39;企业状态&#39;: {913555: &#39;暂无&#39;},
 &#39;所属行业&#39;: {913555: &#39;公共管理、社会保障和社会组织&#39;},
 &#39;统一社会信用代码&#39;: {913555: &#39;81130000MC0611518K&#39;},
 &#39;工商注册号&#39;: {913555: nan},
 &#39;组织机构代码&#39;: {913555: &#39;-&#39;},
 &#39;登记机关&#39;: {913555: &#39;河北省总工会&#39;},
 &#39;成立日期&#39;: {913555: Timestamp(&#39;1917-01-30 00:00:00&#39;)},
 &#39;核准日期&#39;: {913555: &#39;1949-10-01&#39;},
 &#39;企业类型&#39;: {913555: &#39;-&#39;},
 &#39;经营期限&#39;: {913555: &#39;2019-04-01 至 2022-02-09&#39;},
 &#39;注册所在地&#39;: {913555: nan},
 &#39;地区编码&#39;: {913555: &#39;130105&#39;},
 &#39;详细地址&#39;: {913555: &#39;石家庄市新华区大桥路2号&#39;},
 &#39;经营范围&#39;: {913555: &#39;-&#39;},
 &#39;参保人数&#39;: {913555: 478.0},
 &#39;企业电话&#39;: {913555: nan},
 &#39;企业座机&#39;: {913555: nan},
 &#39;企业邮箱&#39;: {913555: nan}}
</code></pre></div><p><br><br></p>
<h2 id="三可视化">三、可视化</h2>
<p>绘制一个1992-2022年的注册量折线图</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">years</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1992</span><span class="p">,</span> <span class="mi">2023</span><span class="p">)]</span>

<span class="c1"># 成立日期 已在 2.4 节转为 datetime 类型，需用 .dt.year 取年份</span>
<span class="n">sjz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="n">years</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;石家庄&#39;</span><span class="p">)</span>
<span class="n">cs_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="n">years</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;长沙&#39;</span><span class="p">)</span>
<span class="n">hz_df</span><span class="p">[</span><span class="s1">&#39;成立日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="n">years</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s1">&#39;杭州&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;工商企业注册量1992-2022年&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;年份&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;注册量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper right&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>    
</code></pre></div><p><img loading="lazy" src="img/output_8_0.png" alt="svg"  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>使用 Ruptures 识别时间序列数据中的变化点</title>
      <link>https://textdata.cn/blog/2023-11-26-using-ruptures-to-detect-change-point/</link>
      <pubDate>Sun, 26 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-26-using-ruptures-to-detect-change-point/</guid>
      <description>&lt;p&gt;&lt;strong&gt;时间序列数据&lt;/strong&gt; 在各个领域中都占据着重要地位，从金融市场到生产制造，都需要对时间序列数据进行分析和监测。其中一个关键任务是识别时间序列数据中的变化点，这些变化点可能代表了重要的事件或趋势转折点。例如之前分享过 &lt;a href=&#34;https://textdata.cn/blog/2023-01-10-similarity_of_cental_bank_monetary_policy/&#34;&gt;金融研究 | 央行货币政策文本相似度计算与可视化&lt;/a&gt;, 仅仅构造了相似度时序数据， 但如果要让程序自动识别政策变化的时间点， 还需要本文分享的内容。&lt;/p&gt;
&lt;p&gt;为了解决这一问题，&lt;strong&gt;Ruptures 库是一个非常强大的工具，它提供了多种算法，可用于检测时间序列数据的变化点&lt;/strong&gt;。本文将介绍如何使用 Ruptures 库来解决时间序列数据分析中的变化点检测问题。 &lt;a href=&#34;code.ipynb&#34;&gt;&lt;strong&gt;点击下载本文代码&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一问题场景&#34;&gt;一、问题场景&lt;/h2&gt;
&lt;p&gt;在各种应用场景中，需要识别时间序列数据中的变化点，例如：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;金融市场&lt;/strong&gt;：检测股票价格中的趋势转折点，以指导投资决策。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;生产制造&lt;/strong&gt;：监测生产线上的设备状态变化，及时发现问题并采取措施维护。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;气象数据&lt;/strong&gt;：发现天气数据中的异常变化，如风暴的到来或气温剧烈波动。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;网络流量&lt;/strong&gt;：检测网络流量中的异常行为，可能是网络攻击的迹象。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;在这些场景下，Ruptures 库可以帮助我们识别变化点，从而更好地理解时间序列数据的特点。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二ruptures-库介绍&#34;&gt;二、Ruptures 库介绍&lt;/h2&gt;
&lt;p&gt;Ruptures 库是一个用于信号分割和变化点检测的 Python 库，它提供了多种算法和工具，可用于处理不同类型的时间序列数据。&lt;/p&gt;
&lt;p&gt;以下是 Ruptures 库的一些关键特点：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;多种算法支持&lt;/strong&gt;：Ruptures 提供了多种变化点检测算法，包括 Pelt、Binary Segmentation、Window-based Methods 等，适用于不同类型的时间序列数据和问题。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;简单易用&lt;/strong&gt;：库的 API 设计简洁，容易上手，用户可以轻松地进行变化点检测任务。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;高性能&lt;/strong&gt;：Ruptures 经过优化，能够处理大规模的时间序列数据集，同时具有较低的计算复杂度。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三常用算法&#34;&gt;三、常用算法&lt;/h2&gt;
&lt;p&gt;下面是 Ruptures 库中常用的一些变化点检测算法：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pelt (Pruned Exact Linear Time)&lt;/strong&gt;：Pelt算法是一种基于动态规划的算法，适用于多个变化点的检测任务。它的优点在于其精确性和高效性，通常能够找到全局最优的变化点位置。Pelt算法通过将时间序列数据划分为多个分段，使得每个分段内的变化点数目最小化，从而找到最优的分段方式。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binary Segmentation (BS)&lt;/strong&gt;：Binary Segmentation算法是一种简单而有效的分割方法，通过迭代地将时间序列数据分为两个部分来检测变化点。该算法的计算复杂度较低，适用于中等规模的数据集。主要缺点是可能会导致分段的粒度过粗。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window-based Methods&lt;/strong&gt;：这些方法使用滑动窗口的方式来检测时间序列数据中的变化点。窗口会在时间序列上滑动，对窗口内的数据进行分析，然后根据某种准则来确定窗口内是否存在变化点。优点是简单易懂，但需要调整窗口大小和准则参数。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bottom-Up Methods&lt;/strong&gt;：Bottom-Up方法从小的分段开始，逐渐合并以检测变化点。它从最小的分段（每个数据点都是一个分段）开始，然后合并相邻的分段，直到满足某种准则为止。优点在于能够处理多个变化点，但计算复杂度较高。&lt;/li&gt;
&lt;/ol&gt;
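以 Binary Segmentation 为例，其核心一步是寻找使左右两段平方误差之和最小的分割点。下面是一个纯 Python 的最小示意实现（仅为说明原理，并非 ruptures 的真实源码）：

```python
def best_split(x):
    """返回使左右两段平方误差之和最小的分割位置（最小示意实现）"""
    def sse(seg):
        m = sum(seg) / len(seg)          # 段内均值
        return sum((v - m) ** 2 for v in seg)  # 段内平方误差
    costs = {k: sse(x[:k]) + sse(x[k:]) for k in range(1, len(x))}
    return min(costs, key=costs.get)

# 两段均值明显不同的序列，真实变化点在索引 5
signal = [0.0, 0.1, -0.1, 0.05, 0.0, 5.0, 5.1, 4.9, 5.05, 5.0]
print(best_split(signal))  # 5
```

Binary Segmentation 在找到一个分割点后，会在左右两段上递归地重复这一步骤，直到满足停止准则。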
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四实验&#34;&gt;四、实验&lt;/h2&gt;
&lt;h3 id=&#34;41-导入包&#34;&gt;4.1 导入包&lt;/h3&gt;
&lt;p&gt;导入本文需要的包， 使得matplotlib支持中文，绘制高清图；&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import matplotlib.pyplot as plt
import ruptures as rpt
import matplotlib

#绘制高清图
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats(&amp;#39;png&amp;#39;, &amp;#39;svg&amp;#39;)

#支持中文
import platform
system = platform.system()  # 获取操作系统类型
if system == &amp;#39;Windows&amp;#39;:
    font = {&amp;#39;family&amp;#39;: &amp;#39;SimHei&amp;#39;}
elif system == &amp;#39;Darwin&amp;#39;:
    font = {&amp;#39;family&amp;#39;: &amp;#39;Arial Unicode MS&amp;#39;}
else:
    font = {&amp;#39;family&amp;#39;: &amp;#39;sans-serif&amp;#39;}
matplotlib.rc(&amp;#39;font&amp;#39;, **font)  # 设置全局字体
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;42-生成实验数据&#34;&gt;4.2 生成实验数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# 生成示例时间序列数据
n_samples, dim, sigma = 1000, 1, 1
n_bkps = 4  # 假设有4个变化点
signal, bkps = rpt.pw_constant(n_samples, dim, n_bkps, noise_std=sigma)
print(signal.shape)
print(bkps)
signal
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;(1000, 1)
[198, 415, 608, 807, 1000]

array([[-10.36078315],
       [-10.20386008],
       [ -9.97983878],
       [-10.53406566],
       ...
       [-11.43256337],
       [-10.61377906],
       [-10.56300421],
       [-10.83854557],
       [-10.21754732]])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;ruptures 为我们生成的实验数据 signal 是一个长度为1000的 array 型数据，生成的变化点 bkps 的位置序列为 [198, 415, 608, 807, 1000]。数据不够直观，我们可视化一下&lt;/p&gt;
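pw_constant 生成的本质上是「分段常数均值 + 高斯噪声」的信号，其思路可以用 numpy 粗略复现（段均值的取法、随机种子等均为演示假设）：

```python
import numpy as np

rng = np.random.default_rng(0)
bkps = [198, 415, 608, 807, 1000]          # 变化点位置，末位为序列长度
levels = rng.normal(0, 5, size=len(bkps))  # 每段的常数均值

start, pieces = 0, []
for level, end in zip(levels, bkps):
    # 段内为常数均值加标准差为 1 的高斯噪声
    pieces.append(level + rng.normal(0, 1, size=end - start))
    start = end

signal = np.concatenate(pieces)
print(signal.shape)  # (1000,)
```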
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 创建时间序列图&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;signal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lw&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间序列数据&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#保存&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ts-data.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 显示&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/ts-data.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;从上图可以清楚地看到，ruptures 为我们生成了1000个点，大致有4个变化点，将数据分成了五部分。现在我们使用 ruptures 识别这些变化点&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-识别变化点&#34;&gt;4.3 识别变化点&lt;/h3&gt;
&lt;p&gt;Pelt算法是Ruptures库中的一种高效而准确的变化点检测算法，它的全称是Pruned Exact Linear Time（修剪的线性时间精确算法）。它的性能取决于成本函数的选择和 &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt;参数的调整，&lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; 参数的全称是 &lt;em&gt;&lt;strong&gt;Penalty&lt;/strong&gt;&lt;/em&gt;，它代表了在检测到变化点时的成本或惩罚值。这里将 pen 设置为 10&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 使用 Ruptures 库进行变化点检测&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;algo&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rpt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Pelt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;rbf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;signal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;algo&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;predict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pen&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;[200, 415, 610, 805, 1000]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
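检测结果与真实变化点非常接近。可以写一个小函数按容差逐点比对（容差阈值 tol=5 为演示假设）：

```python
def close_enough(truth, pred, tol=5):
    """检查每个真实变化点在 tol 范围内是否都有对应的检测点"""
    return all(min(abs(t - p) for p in pred) <= tol for t in truth)

bkps   = [198, 415, 608, 807, 1000]  # 文中生成数据的真实变化点
result = [200, 415, 610, 805, 1000]  # 文中 Pelt 的检测输出
print(close_enough(bkps, result))  # True
```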
&lt;h3 id=&#34;431-matplotlib可视化&#34;&gt;4.3.1 matplotlib可视化&lt;/h3&gt;
&lt;p&gt;现在比较真实变化点 bkps 和预测出来的变化点 result；为了更直观，进行可视化&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ruptures&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;rpt&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 创建时间序列图&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;12&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 绘制时间序列数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;signal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lw&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间序列数据&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;blue&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 绘制实际变化点位置&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bkp&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bkps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;axvline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bkp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linestyle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;实际变化点&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 绘制检测到的变化点位置&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bkp&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;axvline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bkp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;green&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linestyle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;检测到的变化点&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;变化点检测示例&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;时间步长&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;数值&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# build a de-duplicated legend&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;handles&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gca&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_legend_handles_labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;unique_labels&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;list&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;set&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# drop duplicate labels&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;unique_handles&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;handles&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;labels&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;unique_labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# matching legend handles&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unique_handles&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;unique_labels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# save the figure&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;change-point2.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/change-point2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;432-ruptures自带可视化&#34;&gt;4.3.2 Built-in visualization in ruptures&lt;/h4&gt;
&lt;p&gt;The matplotlib code above is verbose; the built-in plotting in ruptures is more concise.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# plot the result&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# signal, true change points, predicted change points&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;rpt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;display&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;signal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bkps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;变化点检测示例&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# save the figure&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;change-point.png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;200&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# show&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/change-point.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;44-关于pen&#34;&gt;4.4 About pen&lt;/h3&gt;
&lt;p&gt;More specifically, larger values of &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; make the algorithm detect fewer change points, while smaller values make it detect more.&lt;/p&gt;
&lt;p&gt;You can usually tune the &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; value to your own data and problem. Some common cases and suggestions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;To detect fewer change points and capture only the major trend reversals, choose a larger &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; value.&lt;/li&gt;
&lt;li&gt;To detect more change points and capture subtle variation in the data, choose a smaller &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; value.&lt;/li&gt;
&lt;li&gt;If you are unsure which value to pick, try several and choose the most suitable one based on the quality of the results and your actual needs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In practice, tuning the &lt;em&gt;&lt;strong&gt;pen&lt;/strong&gt;&lt;/em&gt; parameter takes some experimentation and experience, because the best value depends on your data and analysis goals. Try different values and pick the most suitable one based on the detection results and domain knowledge.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p><strong>Time series data</strong> plays an important role in many fields: from financial markets to manufacturing, such data must be analyzed and monitored. A key task is identifying change points in a time series, since these points may mark important events or trend reversals. For example, the earlier post <a href="https://textdata.cn/blog/2023-01-10-similarity_of_cental_bank_monetary_policy/">Financial Research | Computing and visualizing the similarity of central bank monetary policy texts</a> only constructed a similarity time series; to have a program automatically identify when the policy changed, we also need the techniques shared today.</p>
<p>To tackle this problem, <strong>the Ruptures library is a very powerful tool that provides multiple algorithms for detecting change points in time series data</strong>. This article shows how to use Ruptures for change point detection. <a href="code.ipynb"><strong>Click to download the code for this article</strong></a></p>
<p><br><br></p>
<h2 id="一问题场景">1. Problem Scenarios</h2>
<p>Many applications require identifying change points in time series data, for example:</p>
<ol>
<li><strong>Financial markets</strong>: detect trend reversals in stock prices to guide investment decisions.</li>
<li><strong>Manufacturing</strong>: monitor changes in equipment status on the production line to catch problems early and take corrective action.</li>
<li><strong>Meteorology</strong>: spot abnormal changes in weather data, such as an approaching storm or sharp temperature swings.</li>
<li><strong>Network traffic</strong>: detect anomalous behavior in network traffic, which may be a sign of an attack.</li>
</ol>
<p>In these scenarios, the Ruptures library helps us identify change points and thus better understand the characteristics of the time series.</p>
<p><br><br></p>
<h2 id="二ruptures-库介绍">2. Introduction to the Ruptures Library</h2>
<p>Ruptures is a Python library for signal segmentation and change point detection. It provides a variety of algorithms and tools for handling different types of time series data.</p>
<p>Some of its key features:</p>
<ul>
<li><strong>Multiple algorithms</strong>: Ruptures implements several change point detection algorithms, including Pelt, Binary Segmentation, and window-based methods, suited to different kinds of time series data and problems.</li>
<li><strong>Easy to use</strong>: the API is simple and approachable, so users can carry out change point detection tasks with little effort.</li>
<li><strong>High performance</strong>: Ruptures is optimized to handle large time series datasets with relatively low computational complexity.</li>
</ul>
<p><br><br></p>
<h2 id="三常用算法">3. Common Algorithms</h2>
<p>Commonly used change point detection algorithms in Ruptures include:</p>
<ol>
<li><strong>Pelt (Pruned Exact Linear Time)</strong>: a dynamic-programming algorithm suited to detecting multiple change points. Its strengths are accuracy and efficiency: it typically finds the globally optimal change point locations by partitioning the series so that the total within-segment cost, plus a penalty per change point, is minimized.</li>
<li><strong>Binary Segmentation (BS)</strong>: a simple yet effective segmentation method that detects change points by iteratively splitting the series into two parts. Its computational complexity is low, making it suitable for medium-sized datasets; its main drawback is that the resulting segmentation can be too coarse.</li>
<li><strong>Window-based methods</strong>: these slide a window along the time series, analyze the data inside it, and decide by some criterion whether the window contains a change point. They are easy to understand, but the window size and the criterion parameters must be tuned.</li>
<li><strong>Bottom-up methods</strong>: these start from the finest segmentation (every data point is its own segment) and repeatedly merge adjacent segments until a stopping criterion is met. They handle multiple change points well, but their computational cost is higher.</li>
</ol>
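<p>To make the Binary Segmentation idea above concrete, here is a minimal pure-Python sketch. It is an illustrative toy, not the ruptures implementation: it recursively picks the split that most reduces the within-segment squared error and stops when the gain falls below a threshold (the helper names and threshold values are invented for the example).</p>

```python
def sse(xs):
    # sum of squared deviations from the segment mean
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def binseg(xs, start, end, min_gain=5.0, min_size=2):
    """Recursively split xs[start:end] where the squared error drops the most."""
    total = sse(xs[start:end])
    best_gain, best_t = 0.0, None
    for t in range(start + min_size, end - min_size + 1):
        gain = total - sse(xs[start:t]) - sse(xs[t:end])
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None or best_gain < min_gain:
        return []  # no split improves the fit enough
    return binseg(xs, start, best_t) + [best_t] + binseg(xs, best_t, end)

# a step signal: 20 points at level 0, then 20 points at level 5
signal = [0.0] * 20 + [5.0] * 20
print(binseg(signal, 0, len(signal)))  # → [20]
```

<p>Pelt solves the same kind of objective exactly via penalized dynamic programming, whereas Binary Segmentation trades optimality for speed by committing greedily to each split.</p>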
<p><br><br></p>
<h2 id="四实验">4. Experiments</h2>
<h3 id="41-导入包">4.1 Imports</h3>
<p>Import the packages used in this article, enable Chinese font support in matplotlib, and render high-resolution figures.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import matplotlib.pyplot as plt
import ruptures as rpt
import matplotlib

#render high-resolution figures
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats(&#39;png&#39;, &#39;svg&#39;)

#Chinese font support
import platform
system = platform.system()  # detect the operating system
if system == &#39;Windows&#39;:
    font = {&#39;family&#39;: &#39;SimHei&#39;}
elif system == &#39;Darwin&#39;:
    font = {&#39;family&#39;: &#39;Arial Unicode MS&#39;}
else:
    font = {&#39;family&#39;: &#39;sans-serif&#39;}
matplotlib.rc(&#39;font&#39;, **font)  # set the global font
</code></pre></div><br>
<h3 id="42-生成实验数据">4.2 Generating Sample Data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># generate a sample time series
n_samples, dim, sigma = 1000, 1, 1
n_bkps = 4  # assume 4 change points
signal, bkps = rpt.pw_constant(n_samples, dim, n_bkps, noise_std=sigma)
print(signal.shape)
print(bkps)
signal
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">(1000, 1)
[198, 415, 608, 807, 1000]

array([[-10.36078315],
       [-10.20386008],
       [ -9.97983878],
       [-10.53406566],
       ...
       [-11.43256337],
       [-10.61377906],
       [-10.56300421],
       [-10.83854557],
       [-10.21754732]])
</code></pre></div><p>The sample data ruptures generated for us, signal, is an array of length 1000, and bkps holds the breakpoint positions [198, 415, 608, 807, 1000]. The raw numbers are not very intuitive, so let us visualize them.</p>
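<p>As a rough illustration of what the generator does, pw_constant draws a piecewise-constant signal and adds Gaussian noise. The toy below mimics that behavior so the structure of signal and bkps is easy to see; note that it uses evenly spaced breakpoints and fixed jump sizes, whereas ruptures draws breakpoint positions and levels at random (the function and parameter names here are invented for the sketch).</p>

```python
import random

def toy_pw_constant(n_samples, n_bkps, noise_std, seed=42):
    """Piecewise-constant signal + Gaussian noise (toy stand-in for rpt.pw_constant)."""
    rng = random.Random(seed)
    # evenly spaced breakpoints; the last one is always n_samples, as in ruptures
    bkps = [round((i + 1) * n_samples / (n_bkps + 1)) for i in range(n_bkps + 1)]
    signal, start, level = [], 0, 0.0
    for end in bkps:
        signal += [level + rng.gauss(0, noise_std) for _ in range(end - start)]
        start, level = end, level + rng.choice([-5.0, 5.0])  # jump at each breakpoint
    return signal, bkps

toy_signal, toy_bkps = toy_pw_constant(1000, 4, 1.0)
print(len(toy_signal), toy_bkps)  # → 1000 [200, 400, 600, 800, 1000]
```

<p>The real generator returns a NumPy array of shape (n_samples, dim) with randomized breakpoints, which is why the article's bkps are [198, 415, 608, 807, 1000] rather than evenly spaced.</p>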
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># create the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;时间序列数据&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>

<span class="c1"># save the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;ts-data.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>

<span class="c1"># show</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/ts-data.png" alt=""  />
</p>
<p>The figure clearly shows the 1000 generated points, with roughly four change points dividing the data into five parts. Now let us have ruptures identify the change points.</p>
<br>
<h3 id="43-识别变化点">4.3 Detecting Change Points</h3>
<p>Pelt, short for Pruned Exact Linear Time, is an efficient and accurate change point detection algorithm in the Ruptures library. Its performance depends on the choice of cost function and on tuning the <em><strong>pen</strong></em> parameter. <em><strong>pen</strong></em> stands for <em><strong>Penalty</strong></em>: the cost incurred each time a change point is detected. Here we set pen to 10.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># change point detection with the Ruptures library</span>
<span class="n">algo</span> <span class="o">=</span> <span class="n">rpt</span><span class="o">.</span><span class="n">Pelt</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;rbf&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">signal</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">algo</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">pen</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">result</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[200, 415, 610, 805, 1000]
</code></pre></div><br>
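<p>The detected breakpoints [200, 415, 610, 805, 1000] sit close to the true ones [198, 415, 608, 807, 1000]. One way to quantify the match is margin-based precision and recall (ruptures ships such metrics in its ruptures.metrics module); the plain-Python sketch below illustrates the idea, with the margin value chosen arbitrarily for the example.</p>

```python
def precision_recall(true_bkps, pred_bkps, margin=10):
    """Count a predicted change point as correct if a true one lies within ±margin.
    The trailing index (the series length) is excluded from both lists."""
    true_set, pred_set = true_bkps[:-1], pred_bkps[:-1]
    hits = [p for p in pred_set if any(abs(p - t) <= margin for t in true_set)]
    precision = len(hits) / len(pred_set)
    recall = sum(
        1 for t in true_set if any(abs(p - t) <= margin for p in pred_set)
    ) / len(true_set)
    return precision, recall

# true vs. detected breakpoints from the run above
print(precision_recall([198, 415, 608, 807, 1000], [200, 415, 610, 805, 1000]))
# → (1.0, 1.0)
```

<p>Every detected point lies within 10 steps of a true change point, so both precision and recall are 1.0 here.</p>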
<h3 id="431-matplotlib可视化">4.3.1 Visualization with matplotlib</h3>
<p>Let us now compare the true change points, bkps, with the detected change points, result. A plot makes the comparison intuitive.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">ruptures</span> <span class="k">as</span> <span class="nn">rpt</span>


<span class="c1"># create the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># plot the time series data</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;时间序列数据&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;blue&#39;</span><span class="p">)</span>

<span class="c1"># mark the true change points</span>
<span class="k">for</span> <span class="n">bkp</span> <span class="ow">in</span> <span class="n">bkps</span><span class="p">:</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">bkp</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;red&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;--&#39;</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;实际变化点&#39;</span><span class="p">)</span>

<span class="c1"># mark the detected change points</span>
<span class="k">for</span> <span class="n">bkp</span> <span class="ow">in</span> <span class="n">result</span><span class="p">:</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">bkp</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;green&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;--&#39;</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;检测到的变化点&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">&#34;变化点检测示例&#34;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s2">&#34;时间步长&#34;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s2">&#34;数值&#34;</span><span class="p">)</span>

<span class="c1"># build a de-duplicated legend</span>
<span class="n">handles</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">get_legend_handles_labels</span><span class="p">()</span>
<span class="n">unique_labels</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span>  <span class="c1"># drop duplicate labels</span>
<span class="n">unique_handles</span> <span class="o">=</span> <span class="p">[</span><span class="n">handles</span><span class="p">[</span><span class="n">labels</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">label</span><span class="p">)]</span> <span class="k">for</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">unique_labels</span><span class="p">]</span>  <span class="c1"># matching legend handles</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">unique_handles</span><span class="p">,</span> <span class="n">unique_labels</span><span class="p">)</span>

<span class="c1"># save the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;change-point2.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>

</code></pre></div><p><img loading="lazy" src="img/change-point2.png" alt=""  />
</p>
<br>
<h4 id="432-ruptures自带可视化">4.3.2 Built-in visualization in ruptures</h4>
<p>The matplotlib code above is verbose; the built-in plotting in ruptures is more concise.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># plot the result</span>
<span class="c1"># signal, true change points, predicted change points</span>
<span class="n">rpt</span><span class="o">.</span><span class="n">display</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span> <span class="n">bkps</span><span class="p">,</span> <span class="n">result</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">&#34;变化点检测示例&#34;</span><span class="p">)</span>

<span class="c1"># save the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;change-point.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>

<span class="c1"># show</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>

</code></pre></div><p><img loading="lazy" src="img/change-point.png" alt=""  />
</p>
<br>
<h3 id="44-关于pen">4.4 About pen</h3>
<p>More specifically, larger values of <em><strong>pen</strong></em> make the algorithm detect fewer change points, while smaller values make it detect more.</p>
<p>You can usually tune the <em><strong>pen</strong></em> value to your own data and problem. Some common cases and suggestions:</p>
<ol>
<li>To detect fewer change points and capture only the major trend reversals, choose a larger <em><strong>pen</strong></em> value.</li>
<li>To detect more change points and capture subtle variation in the data, choose a smaller <em><strong>pen</strong></em> value.</li>
<li>If you are unsure which value to pick, try several and choose the most suitable one based on the quality of the results and your actual needs.</li>
</ol>
<p>In practice, tuning the <em><strong>pen</strong></em> parameter takes some experimentation and experience, because the best value depends on your data and analysis goals. Try different values and pick the most suitable one based on the detection results and domain knowledge.</p>
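<p>The effect of pen is easy to see in a toy version of the penalized objective that Pelt optimizes. The sketch below solves it by brute-force dynamic programming (essentially Pelt without the pruning step, using a squared-error cost): each segment adds a fixed penalty of pen, so larger penalties yield fewer change points. The function names and the tiny signal are invented for the example.</p>

```python
def seg_cost(xs, i, j):
    # squared-error cost of fitting xs[i:j] with its mean
    m = sum(xs[i:j]) / (j - i)
    return sum((x - m) ** 2 for x in xs[i:j])

def penalized_bkps(xs, pen):
    """Optimal penalized segmentation by dynamic programming (unpruned Pelt)."""
    n = len(xs)
    best = [0.0] + [float("inf")] * n  # best[t]: minimal cost of segmenting xs[:t]
    prev = [0] * (n + 1)               # prev[t]: start of the last segment
    for t in range(1, n + 1):
        for s in range(t):
            cand = best[s] + seg_cost(xs, s, t) + pen
            if cand < best[t]:
                best[t], prev[t] = cand, s
    # backtrack from the end to recover the breakpoints
    bkps, t = [], n
    while t > 0:
        bkps.append(t)
        t = prev[t]
    return sorted(bkps)

sig = [0.0] * 10 + [5.0] * 10 + [0.0] * 10
print(penalized_bkps(sig, pen=1.0))    # small penalty keeps both jumps → [10, 20, 30]
print(penalized_bkps(sig, pen=500.0))  # large penalty suppresses them → [30]
```

<p>As in ruptures, the final index (the series length) is always reported, so a result of [30] means no change point was kept.</p>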
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Multi-plot layout with the patchwork package</title>
      <link>https://textdata.cn/blog/2023-11-25-r-patchwork/</link>
      <pubDate>Sat, 25 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-25-r-patchwork/</guid>
      <description>&lt;h2 id=&#34;一问题&#34;&gt;1. The Problem&lt;/h2&gt;
&lt;p&gt;How can multiple plots be combined into a single figure, as shown below?&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;Fortunately, both R and Python have a solution: the patchwork package and the patchworklib library, respectively.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二r语言&#34;&gt;2. R&lt;/h2&gt;
&lt;p&gt;Installation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# install.packages(&amp;#34;devtools&amp;#34;)
devtools::install_github(&amp;#34;thomasp85/patchwork&amp;#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;To place two plots side by side in one row, just load patchwork and add the plots together.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggplot2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;patchwork&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;p1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mtcars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_point&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mpg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;disp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;p2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mtcars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_boxplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gear&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;disp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;group&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gear&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;p1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;p2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;Two rows: three plots in the first row and one in the second.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;p3&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mtcars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_smooth&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;disp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;qsec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;p4&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mtcars&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;carb&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;p1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;p2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;p3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;
      &lt;span class=&#34;n&#34;&gt;p4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三python&#34;&gt;3. Python&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/ponnhide/patchworklib&#34;&gt;Patchworklib &lt;/a&gt;is a universal layout editor for matplotlib-related plots (plain matplotlib plots, seaborn plots at both the axes and figure level, and plotnine plots). The library is inspired by &lt;a href=&#34;https://patchwork.data-imaginist.com/&#34;&gt;patchwork&lt;/a&gt; for ggplot2, so, just as in the original, users can align matplotlib plots using only the &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; operators.&lt;/p&gt;
&lt;p&gt;With patchworklib, any kind of seaborn or plotnine plot can be handled as a matplotlib subplot. Installation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install patchworklib
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;patchworklib&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pw&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;seaborn&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sns&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;fmri&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;fmri&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pw&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Brick&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lineplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;timepoint&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;signal&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;hue&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;region&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;event&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fmri&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bbox_to_anchor&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;loc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;upper left&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;ax1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
 
&lt;span class=&#34;n&#34;&gt;titanic&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;titanic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pw&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Brick&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;barplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;sex&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;survived&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;hue&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;class&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;titanic&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ax&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;move_legend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;new_loc&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;upper left&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bbox_to_anchor&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;ax2&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax12&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax12&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;ax12.png&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/ax12.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#ax1, ax2, ax4 plotting steps omitted&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax124&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ax1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax4&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax124&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;../img/ax124.png&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/ax124.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#ax124, ax3, ax5 plotting steps omitted&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax12435&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ax124&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax3&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ax5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax12435&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;savefig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;../img/ax12435.png&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/ax12435.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一问题">1. The Problem</h2>
<p>How can several plots be combined into a single figure, as shown below?</p>
<p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<p>Fortunately, both R and Python offer a solution: the patchwork package and the patchworklib library, respectively.</p>
<p><br><br></p>
<h2 id="二r语言">二、R语言</h2>
<p>安装</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># install.packages(&#34;devtools&#34;)
devtools::install_github(&#34;thomasp85/patchwork&#34;)
</code></pre></div><br>
<p>To place two plots side by side in one row, simply load patchwork and add the two plots together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">patchwork</span><span class="p">)</span>

<span class="n">p1</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span> <span class="o">+</span> <span class="nf">geom_point</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="n">disp</span><span class="p">))</span>
<span class="n">p2</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span> <span class="o">+</span> <span class="nf">geom_boxplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">gear</span><span class="p">,</span> <span class="n">disp</span><span class="p">,</span> <span class="n">group</span> <span class="o">=</span> <span class="n">gear</span><span class="p">))</span>

<span class="n">p1</span> <span class="o">+</span> <span class="n">p2</span>
</code></pre></div><p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<br>
<p>Two rows, with three plots in the first row and one in the second:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">p3</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span> <span class="o">+</span> <span class="nf">geom_smooth</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">disp</span><span class="p">,</span> <span class="n">qsec</span><span class="p">))</span>
<span class="n">p4</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span> <span class="o">+</span> <span class="nf">geom_bar</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">carb</span><span class="p">))</span>

<span class="p">(</span><span class="n">p1</span> <span class="o">|</span> <span class="n">p2</span> <span class="o">|</span> <span class="n">p3</span><span class="p">)</span> <span class="o">/</span>
      <span class="n">p4</span>
</code></pre></div><p><img loading="lazy" src="img/2.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三python">三、Python</h2>
<p><a href="https://github.com/ponnhide/patchworklib">Patchworklib </a>是与 matplotlib 相关的绘图（简单 matplotlib 绘图、Seaborn 绘图（轴级和图形级）和plotnine 绘图）的通用编辑器。这个库的灵感来自于 ggplot2 的<a href="https://patchwork.data-imaginist.com/">patchwork</a>。因此，作为原始拼凑，用户可以轻松地仅使用 <code>/</code>和 <code>|</code> 对齐 matplotlib 图。</p>
<p>Patchworklib 提供了该问题的解决方案。通过使用 patchworklib，任何类型的seaborn 和plotnine 图都可以作为matplotlib 子图进行处理。安装</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install patchworklib
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">patchworklib</span> <span class="k">as</span> <span class="nn">pw</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span> 

<span class="n">fmri</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s2">&#34;fmri&#34;</span><span class="p">)</span>
<span class="n">ax1</span> <span class="o">=</span> <span class="n">pw</span><span class="o">.</span><span class="n">Brick</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">lineplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s2">&#34;timepoint&#34;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s2">&#34;signal&#34;</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="s2">&#34;region&#34;</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="s2">&#34;event&#34;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">fmri</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s1">&#39;upper left&#39;</span><span class="p">)</span>
<span class="n">ax1</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s2">&#34;ax1&#34;</span><span class="p">)</span>
 
<span class="n">titanic</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s2">&#34;titanic&#34;</span><span class="p">)</span>
<span class="n">ax2</span> <span class="o">=</span> <span class="n">pw</span><span class="o">.</span><span class="n">Brick</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s2">&#34;sex&#34;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s2">&#34;survived&#34;</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="s2">&#34;class&#34;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">titanic</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span><span class="p">)</span>
<span class="n">ax2</span><span class="o">.</span><span class="n">move_legend</span><span class="p">(</span><span class="n">new_loc</span><span class="o">=</span><span class="s1">&#39;upper left&#39;</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">))</span>
<span class="n">ax2</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s2">&#34;ax2&#34;</span><span class="p">)</span>

<span class="n">ax12</span> <span class="o">=</span> <span class="n">ax1</span><span class="o">|</span><span class="n">ax2</span>
<span class="n">ax12</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s2">&#34;ax12.png&#34;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/ax12.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#省略 ax1、ax2、ax4绘制过程</span>

<span class="n">ax124</span> <span class="o">=</span> <span class="n">ax1</span><span class="o">|</span><span class="n">ax2</span><span class="o">|</span><span class="n">ax4</span>
<span class="n">ax124</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s2">&#34;../img/ax124.png&#34;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/ax124.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#省略 ax124、ax3、ax5绘制过程</span>
<span class="n">ax12435</span> <span class="o">=</span> <span class="n">ax124</span><span class="o">/</span><span class="p">(</span><span class="n">ax3</span><span class="o">|</span><span class="n">ax5</span><span class="p">)</span>
<span class="n">ax12435</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s2">&#34;../img/ax12435.png&#34;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/ax12435.png" alt=""  />
</p>
<p><br><br></p>
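<p>The <code>|</code> and <code>/</code> grammar used above comes down to Python operator overloading (<code>__or__</code> and <code>__truediv__</code>). A minimal, hypothetical sketch of that mechanism (not patchworklib's actual implementation), using a toy Brick class:</p>

```python
class Brick:
    """Toy stand-in for a patchworklib Brick: records how plots are composed."""
    def __init__(self, name):
        self.layout = name

    def __or__(self, other):       # a | b : place side by side
        return Brick(f"({self.layout} beside {other.layout})")

    def __truediv__(self, other):  # a / b : stack vertically
        return Brick(f"({self.layout} above {other.layout})")

ax1, ax2, ax3, ax4, ax5 = (Brick(n) for n in ("ax1", "ax2", "ax3", "ax4", "ax5"))
combined = (ax1 | ax2 | ax4) / (ax3 | ax5)
print(combined.layout)  # (((ax1 beside ax2) beside ax4) above (ax3 beside ax5))
```

<p>Because <code>/</code> binds tighter than <code>|</code> in Python, parentheses are needed to group a row before stacking, which is why the examples write <code>ax124/(ax3|ax5)</code>.</p>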
]]></content:encoded>
    </item>
    
    <item>
      <title>Correlation Analysis | Mining More Relationships Between Features via Model Prediction</title>
      <link>https://textdata.cn/blog/2023-11-25-ppsr-predictive-power-sccore/</link>
      <pubDate>Sat, 25 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-25-ppsr-predictive-power-sccore/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Author: Spectator&lt;/p&gt;
&lt;p&gt;Link: &lt;a href=&#34;https://zhuanlan.zhihu.com/p/557403755&#34;&gt;https://zhuanlan.zhihu.com/p/557403755&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;一pps&#34;&gt;1. PPS&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Predictive Power Score (PPS)&lt;/strong&gt; is an asymmetric, data-type-agnostic score that detects linear or non-linear relationships between two variables. It ranges from 0 (no predictive power) to 1 (perfect predictive power). Unlike Pearson correlation, it can handle non-linear relationships, categorical data, and asymmetric relationships, e.g. cases where variable A tells us more about variable B than B tells us about A.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二问题&#34;&gt;2. The Problem&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Correlation analysis&lt;/strong&gt; studies two or more related variables to measure how strongly they are associated. When we do not know what the features of a dataset mean, a common first step is to run a correlation analysis and inspect the correlation coefficients between features.&lt;/p&gt;
&lt;p&gt;In statistics, the standard tool is the &lt;strong&gt;Pearson product-moment correlation coefficient&lt;/strong&gt;, which measures the linear correlation between two variables X and Y. The coefficient is the covariance divided by the product of the standard deviations, i.e. a normalized measure of covariance, so its value always lies between -1 and 1. A coefficient of 1 means X and Y have a strong positive linear relationship: all data points lie approximately on a straight line, with Y increasing as X increases. A coefficient of -1 means all data points lie on a straight line, with Y decreasing as X increases. A coefficient of 0 means there is no linear relationship between the two variables. The Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/person-formular.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;Because the Pearson coefficient measures only linear relationships, it cannot detect non-linear relationships in the data, as the example below shows.&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/pps_01.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;By its definition, the Pearson coefficient is symmetric: P(A,B) = P(B,A). In the real world, however, relationships between features are often asymmetric; for example, I can infer which city you live in from your phone number, but I cannot infer your phone number from your city. Moreover, when a feature is not a numeric vector, e.g. a one-hot vector, the Pearson coefficient cannot handle it at all.&lt;/p&gt;
&lt;p&gt;In summary, the widely used Pearson coefficient has the following limitations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;it measures only linear relationships;&lt;/li&gt;
&lt;li&gt;the relationship it measures is symmetric;&lt;/li&gt;
&lt;li&gt;it cannot handle non-numeric vectors.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the PPS of &#34;x predicts y&#34;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The score is always between 0 and 1 and is independent of data type.&lt;/li&gt;
&lt;li&gt;A score of 0 means column x cannot predict column y any better than a naive baseline model.&lt;/li&gt;
&lt;li&gt;A score of 1 means column x predicts column y perfectly given the model.&lt;/li&gt;
&lt;li&gt;A score between 0 and 1 is the fraction of the potential predictive power that the model achieves relative to the baseline model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PPS implementations exist for both Python and R: the &lt;a href=&#34;https://github.com/8080labs/ppscore&#34;&gt;ppscore library&lt;/a&gt; and the &lt;a href=&#34;https://github.com/paulvanderlaken/ppsr&#34;&gt;ppsr package&lt;/a&gt;, respectively. This post uses ppsr as the example.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三ppsr用法&#34;&gt;3. Using ppsr&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ppsr&lt;/code&gt; package has four main functions for computing the PPS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ppsr::score()&lt;/code&gt; computes the x-y PPS&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ppsr::score_matrix()&lt;/code&gt; computes all X-Y PPS values and displays them in a matrix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ppsr::visualize_pps&lt;/code&gt; plots the PPS score matrix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ppsr::visualize_correlations&lt;/code&gt; plots the correlation matrix&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; denote a single predictor/target, while &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; denote all predictors/targets in a given dataset.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;31-安装&#34;&gt;3.1 Installation&lt;/h3&gt;
&lt;p&gt;To install ppsr in R, open a console and run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;install.packages(&amp;#39;ppsr&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-score&#34;&gt;3.2 score()&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;score()&lt;/code&gt; computes the PPS for a single target and predictor.&lt;/p&gt;
&lt;p&gt;For example, computing the PPS of x predicting y with a decision-tree regression model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;Sepal.Length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;Petal.Length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;algorithm&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;tree&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#39;pps&amp;#39;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]]&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; [1] 0.6160836&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Computing the PPS of x predicting y with a generalized linear model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;Sepal.Length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;Petal.Length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;algorithm&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;glm&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;[[&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#39;pps&amp;#39;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]]&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; [1] 0.5441131&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-score_matrix&#34;&gt;3.3 score_matrix()&lt;/h3&gt;
&lt;p&gt;Analogous to a Pearson correlation matrix:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;score_matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt;              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; Sepal.Length   1.00000000  0.04632352    0.5491398   0.4127668 0.4075487&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; Sepal.Width    0.06790301  1.00000000    0.2376991   0.2174659 0.2012876&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; Petal.Length   0.61608360  0.24263851    1.0000000   0.7917512 0.7904907&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; Petal.Width    0.48735314  0.20124105    0.7437845   1.0000000 0.7561113&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#&amp;gt; Species        0.55918638  0.31344008    0.9167580   0.9398532 1.0000000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-可视化&#34;&gt;3.4 Visualization&lt;/h3&gt;
&lt;p&gt;The PPS score matrix:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;visualize_pps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/pps-matrix-heat.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;The correlation matrix:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;visualize_correlations&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/correlation-heatmap.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;Generate the PPS and correlation matrices side by side for easy comparison.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;n&#34;&gt;ppsr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;visualize_both&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/pps-correlation-heat.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四pps应用&#34;&gt;4. Applications of PPS&lt;/h2&gt;
&lt;p&gt;Having seen the advantages of PPS, let us look at where it can be used in practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Finding patterns in the data: PPS detects every relationship that correlation finds, and more, so you can use the PPS matrix in place of a correlation matrix to detect and understand linear or non-linear patterns, across data types, with a single score that is always between 0 and 1.&lt;/li&gt;
&lt;li&gt;Feature selection: alongside your usual feature-selection machinery, the PPS helps you find good predictors for a target column. It also lets you drop features that only add random noise (these sometimes still score highly on feature-importance metrics) and features that other features already predict, since they add no new information. You can further identify pairs of mutually predictive features in the PPS matrix; this covers strongly correlated features but also detects non-linear relationships.&lt;/li&gt;
&lt;li&gt;Detecting information leakage: use the PPS matrix to detect leakage between variables, even when it is mediated by other variables.&lt;/li&gt;
&lt;li&gt;Data normalization: find entity structures in the data by interpreting the PPS matrix as a directed graph. This can be surprising when the data contains previously unknown latent structures.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<blockquote>
<p>Author: Spectator</p>
<p>Link: <a href="https://zhuanlan.zhihu.com/p/557403755">https://zhuanlan.zhihu.com/p/557403755</a></p>
</blockquote>
<h2 id="一pps">1. PPS</h2>
<p><strong>Predictive Power Score (PPS)</strong> is an asymmetric, data-type-agnostic score that detects linear or non-linear relationships between two variables. It ranges from 0 (no predictive power) to 1 (perfect predictive power). Unlike Pearson correlation, it can handle non-linear relationships, categorical data, and asymmetric relationships, e.g. cases where variable A tells us more about variable B than B tells us about A.</p>
<p><br><br></p>
<h2 id="二问题">2. The Problem</h2>
<p><strong>Correlation analysis</strong> studies two or more related variables to measure how strongly they are associated. When we do not know what the features of a dataset mean, a common first step is to run a correlation analysis and inspect the correlation coefficients between features.</p>
<p>In statistics, the standard tool is the <strong>Pearson product-moment correlation coefficient</strong>, which measures the linear correlation between two variables X and Y. The coefficient is the covariance divided by the product of the standard deviations, i.e. a normalized measure of covariance, so its value always lies between -1 and 1. A coefficient of 1 means X and Y have a strong positive linear relationship: all data points lie approximately on a straight line, with Y increasing as X increases. A coefficient of -1 means all data points lie on a straight line, with Y decreasing as X increases. A coefficient of 0 means there is no linear relationship between the two variables. The Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations:</p>
<p><img loading="lazy" src="img/person-formular.png" alt=""  />
</p>
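<p>The definition can be checked numerically with a few lines of pure Python; a small sketch (an addition, not part of the original post):</p>

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))          # ~1: perfectly linear
print(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0: quadratic, no linear trend
```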
<p>Because the Pearson coefficient measures only linear relationships, it cannot detect non-linear relationships in the data, as the example below shows.</p>
<p><img loading="lazy" src="img/pps_01.png" alt=""  />
</p>
<p>By its definition, the Pearson coefficient is symmetric: P(A,B) = P(B,A). In the real world, however, relationships between features are often asymmetric; for example, I can infer which city you live in from your phone number, but I cannot infer your phone number from your city. Moreover, when a feature is not a numeric vector, e.g. a one-hot vector, the Pearson coefficient cannot handle it at all.</p>
<p>In summary, the widely used Pearson coefficient has the following limitations:</p>
<ol>
<li>it measures only linear relationships;</li>
<li>the relationship it measures is symmetric;</li>
<li>it cannot handle non-numeric vectors.</li>
</ol>
<p>For the PPS of "x predicts y":</p>
<ul>
<li>The score is always between 0 and 1 and is independent of data type.</li>
<li>A score of 0 means column x cannot predict column y any better than a naive baseline model.</li>
<li>A score of 1 means column x predicts column y perfectly given the model.</li>
<li>A score between 0 and 1 is the fraction of the potential predictive power that the model achieves relative to the baseline model.</li>
</ul>
<p>PPS implementations exist for both Python and R: the <a href="https://github.com/8080labs/ppscore">ppscore library</a> and the <a href="https://github.com/paulvanderlaken/ppsr">ppsr package</a>, respectively. This post uses ppsr as the example.</p>
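<p>The normalization described above (model error versus a naive baseline) can be sketched in pure Python. The snippet below is a deliberately simplified illustration: it uses a binned-mean "model" and a median baseline with in-sample MAE, whereas real implementations such as ppscore use cross-validated decision trees:</p>

```python
from statistics import mean, median

def toy_pps(xs, ys, bins=5):
    """Toy regression PPS: 1 - MAE(model) / MAE(naive median baseline), floored at 0."""
    base = median(ys)
    mae_naive = mean(abs(y - base) for y in ys)
    if mae_naive == 0:
        return 0.0  # y is constant: nothing to predict
    lo, width = min(xs), (max(xs) - min(xs)) / bins or 1.0
    bucket = lambda x: min(int((x - lo) / width), bins - 1)
    # "model": predict the mean of y within equal-width bins of x
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(bucket(x), []).append(y)
    preds = {b: mean(v) for b, v in groups.items()}
    mae_model = mean(abs(y - preds[bucket(x)]) for x, y in zip(xs, ys))
    return max(0.0, 1 - mae_model / mae_naive)

# Quadratic relationship: linear correlation is ~0, but the PPS is clearly positive.
xs = [i / 10 for i in range(-50, 51)]
print(toy_pps(xs, [x * x for x in xs]))
```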
<p><br><br></p>
<h2 id="三ppsr用法">三、ppsr用法</h2>
<p>该<code>ppsr</code>软件包有四个主要函数来计算 PPS：</p>
<ul>
<li><code>ppsr::score()</code>计算 xy PPS</li>
<li><code>ppsr::score_matrix()</code>计算所有 XY PPS，并将它们显示在矩阵中</li>
<li><code>ppsr::visualize_pps</code>   pps得分矩阵</li>
<li><code>ppsr::visualize_correlations</code>  相关矩阵</li>
</ul>
<p>其中<code>x</code>和<code>y</code>代表单个预测变量/目标，并且<code>X</code>和 <code>Y</code>代表给定数据集中的所有预测变量/目标。</p>
<br>
<h3 id="31-安装">3.1 安装</h3>
<p>在R中安装ppsr，打开命令行， 执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">install.packages(&#39;ppsr&#39;)
</code></pre></div><br>
<h3 id="32-score">3.2 score()</h3>
<p><code>score()</code>计算单个目标和预测变量的 PPS</p>
<p>例如，使用决策树回归模型计算 x预测y 的PPS……</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">score</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="s">&#39;Sepal.Length&#39;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#39;Petal.Length&#39;</span><span class="p">,</span> <span class="n">algorithm</span> <span class="o">=</span> <span class="s">&#39;tree&#39;</span><span class="p">)</span><span class="n">[[</span><span class="s">&#39;pps&#39;</span><span class="n">]]</span>
<span class="c1">#&gt; [1] 0.6160836</span>
</code></pre></div><br>
<p>Computing the PPS of x predicting y with a generalized linear model:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">score</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="s">&#39;Sepal.Length&#39;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#39;Petal.Length&#39;</span><span class="p">,</span> <span class="n">algorithm</span> <span class="o">=</span> <span class="s">&#39;glm&#39;</span><span class="p">)</span><span class="n">[[</span><span class="s">&#39;pps&#39;</span><span class="n">]]</span>
<span class="c1">#&gt; [1] 0.5441131</span>
</code></pre></div><br>
<h3 id="33-score_matrix">3.3 score_matrix()</h3>
<p>类似于Pearson相关矩阵</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">score_matrix</span><span class="p">(</span><span class="n">df</span> <span class="o">=</span> <span class="n">iris</span><span class="p">)</span>
<span class="c1">#&gt;              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species</span>
<span class="c1">#&gt; Sepal.Length   1.00000000  0.04632352    0.5491398   0.4127668 0.4075487</span>
<span class="c1">#&gt; Sepal.Width    0.06790301  1.00000000    0.2376991   0.2174659 0.2012876</span>
<span class="c1">#&gt; Petal.Length   0.61608360  0.24263851    1.0000000   0.7917512 0.7904907</span>
<span class="c1">#&gt; Petal.Width    0.48735314  0.20124105    0.7437845   1.0000000 0.7561113</span>
<span class="c1">#&gt; Species        0.55918638  0.31344008    0.9167580   0.9398532 1.0000000</span>
</code></pre></div><br>
<h3 id="34-可视化">3.4 可视化</h3>
<p>pps得分矩阵</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">visualize_pps</span><span class="p">(</span><span class="n">df</span> <span class="o">=</span> <span class="n">iris</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/pps-matrix-heat.png" alt=""  />
</p>
<br>
<p>The correlation matrix:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">visualize_correlations</span><span class="p">(</span><span class="n">df</span> <span class="o">=</span> <span class="n">iris</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/correlation-heatmap.png" alt=""  />
</p>
<p>Generate the PPS and correlation matrices side by side for easy comparison.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="n">ppsr</span><span class="o">::</span><span class="nf">visualize_both</span><span class="p">(</span><span class="n">df</span> <span class="o">=</span> <span class="n">iris</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/pps-correlation-heat.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四pps应用">四、PPS应用</h2>
<p>PPS的应用，了解了 PPS 的优点之后，我们来看看在现实生活中我们可以在哪些地方使用 PPS：</p>
<ul>
<li>查找数据中的模式： PPS 查找相关性发现的每一个关系，甚至更多。因此，您可以使用 PPS 矩阵替代相关矩阵来检测和理解数据中的线性或非线性模式。使用始终在 0 到 1 之间的单个分数跨数据类型是可能的。</li>
<li>特征选择：除了您通常的特征选择机制外，您还可以使用预测能力得分来为您的目标列找到好的预测变量。此外，您可以消除仅添加随机噪声的功能。这些特征有时在特征重要性指标上仍然得分很高。此外，您可以消除其他特征可以预测的特征，因为它们不会添加新信息。此外，您可以识别 PPS 矩阵中的相互预测特征对——这包括强相关特征，但也将检测非线性关系。</li>
<li>检测信息泄露：使用 PPS 矩阵检测变量之间的信息泄露——即使信息泄露是通过其他变量介导的。</li>
<li>数据规范化：通过将 PPS 矩阵解释为有向图来查找数据中的实体结构。当数据包含以前未知的潜在结构时，这可能会令人惊讶。</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>2TB Dataset | Collecting GitHub Community User Data with GH Archive</title>
      <link>https://textdata.cn/blog/2023-11-22-open-dataset-gharchive-org/</link>
      <pubDate>Wed, 22 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-22-open-dataset-gharchive-org/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;For research purposes and demonstration only. For any questions, add WeChat 372335839 with the note &#34;Name-School-Major&#34;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一github&#34;&gt;1. GitHub&lt;/h2&gt;
&lt;p&gt;GitHub is a representative developer community that supports online software development and has attracted more than 31 million developers worldwide. GitHub treats every user activity as an event, such as a create event for a new repository or branch; in total, GitHub supports 42 event types. Typical user activities include creating a new repository, cloning an existing one, pulling a repository's latest changes from GitHub, and committing local changes and pushing them to a shared repository.&lt;/p&gt;
&lt;p&gt;Through GitHub, developers communicate with one another and assign and claim programming tasks by posting issues under a repository. The usual &#34;follow&#34; feature is also supported, letting users receive status updates from any user on the platform. In these online communities, interaction among developers centers on collaborative development and code sharing, forming a distinctive kind of social network. &lt;strong&gt;These characteristics make GitHub data useful across a wide range of research areas, including but not limited to technological innovation, organizational management, and social media&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二gh-archive&#34;&gt;2. GH Archive&lt;/h2&gt;
&lt;p&gt;The most obvious way to obtain GitHub data is the site's API. GitHub provides a free API, but hourly request limits apply (60 requests for anonymous users, 5000 for authenticated users). For large-scale data analysis these limits are far too tight to collect a sizable dataset in a short time.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GH Archive activity archives are available from February 12, 2011 onward.&lt;/li&gt;
&lt;li&gt;Archives between February 12, 2011 and December 31, 2014 were recorded via the (now deprecated) Timeline API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Archives from January 1, 2015 onward are recorded via the Events API&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;可供下载的GH Archive数据集总体积远超 2T，按年度统计如下：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;4.6G    2011
13G     2012
26G     2013
57G     2014
75G     2015
112G    2016
145G    2017
177G    2018
254G    2019
420G    2020
503G    2021
657G    2022
很大     2023
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;21--资源网址规律&#34;&gt;2.1  资源网址规律&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;GH Archive&lt;/strong&gt; 是一个开源项目，用于记录公共GitHub时间轴，对其进行存档，并使其易于访问以进行进一步分析。GH Archive 将抓取到的所有 GitHub events 信息存储在一组JSON文件中，以便根据需要下载并离线处理。&lt;strong&gt;GH Archive&lt;/strong&gt; 数据以小时为粒度存档。&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;数据获取任务&lt;/th&gt;
&lt;th&gt;命令行下载命令&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;获取2021.11.21下午4点(世界标准时间)的数据&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;wget https://data.gharchive.org/2021-11-21-16.json.gz&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;获取2021.11.21的数据&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;wget https://data.gharchive.org/2021-11-21-{0..23}.json.gz&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;获取2021.11月的数据&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;wget https://data.gharchive.org/2021-11-{01..30}-{0..23}.json.gz&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;每个下载下来的数据都是&lt;code&gt;.gz&lt;/code&gt;的压缩文件，解压后会得到 &lt;code&gt;.json&lt;/code&gt;文件。 &lt;strong&gt;需要注意， 一个小时的数据大概百兆级别， 如果是整天、整月，json文件会非常大。 建议以小时为粒度进行数据采集&lt;/strong&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-构造urls&#34;&gt;2.2 构造urls&lt;/h3&gt;
&lt;p&gt;假设我们要批量自动下载数据，可以用python生成有规律的url列表，然后用requests下载并存储对应的&lt;code&gt;.gz&lt;/code&gt;文件。 &lt;strong&gt;假设我们需要采集 2021年11月21日全天的数据， 使用小时粒度存储数据集&lt;/strong&gt;。 &lt;strong&gt;需要注意， 本文教程默认是在jupyter notebook中撰写运行&lt;/strong&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;requests&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;2021-11-21&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;urls&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;hour&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;24&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;https://data.gharchive.org/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;date&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hour&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;.json.gz&amp;#39;&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;urls&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
&lt;span class=&#34;n&#34;&gt;urls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;
https://data.gharchive.org/2021-11-21-0.json.gz
https://data.gharchive.org/2021-11-21-1.json.gz
https://data.gharchive.org/2021-11-21-2.json.gz

...
...
https://data.gharchive.org/2021-11-21-20.json.gz
https://data.gharchive.org/2021-11-21-21.json.gz
https://data.gharchive.org/2021-11-21-22.json.gz
https://data.gharchive.org/2021-11-21-23.json.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23--python下载&#34;&gt;2.3  python下载&lt;/h3&gt;
&lt;p&gt;使用requests库下载一个数据集&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;requests&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;download&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;file&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;/&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requests&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;gf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
     
&lt;span class=&#34;c1&#34;&gt;#尝试下载&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;https://data.gharchive.org/2021-11-21-0.json.gz&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;download&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;批量下载2021年11月21日全天的数据， 使用小时粒度存储数据集。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;urls&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;download&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三读取操作&#34;&gt;三、读取操作&lt;/h2&gt;
&lt;h3 id=&#34;31-数据解压&#34;&gt;3.1 数据解压&lt;/h3&gt;
&lt;p&gt;得到的 &lt;code&gt;.gz&lt;/code&gt;数据可以使用以下代码进行解压，解压后会得到 &lt;code&gt;.json&lt;/code&gt; 数据文件。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import os
import gzip

gz_fs = [f for f in os.listdir(&amp;#39;.&amp;#39;) if &amp;#39;.gz&amp;#39; in f]
for gz_f in gz_fs:
    file = gz_f.replace(&amp;#39;.gz&amp;#39;, &amp;#39;&amp;#39;)
    content = gzip.GzipFile(gz_f).read()
    with open(file, &amp;#39;wb&amp;#39;) as jsonf:
        jsonf.write(content)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-读取json&#34;&gt;3.2 读取json&lt;/h3&gt;
&lt;p&gt;因为数据文件都很大，一次性读取会很消耗时间， 推荐阅读 &lt;a href=&#34;https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/&#34;&gt;&lt;strong&gt;如何处理远超电脑内存的csv文件&lt;/strong&gt;&lt;/a&gt; 。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pd.read_json(jsonf, nrows, lines, chunksize)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;jsonf: 文件路径&lt;/li&gt;
&lt;li&gt;nrows: 读取前nrows行（仅在 lines=True 时可用）&lt;/li&gt;
&lt;li&gt;lines: 以行的方式读取，默认False&lt;/li&gt;
&lt;li&gt;chunksize: 分批次读取，每批次的规模是chunksize行&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;321-读取前n行&#34;&gt;3.2.1 读取前n行&lt;/h3&gt;
&lt;p&gt;使用pandas读取 &lt;code&gt;2021-11-21-0.json&lt;/code&gt;  &lt;strong&gt;前5条数据， 了解下数据集的字段&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2021-11-21-0.json&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;322-查看折叠的字段&#34;&gt;3.2.2 查看折叠的字段&lt;/h3&gt;
&lt;p&gt;乍一看好像没啥数据，其实都折叠在字段之中。以actor为例，我们看看内部会折叠哪些字段&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;actor&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;array([
{&amp;#39;id&amp;#39;: 5355937, 
&amp;#39;login&amp;#39;: &amp;#39;austinkregel&amp;#39;, 
&amp;#39;display_login&amp;#39;: &amp;#39;austinkregel&amp;#39;, 
&amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;, 
&amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/austinkregel&amp;#39;, 
&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/5355937?&amp;#39;},

{&amp;#39;id&amp;#39;: 89859977, 
&amp;#39;login&amp;#39;: &amp;#39;Nicoperez19&amp;#39;, 
&amp;#39;display_login&amp;#39;: &amp;#39;Nicoperez19&amp;#39;, 
&amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;, 
&amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/Nicoperez19&amp;#39;, 
&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/89859977?&amp;#39;},

{&amp;#39;id&amp;#39;: 46858494, 
&amp;#39;login&amp;#39;: &amp;#39;kapone3047&amp;#39;, 
&amp;#39;display_login&amp;#39;: &amp;#39;kapone3047&amp;#39;, 
&amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;, 
&amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/kapone3047&amp;#39;, 
&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/46858494?&amp;#39;},

       
 {&amp;#39;id&amp;#39;: 1843851, 
 &amp;#39;login&amp;#39;: &amp;#39;DerekEdwards&amp;#39;, 
 &amp;#39;display_login&amp;#39;: &amp;#39;DerekEdwards&amp;#39;, 
 &amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;, 
 &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/DerekEdwards&amp;#39;, 
 &amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/1843851?&amp;#39;},
 
{&amp;#39;id&amp;#39;: 94767098, 
&amp;#39;login&amp;#39;: &amp;#39;hectorapweb&amp;#39;, 
&amp;#39;display_login&amp;#39;: &amp;#39;hectorapweb&amp;#39;, 
&amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;, 
&amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/hectorapweb&amp;#39;, 
&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/94767098?&amp;#39;}

],dtype=object)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;323-恢复一个折叠的信息&#34;&gt;3.2.3 恢复一个折叠的信息&lt;/h3&gt;
&lt;p&gt;以actor为例&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;df[&amp;#39;actor&amp;#39;].apply(lambda x: pd.Series(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;324-合并结果&#34;&gt;3.2.4 合并结果&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;_ = df[&amp;#39;actor&amp;#39;].apply(lambda x: pd.Series(x))
df = pd.concat([df, _], axis=1)
df
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;更新后的df含有的字段有&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;df.columns
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Index([&amp;#39;id&amp;#39;, &amp;#39;type&amp;#39;, &amp;#39;actor&amp;#39;, &amp;#39;repo&amp;#39;, &amp;#39;payload&amp;#39;, &amp;#39;public&amp;#39;, &amp;#39;created_at&amp;#39;, &amp;#39;org&amp;#39;,
       &amp;#39;id&amp;#39;, &amp;#39;login&amp;#39;, &amp;#39;display_login&amp;#39;, &amp;#39;gravatar_id&amp;#39;, &amp;#39;url&amp;#39;, &amp;#39;avatar_url&amp;#39;],
      dtype=&amp;#39;object&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四相关数据集&#34;&gt;四、相关数据集&lt;/h2&gt;
&lt;h3 id=&#34;github-1000万用户&#34;&gt;Github 1000万用户&lt;/h3&gt;
&lt;p&gt;Gong, Q., Zhang, J., Chen, Y., Li, Q., &lt;a href=&#34;https://research.aalto.fi/en/persons/yu-xiao&#34;&gt;Xiao, Y.&lt;/a&gt;, Wang, X. &amp;amp; Hui, P., Nov 2019, &lt;em&gt;CIKM &amp;lsquo;19:Proceedings of the 28th ACM International Conference on Information and Knowledge Management.&lt;/em&gt; &lt;a href=&#34;https://research.aalto.fi/en/datasets/a-representative-user-centric-dataset-of-10-million-github-develo#&#34;&gt;ACM&lt;/a&gt;, p. 1251-1260 (ACM International Conference on Information &amp;amp; Knowledge Management).&lt;/p&gt;
&lt;p&gt;使用 GitHub API，我们构建了超过 1000 万 GitHub 用户的无偏数据集。该数据收集于2018年7月20日至8月27日期间，涵盖10,649,574名用户、118,602,740次提交和20,999,258个存储库。每个数据条目都以 JSON 格式存储，代表一个 GitHub 用户，并包含用户个人资料页面中的描述信息、其提交活动以及创建/分叉的公共存储库的信息。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;数据集下载地址&lt;/strong&gt; &lt;a href=&#34;https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT&#34;&gt;https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<blockquote>
<p>科研用途，仅供展示；如有任何问题，加微信372335839，备注「姓名-学校-专业」</p>
</blockquote>
<p><br><br></p>
<h2 id="一github">一、Github</h2>
<p>GitHub 是一个具有代表性的开发者社区，支撑了软件的在线协作开发，吸引了全球超过 3100 万开发者。GitHub 将每一次用户活动视为一个事件，例如创建新存储库、创建新分支等。GitHub 总共支持 42 种事件类型。典型的用户活动包括创建新存储库、克隆现有存储库、从 GitHub 拉取存储库的最新更改，以及提交本地所做的更改并将其推送到共享存储库。</p>
<p>通过 GitHub，开发人员可以相互交流，通过在存储库下发布问题来分配和领取编程任务。 此外，还支持常规的“关注”功能，允许用户接收该平台上任何用户的状态更新通知。 在这些在线社区中，开发者之间的互动主要集中在协作开发和代码共享上，形成了一种特殊的社交网络。<strong>这些特点使得github数据可用于广泛的研究领域，包括但不限于科技创新、组织管理、社交媒体等</strong>。</p>
<p><br><br></p>
<h2 id="二gh-archive">二、GH Archive</h2>
<p>获取github数据，我们最容易想到的是利用网站提供的api。github提供了免费的api接口，但每小时的请求数量有限制（匿名用户60次，授权用户5000次）。这对于想做大数据分析的我们而言，限制太多，短时间内难以获得大规模的数据。</p>
<ul>
<li>GHArchive活动档案自 2011 年 2 月 12 日起提供。</li>
<li>2011 年 2 月 12 日至 2014 年 12 月 31 日之间的活动档案是通过（现已弃用）时间线 API 记录的。</li>
<li><strong>从 2015 年 1 月 1 日开始的活动档案是通过事件 API 记录的</strong>。</li>
</ul>
<p>可供下载的GH Archive数据集总体积远超 2T，按年度统计如下：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">4.6G    2011
13G     2012
26G     2013
57G     2014
75G     2015
112G    2016
145G    2017
177G    2018
254G    2019
420G    2020
503G    2021
657G    2022
很大     2023
</code></pre></div><br>
<h3 id="21--资源网址规律">2.1  资源网址规律</h3>
<p><strong>GH Archive</strong> 是一个开源项目，用于记录公共GitHub时间轴，对其进行存档，并使其易于访问以进行进一步分析。GH Archive 将抓取到的所有 GitHub events 信息存储在一组JSON文件中，以便根据需要下载并离线处理。<strong>GH Archive</strong> 数据以小时为粒度存档。</p>
<table>
<thead>
<tr>
<th>数据获取任务</th>
<th>命令行下载命令</th>
</tr>
</thead>
<tbody>
<tr>
<td>获取2021.11.21下午4点(世界标准时间)的数据</td>
<td><strong><code>wget https://data.gharchive.org/2021-11-21-16.json.gz</code></strong></td>
</tr>
<tr>
<td>获取2021.11.21的数据</td>
<td><strong><code>wget https://data.gharchive.org/2021-11-21-{0..23}.json.gz</code></strong></td>
</tr>
<tr>
<td>获取2021.11月的数据</td>
<td><strong><code>wget https://data.gharchive.org/2021-11-{01..30}-{0..23}.json.gz</code></strong></td>
</tr>
</tbody>
</table>
<p>每个下载下来的数据都是<code>.gz</code>的压缩文件，解压后会得到 <code>.json</code>文件。 <strong>需要注意， 一个小时的数据大概百兆级别， 如果是整天、整月，json文件会非常大。 建议以小时为粒度进行数据采集</strong>。</p>
<br>
<h3 id="22-构造urls">2.2 构造urls</h3>
<p>假设我们要批量自动下载数据，可以用python生成有规律的url列表，然后用requests下载并存储对应的<code>.gz</code>文件。 <strong>假设我们需要采集 2021年11月21日全天的数据， 使用小时粒度存储数据集</strong>。 <strong>需要注意， 本文教程默认是在jupyter notebook中撰写运行</strong>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>

<span class="n">date</span> <span class="o">=</span> <span class="s1">&#39;2021-11-21&#39;</span>
<span class="n">urls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">hour</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">24</span><span class="p">):</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">&#39;https://data.gharchive.org/</span><span class="si">{</span><span class="n">date</span><span class="si">}</span><span class="s1">-</span><span class="si">{</span><span class="n">hour</span><span class="si">}</span><span class="s1">.json.gz&#39;</span>
    <span class="n">urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    
<span class="n">urls</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">
https://data.gharchive.org/2021-11-21-0.json.gz
https://data.gharchive.org/2021-11-21-1.json.gz
https://data.gharchive.org/2021-11-21-2.json.gz

...
...
https://data.gharchive.org/2021-11-21-20.json.gz
https://data.gharchive.org/2021-11-21-21.json.gz
https://data.gharchive.org/2021-11-21-22.json.gz
https://data.gharchive.org/2021-11-21-23.json.gz
</code></pre></div><br>
<h3 id="23--python下载">2.3  python下载</h3>
<p>使用requests库下载一个数据集</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>

<span class="k">def</span> <span class="nf">download</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">file</span> <span class="o">=</span> <span class="n">url</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">gf</span><span class="p">:</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
        <span class="n">gf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
     
<span class="c1">#尝试下载</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">&#39;https://data.gharchive.org/2021-11-21-0.json.gz&#39;</span>
<span class="n">download</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div><br>
<p>批量下载2021年11月21日全天的数据， 使用小时粒度存储数据集。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
    <span class="n">download</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
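上面的 download 函数是先打开文件再发请求，如果请求中途失败，磁盘上会留下一个空的或半截的 .gz 文件。下面是一个只用标准库的更稳健的写法（示意代码，`url_to_filename`、`.part` 临时文件等命名均为演示用的假设）：

```python
import urllib.request
from pathlib import Path

def url_to_filename(url: str) -> str:
    """从 url 中取出文件名，例如 2021-11-21-0.json.gz"""
    return url.rsplit('/', 1)[-1]

def download(url: str, timeout: float = 60.0) -> Path:
    """下载 url 到当前目录；已存在则跳过，失败时不会留下半截文件"""
    file = Path(url_to_filename(url))
    if file.exists():                      # 中断重跑时跳过已下载的小时文件
        return file
    tmp = file.with_name(file.name + '.part')
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        tmp.write_bytes(resp.read())
    tmp.rename(file)                       # 完整写入后再改名，保证原子性
    return file

print(url_to_filename('https://data.gharchive.org/2021-11-21-0.json.gz'))
# 2021-11-21-0.json.gz
```

先存到 `.part` 再改名，意味着目录下凡是以 `.json.gz` 结尾的文件都是完整的，中断后直接重跑循环即可续传。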
<h2 id="三读取操作">三、读取操作</h2>
<h3 id="31-数据解压">3.1 数据解压</h3>
<p>得到的 <code>.gz</code>数据可以使用以下代码进行解压，解压后会得到 <code>.json</code> 数据文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import os
import gzip

gz_fs = [f for f in os.listdir(&#39;.&#39;) if &#39;.gz&#39; in f]
for gz_f in gz_fs:
    file = gz_f.replace(&#39;.gz&#39;, &#39;&#39;)
    content = gzip.GzipFile(gz_f).read()
    with open(file, &#39;wb&#39;) as jsonf:
        jsonf.write(content)
</code></pre></div><br>
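除了先解压到磁盘再读取，也可以用 gzip.open 直接对 .gz 文件做逐行流式解析，省掉一份解压后的大文件（示意代码，`demo.json.gz` 为演示用的假设文件名，真实场景中即 GH Archive 下载得到的小时文件）：

```python
import gzip
import json

# 先构造一个两行的迷你 .json.gz 用于演示
records = [{'type': 'PushEvent', 'public': True},
           {'type': 'ForkEvent', 'public': True}]
with gzip.open('demo.json.gz', 'wt', encoding='utf-8') as f:
    for r in records:
        f.write(json.dumps(r) + '\n')

# gzip.open 以文本模式('rt')打开后可直接逐行解析，内存占用与单行大小相当
events = []
with gzip.open('demo.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        events.append(json.loads(line))

print(len(events), events[0]['type'])   # 2 PushEvent
```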
<h3 id="32-读取json">3.2 读取json</h3>
<p>因为数据文件都很大，一次性读取会很消耗时间， 推荐阅读 <a href="https://textdata.cn/blog/2023-11-17-how-handle-mega-csv-that-far-exceed-memory/"><strong>如何处理远超电脑内存的csv文件</strong></a> 。</p>
<p><strong>pd.read_json(jsonf, nrows, lines, chunksize)</strong></p>
<ul>
<li>jsonf: 文件路径</li>
<li>nrows: 读取前nrows行（仅在 lines=True 时可用）</li>
<li>lines: 以行的方式读取，默认False</li>
<li>chunksize: 分批次读取，每批次的规模是chunksize行</li>
</ul>
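其中 chunksize 最适合处理单个小时文件也放不进内存的情况。下面用一个 5 行的演示文件展示分批读取的用法（`demo.json` 为演示用的假设文件名）：

```python
import json
import pandas as pd

# 构造一个 5 行的 json lines 演示文件（真实场景替换为解压后的 GH Archive 文件）
with open('demo.json', 'w', encoding='utf-8') as f:
    for i in range(5):
        f.write(json.dumps({'id': i, 'type': 'PushEvent'}) + '\n')

# chunksize 返回迭代器，每次只把 2 行载入内存
total = 0
chunks = 0
for chunk in pd.read_json('demo.json', lines=True, chunksize=2):
    total += len(chunk)
    chunks += 1

print(total, chunks)   # 5 3
```

每个 chunk 都是普通的 DataFrame，可以在循环体内完成过滤、聚合后只保留结果，从而把内存占用限制在 chunksize 行的级别。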
<br>
<h3 id="321-读取前n行">3.2.1 读取前n行</h3>
<p>使用pandas读取 <code>2021-11-21-0.json</code>  <strong>前5条数据， 了解下数据集的字段</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;2021-11-21-0.json&#39;</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="322-查看折叠的字段">3.2.2 查看折叠的字段</h3>
<p>乍一看好像没啥数据，其实都折叠在字段之中。以actor为例，我们看看内部会折叠哪些字段</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;actor&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">array([
{&#39;id&#39;: 5355937, 
&#39;login&#39;: &#39;austinkregel&#39;, 
&#39;display_login&#39;: &#39;austinkregel&#39;, 
&#39;gravatar_id&#39;: &#39;&#39;, 
&#39;url&#39;: &#39;https://api.github.com/users/austinkregel&#39;, 
&#39;avatar_url&#39;: &#39;https://avatars.githubusercontent.com/u/5355937?&#39;},

{&#39;id&#39;: 89859977, 
&#39;login&#39;: &#39;Nicoperez19&#39;, 
&#39;display_login&#39;: &#39;Nicoperez19&#39;, 
&#39;gravatar_id&#39;: &#39;&#39;, 
&#39;url&#39;: &#39;https://api.github.com/users/Nicoperez19&#39;, 
&#39;avatar_url&#39;: &#39;https://avatars.githubusercontent.com/u/89859977?&#39;},

{&#39;id&#39;: 46858494, 
&#39;login&#39;: &#39;kapone3047&#39;, 
&#39;display_login&#39;: &#39;kapone3047&#39;, 
&#39;gravatar_id&#39;: &#39;&#39;, 
&#39;url&#39;: &#39;https://api.github.com/users/kapone3047&#39;, 
&#39;avatar_url&#39;: &#39;https://avatars.githubusercontent.com/u/46858494?&#39;},

       
 {&#39;id&#39;: 1843851, 
 &#39;login&#39;: &#39;DerekEdwards&#39;, 
 &#39;display_login&#39;: &#39;DerekEdwards&#39;, 
 &#39;gravatar_id&#39;: &#39;&#39;, 
 &#39;url&#39;: &#39;https://api.github.com/users/DerekEdwards&#39;, 
 &#39;avatar_url&#39;: &#39;https://avatars.githubusercontent.com/u/1843851?&#39;},
 
{&#39;id&#39;: 94767098, 
&#39;login&#39;: &#39;hectorapweb&#39;, 
&#39;display_login&#39;: &#39;hectorapweb&#39;, 
&#39;gravatar_id&#39;: &#39;&#39;, 
&#39;url&#39;: &#39;https://api.github.com/users/hectorapweb&#39;, 
&#39;avatar_url&#39;: &#39;https://avatars.githubusercontent.com/u/94767098?&#39;}

],dtype=object)
</code></pre></div><br>
<h3 id="323-恢复一个折叠的信息">3.2.3 恢复一个折叠的信息</h3>
<p>以actor为例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df[&#39;actor&#39;].apply(lambda x: pd.Series(x))
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
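`apply(lambda x: pd.Series(x))` 在大文件上会比较慢；pandas 自带的 json_normalize 就是专门用来展开这类嵌套字典的（下面的演示数据是按上文 actor 字段的结构仿造的）：

```python
import pandas as pd

# 仿照上文 actor 字段的结构构造两条演示记录
actors = [
    {'id': 1, 'login': 'alice', 'url': 'https://api.github.com/users/alice'},
    {'id': 2, 'login': 'bob',   'url': 'https://api.github.com/users/bob'},
]

flat = pd.json_normalize(actors)     # 每个键展开成一列
print(list(flat.columns))            # ['id', 'login', 'url']
```

对真实数据可写作 `pd.json_normalize(df['actor'].tolist())`，效果与 apply(pd.Series) 相同但快得多。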
<h3 id="324-合并结果">3.2.4 合并结果</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">_ = df[&#39;actor&#39;].apply(lambda x: pd.Series(x))
df = pd.concat([df, _], axis=1)
df
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<p>更新后的df含有的字段有</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df.columns
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Index([&#39;id&#39;, &#39;type&#39;, &#39;actor&#39;, &#39;repo&#39;, &#39;payload&#39;, &#39;public&#39;, &#39;created_at&#39;, &#39;org&#39;,
       &#39;id&#39;, &#39;login&#39;, &#39;display_login&#39;, &#39;gravatar_id&#39;, &#39;url&#39;, &#39;avatar_url&#39;],
      dtype=&#39;object&#39;)
</code></pre></div><p><br><br></p>
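注意上面的列里 id 和 url 各出现了两次（事件 id 与用户 id 同名），之后按列名取数会有歧义。合并前给展开出来的列统一加一个前缀即可避免（示意代码，`actor_` 前缀为演示用的假设）：

```python
import pandas as pd

df = pd.DataFrame({
    'id': [100, 200],                                  # 事件 id
    'actor': [{'id': 1, 'login': 'alice'},
              {'id': 2, 'login': 'bob'}],
})

# 展开 actor 并统一加 actor_ 前缀，避免与事件的 id 列撞名
actor_df = pd.json_normalize(df['actor'].tolist()).add_prefix('actor_')
df = pd.concat([df.drop(columns='actor'), actor_df], axis=1)
print(list(df.columns))   # ['id', 'actor_id', 'actor_login']
```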
<h2 id="四相关数据集">四、相关数据集</h2>
<h3 id="github-1000万用户">Github 1000万用户</h3>
<p>Gong, Q., Zhang, J., Chen, Y., Li, Q., <a href="https://research.aalto.fi/en/persons/yu-xiao">Xiao, Y.</a>, Wang, X. &amp; Hui, P., Nov 2019, <em>CIKM &lsquo;19:Proceedings of the 28th ACM International Conference on Information and Knowledge Management.</em> <a href="https://research.aalto.fi/en/datasets/a-representative-user-centric-dataset-of-10-million-github-develo#">ACM</a>, p. 1251-1260 (ACM International Conference on Information &amp; Knowledge Management).</p>
<p>使用 GitHub API，我们构建了超过 1000 万 GitHub 用户的无偏数据集。该数据收集于2018年7月20日至8月27日期间，涵盖10,649,574名用户、118,602,740次提交和20,999,258个存储库。每个数据条目都以 JSON 格式存储，代表一个 GitHub 用户，并包含用户个人资料页面中的描述信息、其提交活动以及创建/分叉的公共存储库的信息。</p>
<p><strong>数据集下载地址</strong> <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT">https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT</a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>54G数据集 | 1000万个 Github 用户数据</title>
      <link>https://textdata.cn/blog/2023-11-22-1000w-github-developer-dataset/</link>
      <pubDate>Wed, 22 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-22-1000w-github-developer-dataset/</guid>
      <description>&lt;h2 id=&#34;一github&#34;&gt;一、Github&lt;/h2&gt;
&lt;p&gt;GitHub 是一个具有代表性的开发者社区，支撑了软件的在线协作开发，吸引了全球超过 1亿开发者。GitHub 将每一次用户活动视为一个事件，例如创建新存储库、创建新分支等。GitHub 总共支持 42 种事件类型。典型的用户活动包括创建新存储库、克隆现有存储库、从 GitHub 拉取存储库的最新更改，以及提交本地所做的更改并将其推送到共享存储库。&lt;/p&gt;
&lt;p&gt;通过 GitHub，开发人员可以相互交流，通过在存储库下发布问题来分配和领取编程任务。 此外，还支持常规的“关注”功能，允许用户接收该平台上任何用户的状态更新通知。 在这些在线社区中，开发者之间的互动主要集中在协作开发和代码共享上，形成了一种特殊的社交网络。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二1000万github用户数据集&#34;&gt;二、1000万Github用户数据集&lt;/h2&gt;
&lt;h3 id=&#34;21-数据集概况&#34;&gt;2.1 数据集概况&lt;/h3&gt;
&lt;p&gt;每个 GitHub 用户都有一个数字用户 ID，该 ID 按升序分配。 用户注册越早，其用户 ID 就越小。 该研究中只考虑2017年12月31日之前注册的GitHub用户。&lt;strong&gt;为了获得无偏的用户数据集，使用基于ID的随机采样来实现数据爬取&lt;/strong&gt;。 请注意，某些数字 ID 没有对应的用户帐户，爬虫会跳过这些 ID。 对于每个用户，使用 GitHub users API (&lt;code&gt;https://api.github.com/user/ID&lt;/code&gt;) 来访问其描述信息，爬取了&lt;strong&gt;2018.6.20 ~ 2018.8.27&lt;/strong&gt;的数据。整个数据集压缩文件夹体积 5.7 G， 解压后会得到54G的 &lt;strong&gt;data.json&lt;/strong&gt; 。数据集下载地址&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-文献引用&#34;&gt;2.2 文献引用&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT&#34;&gt;该数据集&lt;/a&gt;已在网上公开，如使用该数据集，请按以下格式引用:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Qingyuan Gong, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. Proc. of the 28th ACM International Conference on Information and Knowledge Management (CIKM&amp;#39;19), Beijing, China, Nov. 2019.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-声明&#34;&gt;2.3 声明&lt;/h3&gt;
&lt;p&gt;科研用途，仅供展示；如有任何问题，加微信372335839，备注「姓名-学校-专业」&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;三数据探索&#34;&gt;三、数据探索&lt;/h2&gt;
&lt;p&gt;54G的data.json太大， 我读取了前2000行，存储到了&lt;a href=&#34;mini_data.pkl&#34;&gt;mini_data.pkl&lt;/a&gt;文件中。&lt;/p&gt;
&lt;h3 id=&#34;31-读取json&#34;&gt;3.1 读取json&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import pandas as pd

#54G的data.json太大， 我读取了前2000行
df = pd.read_json(&amp;#39;data.json&amp;#39;, nrows=2000, lines=True)
df.head()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;字段有22个&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;col&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;col&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;hirable
public_repos
is_suspicious
updated_at
id
blog
followers
location
follower_list
type
commit_list
bio
commits
company
following_list
public_gists
name
created_at
email
following
login
repo_list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-前2000条记录保存为pkl&#34;&gt;3.2 前2000条记录保存为pkl&lt;/h3&gt;
&lt;p&gt;为了不浪费你的时间，可以先下载 &lt;a href=&#34;mini_data.pkl&#34;&gt;&lt;strong&gt;mini_data.pkl&lt;/strong&gt;&lt;/a&gt;，里面存储了 data.json 中前 2000 条数据。你可以先检查这份数据，如果觉得有用，再自行下载 5.7G 的数据集压缩文件。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pickle&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data.json&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;nrows&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mini_data.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pickle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dump&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-读取pkl为df&#34;&gt;3.3 读取pkl为df&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pickle&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pickle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;loads&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mini_data.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;2000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一github">一、Github</h2>
<p>GitHub 是一个具有代表性的开发者社区，支撑着软件的在线协作开发，吸引了全球超过 1 亿开发者。GitHub 将每个用户活动视为一个事件，例如创建新存储库、新建分支等事件，总共支持 42 种事件类型。典型的用户活动包括创建新存储库、克隆现有存储库、从 GitHub 拉取存储库的最新更改，以及提交本地更改并将其推送到共享存储库。</p>
<p>在 GitHub 上，开发人员可以相互交流，通过在存储库下发布 issue 来分配和领取编程任务。此外，平台还支持常规的"关注"功能，允许用户接收平台上任何用户的状态更新通知。在这些在线社区中，开发者之间的互动主要集中在协作开发和代码共享上，形成了一种特殊的社交网络。</p>
<p><br><br></p>
<h2 id="二1000万github用户数据集">二、1000万Github用户数据集</h2>
<h3 id="21-数据集概况">2.1 数据集概况</h3>
<p>每个 GitHub 用户都有一个数字用户 ID，该 ID 按升序分配，用户注册越早，其用户 ID 就越小。该研究只考虑 2017 年 12 月 31 日之前注册的 GitHub 用户。<strong>为了获得无偏的用户数据集，使用基于 ID 的随机采样来实现数据爬取</strong>。请注意，某些数字 ID 没有对应的用户账户，爬虫会跳过这些 ID。对于每个用户，研究者使用 GitHub users API (<code>https://api.github.com/user/ID</code>) 获取其描述信息，爬取时间为 <strong>2018.6.20 ~ 2018.8.27</strong>。整个数据集压缩包体积 5.7G，解压后得到 54G 的 <strong>data.json</strong>。数据集下载地址：</p>
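<p>基于 ID 的随机采样思路可以用几行 Python 示意（示意代码：采样上限 40_000_000、函数名均为演示而设，并非原研究实现；真实爬取时还需跳过 API 返回 404 的 ID）。</p>

```python
import random

def sample_user_ids(max_id, n, seed=42):
    # 在 [1, max_id] 范围内无放回地随机抽取 n 个数字用户 ID
    rng = random.Random(seed)
    return rng.sample(range(1, max_id + 1), n)

def user_api_url(user_id):
    # 按 GitHub users API 的形式拼出每个 ID 对应的请求地址
    return f'https://api.github.com/user/{user_id}'

ids = sample_user_ids(max_id=40_000_000, n=5)
urls = [user_api_url(i) for i in ids]
```

<p>由于 ID 是均匀随机抽取的，得到的用户样本对注册时间没有偏向，这正是"无偏的用户数据集"的含义。</p>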
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT
</code></pre></div><br>
<h3 id="22-文献引用">2.2 文献引用</h3>
<p><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T6ZRJT">该数据集</a>已在网上公开，如使用该数据集，请按以下格式引用：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Qingyuan Gong, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. Proc. of the 28th ACM International Conference on Information and Knowledge Management (CIKM&#39;19), Beijing, China, Nov. 2019.
</code></pre></div><br>
<h3 id="23-声明">2.3 声明</h3>
<p>科研用途，仅供展示；如有任何问题，加微信372335839，备注「姓名-学校-专业」</p>
<br>
<br>
<h2 id="三数据探索">三、数据探索</h2>
<p>54G的data.json太大， 我读取了前2000行，存储到了<a href="mini_data.pkl">mini_data.pkl</a>文件中。</p>
<h3 id="31-读取json">3.1 读取json</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import pandas as pd

#54G的data.json太大， 我读取了前2000行
df = pd.read_json(&#39;data.json&#39;, nrows=2000, lines=True)
df.head()
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<p>字段有22个</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">hirable
public_repos
is_suspicious
updated_at
id
blog
followers
location
follower_list
type
commit_list
bio
commits
company
following_list
public_gists
name
created_at
email
following
login
repo_list
</code></pre></div><br>
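<p>顺带一提，完整的 54G data.json 无法一次读入内存；pandas 的 chunksize 参数可以分块迭代处理。下面用一个临时生成的小 JSON Lines 文件演示流程（字段为假设，仅作示意）。</p>

```python
import json
import os
import tempfile

import pandas as pd

# 用一个小的 JSON Lines 文件代替 54G 的 data.json 做演示
rows = [{'id': i, 'followers': i * 10} for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'w') as f:
    for row in rows:
        f.write(json.dumps(row) + '\n')

# chunksize 让 read_json 返回一个可迭代的 reader，逐块处理、内存占用可控
total = 0
for chunk in pd.read_json(path, lines=True, chunksize=4):
    total += len(chunk)
```

<p>把 path 换成真实的 data.json 路径，在循环体内做统计或过滤，即可在普通机器上遍历完整数据集。</p>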
<h3 id="32-前2000条记录保存为pkl">3.2 前2000条记录保存为pkl</h3>
<p>为了不浪费你的时间，可以先下载 <a href="mini_data.pkl"><strong>mini_data.pkl</strong></a>，里面存储了 data.json 中前 2000 条数据。你可以先检查这份数据，如果觉得有用，再自行下载 5.7G 的数据集压缩文件。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pickle</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;data.json&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">2000</span><span class="p">,</span> <span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;mini_data.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div><br>
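<p>此外，pandas 自带的 to_pickle / read_pickle 也能完成同样的存取，不必手动调用 pickle 模块（示意代码，df 用一个小 DataFrame 代替）。</p>

```python
import pandas as pd

# 构造一个小 DataFrame 演示（真实场景中即前 2000 行的 df）
df = pd.DataFrame({'id': [1, 2, 3], 'login': ['a', 'b', 'c']})

df.to_pickle('mini_data.pkl')          # 保存
df2 = pd.read_pickle('mini_data.pkl')  # 读取
```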
<h3 id="33-读取pkl为df">3.3 读取pkl为df</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pickle</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s1">&#39;mini_data.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">2000
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>TechWeekly-20 每周有趣有用的技术分享</title>
      <link>https://textdata.cn/blog/techweekly20/</link>
      <pubDate>Wed, 22 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/techweekly20/</guid>
      <description>&lt;h2 id=&#34;开源chatpdf&#34;&gt;开源chatPDF&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/Anil-matcha/ChatPDF&#34;&gt;https://github.com/Anil-matcha/ChatPDF&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;用不到 10 行 Python 代码创建本地&lt;a href=&#34;https://www.chatpdf.com/&#34;&gt;ChatPDF&lt;/a&gt; 或 &lt;a href=&#34;https://pdf.ai/&#34;&gt;PDF.ai等应用程序&lt;/a&gt;。即时答案。使用 AI 提出问题、提取信息并总结文档。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chatglm&#34;&gt;chatGLM&lt;/h2&gt;
&lt;p&gt;ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于&lt;a href=&#34;https://github.com/THUDM/GLM&#34;&gt;通用语言模型（GLM）&lt;/a&gt;架构，拥有 62 亿参数。结合模型量化技术，用户可以在消费级的显卡上进行本地部署（INT4 量化级别下最低只需 6GB 显存）。ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。&lt;/p&gt;
&lt;p&gt;大邓经过测试，基本可以本地运行，如果能与chatPDF 结合使用， 可以大大减轻科研工作者每日阅读量。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;xagent&#34;&gt;XAgent&lt;/h2&gt;
&lt;p&gt;XAgent 是一个开源的、基于大型语言模型（LLM）的自主智能体，可以自动解决各种任务，被设计为适用于各种场景的通用智能体。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/overview_xagent.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;langchain-chatchat&#34;&gt;Langchain-Chatchat&lt;/h2&gt;
&lt;p&gt;基于 Langchain 与 ChatGLM 等语言模型的本地知识库问答&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/chatchat-space/Langchain-Chatchat&#34;&gt;https://github.com/chatchat-space/Langchain-Chatchat&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/langchain&amp;#43;chatglm.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/init_knowledge_base.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
      <content:encoded><![CDATA[<h2 id="开源chatpdf">开源chatPDF</h2>
<p><a href="https://github.com/Anil-matcha/ChatPDF">https://github.com/Anil-matcha/ChatPDF</a></p>
<p>用不到 10 行 Python 代码创建本地<a href="https://www.chatpdf.com/">ChatPDF</a> 或 <a href="https://pdf.ai/">PDF.ai等应用程序</a>。即时答案。使用 AI 提出问题、提取信息并总结文档。</p>
<p><br><br></p>
<h2 id="chatglm">chatGLM</h2>
<p>ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于<a href="https://github.com/THUDM/GLM">通用语言模型（GLM）</a>架构，拥有 62 亿参数。结合模型量化技术，用户可以在消费级的显卡上进行本地部署（INT4 量化级别下最低只需 6GB 显存）。ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。</p>
<p>大邓经过测试，基本可以本地运行，如果能与chatPDF 结合使用， 可以大大减轻科研工作者每日阅读量。</p>
<p><br><br></p>
<h2 id="xagent">XAgent</h2>
<p>XAgent 是一个开源的、基于大型语言模型（LLM）的自主智能体，可以自动解决各种任务，被设计为适用于各种场景的通用智能体。</p>
<p><img loading="lazy" src="img/overview_xagent.png" alt=""  />
</p>
<br>
<br>
<h2 id="langchain-chatchat">Langchain-Chatchat</h2>
<p>基于 Langchain 与 ChatGLM 等语言模型的本地知识库问答</p>
<p><a href="https://github.com/chatchat-space/Langchain-Chatchat">https://github.com/chatchat-space/Langchain-Chatchat</a></p>
<p><img loading="lazy" src="img/langchain&#43;chatglm.png" alt=""  />
</p>
<p><img loading="lazy" src="img/init_knowledge_base.jpg" alt=""  />
</p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>以聚类为例 | 使用大语言模型LLM做文本分析</title>
      <link>https://textdata.cn/blog/2023-11-20-how-to-use-llms-tobuild-better-clustering-models/</link>
      <pubDate>Mon, 20 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-20-how-to-use-llms-tobuild-better-clustering-models/</guid>
      <description>&lt;p&gt;本文主要分享&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;传统聚类算法&lt;/li&gt;
&lt;li&gt;LLM与嵌入算法&lt;/li&gt;
&lt;li&gt;嵌入算法聚类&lt;/li&gt;
&lt;li&gt;启发：LLM 的其他用法&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;p&gt;聚类是一种无监督机器学习技术，旨在把特征相似的数据点分到同一簇。聚类有助于解决各种问题，例如客户细分、异常检测和文本分类等。尽管传统的聚类技术被广泛使用，但它仍然面临挑战。今天代码很少，也没有实验数据，主要是偏思路分享。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一编码挑战&#34;&gt;一、编码挑战&lt;/h2&gt;
&lt;h3 id=&#34;11--字段单位不统一&#34;&gt;1.1  字段单位不统一&lt;/h3&gt;
&lt;p&gt;我想在本文中解决的主要挑战，是如何编码或转换输入特征。一般来说，您需要把所有特征转换到相同的尺度，否则聚类模型会在特征之间分配不成比例的权重。例如，假设数据中有 &lt;strong&gt;weight1&lt;/strong&gt;、&lt;strong&gt;weight2&lt;/strong&gt; 两个重量字段，weight1 单位是市斤，而 weight2 单位是公斤。如果不先对这些测量做标准化，即使物体的实际重量相同，模型也会认为以市斤度量的差异大于以公斤度量的差异。&lt;/p&gt;
&lt;p&gt;现实中，数据集一般不会对同一信息使用两种单位度量。使用这个例子，只是为了说明不同字段分布不同，训练模型时不同字段承载的权重也不一样。为了缓解这个问题，一般在训练之前先将字段标准化。&lt;/p&gt;
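标准化这一步可以用几行 numpy 示意（示意代码：weight1、weight2 的数值为虚构）。z-score 标准化后，市斤与公斤度量的同一组重量会得到完全相同的表示。

```python
import numpy as np

# 虚构数据：同一组物体的重量，分别以市斤、公斤记录（1 公斤 = 2 市斤）
weight1 = np.array([100.0, 120.0, 140.0])  # 市斤
weight2 = np.array([50.0, 60.0, 70.0])     # 公斤

def zscore(x):
    # 标准化到均值 0、标准差 1，消除量纲带来的尺度差异
    return (x - x.mean()) / x.std()

z1 = zscore(weight1)
z2 = zscore(weight2)
# z1 与 z2 完全相同：标准化消除了单位差异
```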
&lt;br&gt;
&lt;h3 id=&#34;12-字段之间存在相关性&#34;&gt;1.2 字段之间存在相关性&lt;/h3&gt;
&lt;p&gt;让我们使用颜色组成的特征作为另一个示例。通常，许多人会选择将此特征 one-hot 编码到 n-1 个附加列中，其中 n 是唯一颜色的数量。虽然这有效，但它忽略了颜色之间的任何潜在关系。&lt;/p&gt;
&lt;p&gt;为什么是这样？让我们考虑数据集中的一个特征具有以下颜色：红色、栗色、深红色、猩红色和绿色。如果我们要对该列进行 one-hot 编码，我们将得到一个如下所示的数据帧：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-color.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;在 &lt;strong&gt;欧几里德距离空间&lt;/strong&gt; 中，任意两个记录(行)之间的距离是相同的。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import numpy as np

def euclidean_distance(vec1, vec2):
    if len(vec1) != len(vec2):
        raise ValueError(&amp;#34;vecs must have the same length.&amp;#34;)
        
    squared_differences = [(a - b) ** 2 for a, b in zip(vec1, vec2)]
    distance = np.sqrt(sum(squared_differences))
    return distance
    
red = np.array([0, 0, 0, 1, 0])
maroon = np.array([0, 0, 1, 0, 0])
green = np.array([0, 1, 0, 0, 0])

print(euclidean_distance(red, maroon))
print(euclidean_distance(red, green))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;1.4142135623730951
1.4142135623730951
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;二有更好的办法吗&#34;&gt;二、有更好的办法吗？&lt;/h2&gt;
&lt;p&gt;当然，&lt;strong&gt;红色&lt;/strong&gt; 和 &lt;strong&gt;栗色&lt;/strong&gt; 是两种不同的颜色，但对聚类算法而言，我们并不希望 euclidean_distance(red, maroon) 与 euclidean_distance(red, green) 相等。&lt;/p&gt;
&lt;p&gt;那么该如何解决这个缺点呢？&lt;/p&gt;
&lt;p&gt;如果您读过这篇文章的标题，相信您已经 get 到本文的 idea……我们将借助&lt;strong&gt;大语言模型&lt;/strong&gt;（Large Language Model, LLM），把每条记录的字段和数值整理成一个字符串，并通过 LLM 获得每条记录对应的嵌入表示。&lt;/p&gt;
&lt;p&gt;对于此示例，我将使用 Huggingface 的 sentence-transformers 库，以及我围绕求职申请合成创建的数据集。&lt;/p&gt;
&lt;p&gt;先从 sentence transformer 说起。该模型的工作原理与 BERT 类似，只不过它经过专门训练，在句子级别而非单词或 token 级别输出嵌入。这些句子级嵌入能更好地捕获含义，并且计算速度更快。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sentence_transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SentenceTransformer&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sentence_transformers.util&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cos_sim&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#使用hugginface，需要科学上网&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SentenceTransformer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;sentence-transformers/paraphrase-MiniLM-L6-v2&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;prompt_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#每条记录整合为一个字符串&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;p_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Age: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; Gender: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Gender&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lower&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; Role: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Hiring Department: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HiringDepartment&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Travel Preference: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;TravelPreference&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; Extracurriculars: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ExtraCurriculars&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Distance From Home: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;DistanceFromHome&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Internships: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Internships&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; Education Level: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;EducationLevel&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; Education Field: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;EducationField&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;
        &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Summary: &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Summary&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt; 
    &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;p_text&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;output_embedding&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#返回的嵌入表示的尺寸(记录数, 384)&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#sentence-transformers/paraphrase-MiniLM-L6-v2 模型的词向量维度是384&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;embd&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;encode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;embd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reshape&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;384&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;preprocess_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prompt_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;embd&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;output_embedding&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;embd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;combined_text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preprocess_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;我们的数据集包括有关求职者的信息，例如招聘部门、职位、年龄和教育水平等特征。这是一个数据截图：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;我们的目标是将所有求职者分为不同的簇(可以理解为群体)。&lt;/p&gt;
&lt;p&gt;让我们看看如何为每个求职者生成句子嵌入。第一步是把所有特征拼接成一个字符串，构造出单条文本 prompt。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Age: 28.
Gender: male.
Role: Research Scientist.
Hiring Department: Research &amp;amp; Development.
Travel Preference: Travel_Frequently.
Extracurriculars: nan.
Distance From Home: 4.
Internships: 9.
Education Level: 3.
Education Field: Engineering.
Summary: As you can see, I am very dedicated and I am ready to start at your firm immediately.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;将原记录（行）转为如上所示的文本，之后调用 SBERT LLM 获取文本对应的嵌入向量。为方便展示，这里使用 dataframe.style 功能高亮低值和高值，使表格更容易浏览：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df-style.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;三用嵌入编码有什么益处&#34;&gt;三、用嵌入编码有什么益处？&lt;/h2&gt;
&lt;p&gt;之前讲了传统聚类算法使用 one-hot 编码的不足，但还没有解释嵌入表示的益处。先不讲理论，延续颜色编码的思路，我们看一个例子。我想测量 &lt;strong&gt;Role&lt;/strong&gt;（岗位角色）之间的相似程度，这里更倾向于用余弦相似度而不是欧几里德距离。两者的差异是？&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;欧几里得距离&lt;/strong&gt; 是两点之间几何距离的度量，而 &lt;strong&gt;余弦相似度&lt;/strong&gt; 度量向量的方向。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;欧几里得距离对向量的大小敏感&lt;/strong&gt;，而余弦相似度则不然。&lt;/li&gt;
&lt;li&gt;欧氏距离的值范围从 0（相同向量）到无穷大，而 &lt;strong&gt;余弦相似度的范围从 -1（完全不相似）到 1（完全相似）&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;让我们选择两个岗位角色：&lt;strong&gt;销售代表&lt;/strong&gt;（sales representative）和&lt;strong&gt;销售主管&lt;/strong&gt;(sales executive)。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;使用 one-hot 编码的 销售代表 和 销售主管 的余弦相似度为 0.5，这意味着他们&lt;strong&gt;有些相关&lt;/strong&gt;。这是有道理的，因为他们都是销售角色。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;使用嵌入编码的余弦相似度为 0.82。&lt;strong&gt;它们的相关性要高得多&lt;/strong&gt;。这更有意义，因为销售代表和销售主管在实践中是极其相似的角色。&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
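余弦相似度本身可以用几行代码示意（示意代码；文中 0.5 与 0.82 的具体数值依赖原文作者的编码与嵌入模型，此处不复现）。注意纯 one-hot 向量两两正交，余弦相似度恰为 0。

```python
import numpy as np

def cosine_sim(a, b):
    # 余弦相似度只度量方向，对向量的大小不敏感
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

red = np.array([0, 0, 0, 1, 0])
maroon = np.array([0, 0, 1, 0, 0])

same = cosine_sim(red, red)            # 1.0，方向完全相同
one_hot_sim = cosine_sim(red, maroon)  # 0.0，one-hot 向量两两正交
```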
&lt;h3 id=&#34;31--传统的聚类&#34;&gt;3.1  传统的聚类&lt;/h3&gt;
&lt;p&gt;传统聚类算法的大致流程如下图所示：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/06-traditional-process.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;原文作者实验使用K=3的聚类算法，但k如何设置不是最关键的点。 我们的聚类模型中最重要的字段是求职者的&lt;strong&gt;个人总结&lt;/strong&gt;（Summary），其次是 &lt;strong&gt;招聘部门&lt;/strong&gt;（HiringDepartment）、&lt;strong&gt;是否喜欢旅行&lt;/strong&gt;(TravelPreference)。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-weight.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;为了更好地理解 3 个簇，我们输出了数据汇总：每个数值字段的平均值及非数值字段的高频项。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-stats.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;按道理，聚类结果应该让不同簇之间的差异尽可能大。糟糕的是，不同簇之间的年龄（Age）、实习次数（Internships）差异很小；更糟糕的是，招聘部门（HiringDepartment）和岗位角色（Role）的高频项完全相同。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-嵌入的聚类&#34;&gt;3.2 嵌入的聚类&lt;/h3&gt;
&lt;p&gt;使用嵌入编码的聚类算法流程如下图所示。与传统聚类方法相比，使用嵌入的流程只需处理数值特征，因为由求职者提示文本（代码里的 prompt_text）转化来的嵌入是纯数值的。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/07-emb-process.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
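嵌入矩阵是纯数值特征，可以直接做 K-Means。下面是一个不依赖 sklearn 的极简 K-Means 示意（假设：用两团人造向量代替真实的 384 维句向量；迭代次数固定，未做收敛判断）。

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # 极简 K-Means：随机选 k 个样本作初始质心，交替执行分配与更新
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 分配：每个样本归入最近的质心
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 更新：质心取簇内均值（空簇保持不变）
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = X[mask].mean(axis=0)
    return labels, centers

# 两团相距很远的人造"嵌入"向量，模拟两类差异明显的求职者
X = np.vstack([np.zeros((5, 4)), np.full((5, 4), 10.0)])
labels, centers = kmeans(X, k=2)
```

真实流程中，把 X 换成上文 output_embedding 得到的 (记录数, 384) 矩阵即可。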
&lt;p&gt;在这里，我们不能像上次那样直接计算字段重要性：嵌入带来了数百个难以解读、重要性各异的特征。那么该怎么办？我们可以再训练一个模型（这次是有监督的三分类模型），用原始特征集去预测嵌入聚类产生的簇标签，这样就能以可比的方式还原字段重要性。结果如下&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/08-weight.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;我们用嵌入表示重新编码了求职者信息，并计算出了新的聚类结果。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/09-statas.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;从统计信息（上图）中可以看出，不同簇之间的差异变得更加清晰。使用嵌入编码后，更多申请销售岗位的销售主管被划分到 cluster2，更多申请研发岗位的科学家被划分到 cluster1 和 cluster3。&lt;/p&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;前文内容翻译整理自&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://medium.com/@swansburg.justin/how-to-use-llms-to-build-better-clustering-models-9b17a5491bb4&#34;&gt;https://medium.com/@swansburg.justin/how-to-use-llms-to-build-better-clustering-models-9b17a5491bb4&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四启发&#34;&gt;四、启发&lt;/h2&gt;
&lt;p&gt;读完以上内容，大邓想到一个问题， 假设 没有简历系统，没有大数据，求职者与面试官坐在现场，  数据就是面试过程中的交流， 而交流必然通过话语这一媒介。 例如求职者的个人信息&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;“大家好，我叫张三， 今年24岁，哈尔滨人。本科毕业于哈尔滨工业大学，市场营销专业。 我是一个很外向的人，对销售很感兴趣，在大学期间摆了很多地摊。很希望获得贵公司的机会，让我在营销岗位上大发异彩。”
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;面试期间，记录人员将张三的个人信息整理为&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;name: 张三
age: 24
city: 哈尔滨
edu: 哈尔滨工业大学
major: 市场营销
experience: 摆摊
summary: 我是外向的人，对销售很感兴趣。
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;求职者的信息汇总成 xlsx 后，每个人的信息都或多或少地被压缩了。这种表示方式在小规模时还可以：求职者的总结 summary 信息量很大，能让面试官回忆起当时的场景。但当求职者规模上升到几千上万，summary、note 这类很重要的备注信息反而无法利用。&lt;/p&gt;
&lt;p&gt;使用大语言模型LLM，将文本提示转化为嵌入表示。我们可以将LLM看成是一个察言观色，见微知著，明察秋毫的智者。  这个智者可以&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;分类&lt;/li&gt;
&lt;li&gt;提取信息&lt;/li&gt;
&lt;li&gt;补全&lt;/li&gt;
&lt;li&gt;相似性&lt;/li&gt;
&lt;li&gt;&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;以往处理缺失数据要用插值或其他技巧；现在我们可以借助 LLM，只要其他字段残存微弱线索，LLM 就能帮我们补全缺失值。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;41-分类&#34;&gt;4.1 分类&lt;/h3&gt;
&lt;p&gt;如图所示， 对于很多短文本， 我们可以推断话题，也可以推断情绪。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://huggingface.co/morit/chinese_xlm_xnli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/10-classification.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/11-classification.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h3 id=&#34;42-提取信息&#34;&gt;4.2 提取信息&lt;/h3&gt;
&lt;p&gt;假设有一些信息存储在文本中，通常可以用正则表达式提取；但像下面的例子，正则会很难设计，用 LLM 则很简单。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/12-extract.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-补全&#34;&gt;4.3 补全&lt;/h3&gt;
&lt;p&gt;填充缺失值信息&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/13-mask.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;44-相似性&#34;&gt;4.4 相似性&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/14-sim.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;当然LLM功能还有很多，大家可以自己探索探索&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/15-func.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>本文主要分享</p>
<ol>
<li>传统聚类算法</li>
<li>LLM与嵌入算法</li>
<li>嵌入算法聚类</li>
<li>启发：LLM的其他用法</li>
</ol>
<br>
<p>聚类是一种无监督机器学习技术，旨在根据特征的相似性将数据点分组成簇。聚类有助于解决各种问题，例如客户细分、异常检测和文本分类等。尽管传统的聚类技术被广泛使用，但它仍然面临挑战。今天代码很少，也没有实验数据，主要偏思路分享。</p>
<p><br><br></p>
<h2 id="一编码挑战">一、编码挑战</h2>
<h3 id="11--字段单位不统一">1.1  字段单位不统一</h3>
<p>本文要解决的主要挑战，是如何编码或转换输入特征。一般来说，您需要将每个特征转换到相同的比例，否则聚类模型会在特征之间分配不成比例的权重。例如，假设数据中有 <strong>weight1</strong>、<strong>weight2</strong> 两个重量字段，weight1 单位是市斤，而 weight2 单位是公斤。如果不先对这些测量进行标准化，即使实际重量相同，模型也会认为以市斤为单位测量的重量差异大于以公斤为单位的差异（同一物体的市斤数值是公斤数值的两倍）。</p>
<p>现实中，数据集一般不会对同一信息使用两种单位进行度量。举这个例子，只为说明数据中不同字段的分布不同，训练模型时不同字段承载的权重也不一样。为了缓解这个问题，一般在训练之前先将字段标准化。</p>
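<p>上面的标准化思路可以用一小段示意代码验证（weight1、weight2 的数值均为假设）：对同一批物体，公斤与市斤两列经 z-score 标准化后完全相同，模型便不再被单位差异误导。</p>

```python
# 示意：同一批物体分别用公斤、市斤记录，z-score 标准化后两列完全一致
def zscore(values):
    # 对一列数值做 z-score 标准化：减去均值，再除以标准差
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

kg = [50.0, 60.0, 70.0]        # weight2：公斤（假设数据）
jin = [v * 2 for v in kg]      # weight1：市斤，数值是公斤的两倍

print(zscore(kg))
print(zscore(jin))  # 与 zscore(kg) 完全相同
```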
<br>
<h3 id="12-字段之间存在相关性">1.2 字段之间存在相关性</h3>
<p>让我们使用颜色组成的特征作为另一个示例。通常，许多人会选择将此特征 one-hot 编码到 n-1 个附加列中，其中 n 是唯一颜色的数量。虽然这有效，但它忽略了颜色之间的任何潜在关系。</p>
<p>为什么这样说？考虑数据集中的一个特征包含以下颜色：红色、栗色、深红色、猩红色和绿色。如果对该列进行 one-hot 编码，我们将得到如下所示的数据框：</p>
<p><img loading="lazy" src="img/01-color.png" alt=""  />
</p>
<p>在 <strong>欧几里德距离空间</strong> 中，任意两个记录(行)之间的距离是相同的。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import numpy as np

def euclidean_distance(vec1, vec2):
    if len(vec1) != len(vec2):
        raise ValueError(&#34;vecs must have the same length.&#34;)
        
    squared_differences = [(a - b) ** 2 for a, b in zip(vec1, vec2)]
    distance = np.sqrt(sum(squared_differences))
    return distance
    
red = np.array([0, 0, 0, 1, 0])
maroon = np.array([0, 0, 1, 0, 0])
green = np.array([0, 1, 0, 0, 0])

print(euclidean_distance(red, maroon))
print(euclidean_distance(red, green))
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">1.4142135623730951
1.4142135623730951
</code></pre></div><br>
<h2 id="二有更好的办法吗">二、有更好的办法吗？</h2>
<p>当然，<strong>红色</strong> 和 <strong>栗色</strong> 是两种不同的颜色，但红色与栗色明显比红色与绿色更接近。对聚类算法而言，我们其实不希望 euclidean_distance(red, maroon) 与 euclidean_distance(red, green) 相等。</p>
<p>那么该如何解决这个缺点呢？</p>
<p>如果您读过本文标题，相信您已经 get 到本文的 idea……我们将结合 <strong>大语言模型</strong> (Large Language Model, LLM)，将每条记录的字段和数值整理成一个字符串，并通过 LLM 获得每条记录对应的嵌入表示。</p>
<p>对于此示例，我将使用 Huggingface 的 sentence-transformers 库，以及我围绕求职申请合成的数据集。</p>
<p>先从 sentence-transformers 说起。该模型的工作原理与 BERT 类似，只不过它经过专门训练，在句子级别而不是单词或 token 级别输出嵌入。这些句子级嵌入可以更好地捕获含义，并且计算速度更快。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="kn">from</span> <span class="nn">sentence_transformers.util</span> <span class="kn">import</span> <span class="n">cos_sim</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> <span class="c1">#后文 output_embedding 用到 pd.DataFrame，补上缺失的导入</span>

<span class="c1">#使用hugginface，需要科学上网</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;sentence-transformers/paraphrase-MiniLM-L6-v2&#34;</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">prompt_text</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="c1">#每条记录整合为一个字符串</span>
    <span class="n">p_text</span> <span class="o">=</span> <span class="p">(</span>
        <span class="sa">f</span><span class="s2">&#34;Age: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;Age&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> Gender: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;Gender&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="si">}</span><span class="s2"> Role: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;Role&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> &#34;</span>
        <span class="sa">f</span><span class="s2">&#34;Hiring Department: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;HiringDepartment&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> &#34;</span>
        <span class="sa">f</span><span class="s2">&#34;Travel Preference: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;TravelPreference&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> Extracurriculars: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;ExtraCurriculars&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> &#34;</span>
        <span class="sa">f</span><span class="s2">&#34;Distance From Home: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;DistanceFromHome&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> &#34;</span>
        <span class="sa">f</span><span class="s2">&#34;Internships: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;Internships&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> Education Level: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;EducationLevel&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> Education Field: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;EducationField&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> &#34;</span>
        <span class="sa">f</span><span class="s2">&#34;Summary: </span><span class="si">{</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;Summary&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span> 
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">p_text</span>

<span class="k">def</span> <span class="nf">output_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#返回的嵌入表示的尺寸(记录数, 384)</span>
    <span class="c1">#sentence-transformers/paraphrase-MiniLM-L6-v2 模型的词向量维度是384</span>
    <span class="n">embd</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">embd</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">384</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">preprocess_text</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">prompt_text</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">embd</span> <span class="o">=</span> <span class="n">output_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">embd</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;combined_text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">preprocess_text</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p>我们的数据集包括有关求职者的信息，例如招聘部门、职位、年龄和教育水平等特征。这是一个数据截图：</p>
<p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<p>我们的目标是将所有求职者分为不同的簇(可以理解为群体)。</p>
<p>让我们看看如何将句子嵌入应用于每个求职者。第一步是将所有特征拼接成一个字符串，创建单条文本 prompt。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Age: 28.
Gender: male.
Role: Research Scientist.
Hiring Department: Research &amp; Development.
Travel Preference: Travel_Frequently.
Extracurriculars: nan.
Distance From Home: 4.
Internships: 9.
Education Level: 3.
Education Field: Engineering.
Summary: As you can see, I am very dedicated and I am ready to start at your firm immediately.
</code></pre></div><p>将原记录(行)转为如上所示的文本后，调用 SBERT 模型检索文本对应的嵌入向量。为方便展示，这里使用 dataframe.style 功能突出显示低值和高值，使表格更易浏览：</p>
<p><img loading="lazy" src="img/03-df-style.png" alt=""  />
</p>
<br>
<h2 id="三用嵌入编码有什么益处">三、用嵌入编码有什么益处？</h2>
<p>之前讲了传统聚类算法使用 one-hot 编码方式的不足，但还没解释用嵌入表示的益处。先不讲理论，就像前面探索颜色编码一样，我们看一个例子。我想测量 <strong>Role</strong> (岗位角色) 的相似程度，并且更倾向于用余弦相似度而不是欧几里德距离。这两者的差异是什么？</p>
<ul>
<li><strong>欧几里得距离</strong> 是两点之间几何距离的度量，而 <strong>余弦相似度</strong> 度量向量的方向。</li>
<li><strong>欧几里得距离对向量的大小敏感</strong>，而余弦相似度则不然。</li>
<li>欧氏距离的值范围从 0（相同向量）到无穷大，而 <strong>余弦相似度的范围从 -1（方向完全相反）到 1（方向完全相同）</strong></li>
</ul>
<p>让我们选择两个岗位角色：<strong>销售代表</strong>（sales representative）和<strong>销售主管</strong>(sales executive)。</p>
<ul>
<li>
<p>使用 one-hot 编码的 销售代表 和 销售主管 的余弦相似度为 0.5，这意味着他们<strong>有些相关</strong>。这是有道理的，因为他们都是销售角色。</p>
</li>
<li>
<p>使用嵌入编码的余弦相似度为 0.82。<strong>它们的相关性要高得多</strong>。这更有意义，因为销售代表和销售主管在实践中是极其相似的角色。</p>
</li>
</ul>
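<p>余弦相似度本身只需几行代码。下面是一个最小示意（其中的向量为演示数据，并非文中真实的 one-hot 列或嵌入向量）：单列 one-hot 下，两个不同类别的向量正交、相似度为 0；而语义相近的嵌入向量方向接近，相似度接近 1。文中 0.5、0.82 的数值来自原文作者的完整特征编码。</p>

```python
import math

def cosine_similarity(v1, v2):
    # 余弦相似度 = 点积 / (模长乘积)，只衡量方向，不受向量大小影响
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# one-hot（演示数据）：不同类别的向量正交，相似度为 0
rep_onehot = [1, 0, 0]
exec_onehot = [0, 1, 0]
print(cosine_similarity(rep_onehot, exec_onehot))  # 0.0

# 嵌入（演示数据）：语义相近的角色方向接近，相似度接近 1
rep_emb = [0.8, 0.5, 0.1]
exec_emb = [0.7, 0.6, 0.2]
print(cosine_similarity(rep_emb, exec_emb))
```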
<h3 id="31--传统的聚类">3.1  传统的聚类</h3>
<p>传统聚类算法大致流程如下图所示，</p>
<p><img loading="lazy" src="img/06-traditional-process.png" alt=""  />
</p>
<p>原文作者实验使用K=3的聚类算法，但k如何设置不是最关键的点。 我们的聚类模型中最重要的字段是求职者的<strong>个人总结</strong>（Summary），其次是 <strong>招聘部门</strong>（HiringDepartment）、<strong>是否喜欢旅行</strong>(TravelPreference)。</p>
<p><img loading="lazy" src="img/04-weight.png" alt=""  />
</p>
<br>
<p>为了更好地理解这 3 个簇，我们输出了数据汇总：每个数值字段的平均值及非数值字段的高频项。</p>
<p><img loading="lazy" src="img/05-stats.png" alt=""  />
</p>
<p>按道理，聚类结果应使不同簇之间的差异尽可能大。糟糕的是，不同簇之间的年龄(Age)、实习次数(Internships)差异很小，更糟糕的是招聘部门(HiringDepartment)和岗位角色(Role)完全相同。</p>
<br>
<h3 id="32-嵌入的聚类">3.2 嵌入的聚类</h3>
<p>使用嵌入编码的聚类算法流程如下图所示。与传统聚类方法相比，使用嵌入的流程只需处理数值特征，因为由求职者提示信息（代码里的 prompt_text）转化来的嵌入是纯数值的。</p>
<p><img loading="lazy" src="img/07-emb-process.png" alt=""  />
</p>
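<p>得到嵌入矩阵后，聚类这一步与传统流程并无不同。下面是一个最小示意（假设已安装 scikit-learn；embeddings 这里用随机矩阵代替，实际应为上文 SBERT 输出的 (记录数, 384) 矩阵；K=3 沿用原文设定）：</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# 演示用随机矩阵代替真实嵌入，实际应为 (记录数, 384) 的 SBERT 输出
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(30, 8))

# K=3 沿用原文设定
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(labels[:10])  # 每条记录对应的簇编号 0/1/2
```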
<p>在这里，我们不能像上次那样直接计算字段重要性：嵌入有数百个维度，各维度的含义难以解读。那么该怎么办？可以再训练一个模型（这次是有监督的三分类模型），用原始特征集来预测嵌入聚类产生的簇标签，这样就能以可比的方式重现字段重要性。结果如下</p>
<p><img loading="lazy" src="img/08-weight.png" alt=""  />
</p>
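<p>上述“再训练一个有监督模型来反推字段重要性”的代理思路，可以用随机森林做一个最小示意（X_raw、cluster_labels 均为假设的演示数据，实际应为原始特征表和嵌入聚类得到的簇标签；需安装 scikit-learn）：</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 构造演示数据：4 个原始特征，簇标签只由前两个特征决定
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 4))
cluster_labels = (X_raw[:, 0] > 0).astype(int) + (X_raw[:, 1] > 0.5).astype(int)

# 用原始特征预测簇标签，再读取特征重要性
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_raw, cluster_labels)
importances = clf.feature_importances_
print(importances)  # 前两个特征的重要性应明显高于后两个噪声特征
```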
<p>至此，我们找到了一种新的嵌入表示来编码求职者信息，并计算出了聚类结果。</p>
<p><img loading="lazy" src="img/09-statas.png" alt=""  />
</p>
<p>从统计信息(上图)中可以看出，不同簇之间的差异变得更加清晰。使用嵌入编码后，更多申请销售岗位的销售主管被划分到 cluster2，更多申请研发岗位的科学家被划分到 cluster1 和 cluster3。</p>
<br>
<blockquote>
<p>前文内容翻译整理自</p>
<p><a href="https://medium.com/@swansburg.justin/how-to-use-llms-to-build-better-clustering-models-9b17a5491bb4">https://medium.com/@swansburg.justin/how-to-use-llms-to-build-better-clustering-models-9b17a5491bb4</a></p>
</blockquote>
<p><br><br></p>
<h2 id="四启发">四、启发</h2>
<p>读完以上内容，大邓想到一个问题：假设没有简历系统、没有大数据，求职者与面试官面对面交流，数据就是面试过程中的对话，而交流必然通过话语这一媒介。例如求职者的自我介绍</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">“大家好，我叫张三， 今年24岁，哈尔滨人。本科毕业于哈尔滨工业大学，市场营销专业。 我是一个很外向的人，对销售很感兴趣，在大学期间摆了很多地摊。很希望获得贵公司的机会，让我在营销岗位上大放异彩。”
</code></pre></div><p>面试期间，记录人员将张三的个人信息整理为</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">name: 张三
age: 24
city: 哈尔滨
edu: 哈尔滨工业大学
major: 市场营销
experience: 摆摊
summary: 我是外向的人，对销售很感兴趣。
</code></pre></div><p>求职者的信息汇总成 xlsx 后，每个人的信息都或多或少地被压缩了。这种表示方式在小规模时还可以：求职者的总结 summary 信息量很大，能让面试官回忆起当时的场景。但当求职者规模上升到几千上万，summary、note 这类很重要的备注信息反而无法利用。</p>
<p>使用大语言模型LLM，将文本提示转化为嵌入表示。我们可以将LLM看成是一个察言观色，见微知著，明察秋毫的智者。  这个智者可以</p>
<ul>
<li>分类</li>
<li>提取信息</li>
<li>补全</li>
<li>相似性</li>
<li>&hellip;</li>
</ul>
<p>以往处理缺失数据要用插值或其他技巧；现在我们可以借助 LLM，只要其他字段残存微弱线索，LLM 就能帮我们补全缺失值。</p>
<br>
<h3 id="41-分类">4.1 分类</h3>
<p>如图所示， 对于很多短文本， 我们可以推断话题，也可以推断情绪。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://huggingface.co/morit/chinese_xlm_xnli
</code></pre></div><p><img loading="lazy" src="img/10-classification.png" alt=""  />
</p>
<p><img loading="lazy" src="img/11-classification.png" alt=""  />
</p>
<h3 id="42-提取信息">4.2 提取信息</h3>
<p>假设有一些信息存储在文本中，通常可以用正则表达式提取；但像下面的例子，正则会很难设计，用 LLM 则很简单。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large
</code></pre></div><p><img loading="lazy" src="img/12-extract.png" alt=""  />
</p>
<br>
<h3 id="43-补全">4.3 补全</h3>
<p>填充缺失值信息</p>
<p><img loading="lazy" src="img/13-mask.png" alt=""  />
</p>
<br>
<h3 id="44-相似性">4.4 相似性</h3>
<p><img loading="lazy" src="img/14-sim.png" alt=""  />
</p>
<p>当然LLM功能还有很多，大家可以自己探索探索</p>
<p><img loading="lazy" src="img/15-func.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>pandarallel库 | 多核运行提升pandas速度</title>
      <link>https://textdata.cn/blog/2023-11-19-pandarallel-speed-up-pandas/</link>
      <pubDate>Sat, 18 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-19-pandarallel-speed-up-pandas/</guid>
      <description>&lt;p&gt;只需更改一行代码， &lt;strong&gt;pandarallel库&lt;/strong&gt; 就可以充分利用CPU性能，并行化所有 Pandas 操作，加速你的数据处理。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pandarallel&lt;/strong&gt; 还提供漂亮的进度条（在笔记本和终端上可用）以 大致了解要完成的剩余计算量。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;没有并行化&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/progress_apply.gif&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;并行化&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/progress_parallel_apply.gif&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;可以看到，使用并行化后，处理速度快了很多。&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一性能对比&#34;&gt;一、性能对比&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;cpu有n个核，大概并行化会提升大概n倍&lt;/strong&gt;。以下是使用和不使用 Pandaral·lel 的比较基准。实验环境：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;操作系统&lt;/strong&gt;：Linux Ubuntu 16.04&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;硬件&lt;/strong&gt;：Intel Core i7 @ 3.40 GHz - 4 核&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/standard_vs_parallel_4_cores.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;并行操作的运行速度大约是标准操作的 4 倍（个别操作除外，其并行版仅快约 3.2 倍）。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二特性&#34;&gt;二、特性&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;pandarallel&lt;/strong&gt; 目前实现以下 API：&lt;strong&gt;pandas&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;strong&gt;没有并行化&lt;/strong&gt;&lt;/th&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;strong&gt;并行化&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.applymap(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.parallel_applymap(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args).apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args).parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args1).col_name.rolling(args2).apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args1).col_name.rolling(args2).parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args1).col_name.expanding(args2).apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;df.groupby(args1).col_name.expanding(args2).parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.map(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.parallel_map(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.rolling(args).apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;series.rolling(args).parallel_apply(func)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三语法&#34;&gt;三、语法&lt;/h2&gt;
&lt;p&gt;Mac 和 Linux 上没有什么特殊的用法，但在 &lt;strong&gt;Windows&lt;/strong&gt; 上，您调用的函数必须是&lt;strong&gt;自包含&lt;/strong&gt;的，不应依赖外部资源。为了降低记忆负担，咱们不妨约定：在所有系统上，函数都要满足自包含且不依赖外部资源。&lt;/p&gt;
&lt;h3 id=&#34;31-安装&#34;&gt;3.1 安装&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip install pandarallel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-错误用法&#34;&gt;3.2 错误用法&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandarallel&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#初始化，且显示进度条&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;initialize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;progress_bar&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;math&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;func&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# func不能依赖外部资源， math定义在函数体func之外， 会出问题的！&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sin&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;a&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sin&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;b&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
  
  
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;实验的csv文件路径&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;某个数值字段&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parallel_apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;func&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-正确用法&#34;&gt;3.3 正确用法&lt;/h3&gt;
&lt;p&gt;定义好计算函数 &lt;em&gt;&lt;strong&gt;func&lt;/strong&gt;&lt;/em&gt;，标准的 &lt;em&gt;&lt;strong&gt;pandas&lt;/strong&gt;&lt;/em&gt; 计算是在 &lt;em&gt;&lt;strong&gt;pd.Series&lt;/strong&gt;&lt;/em&gt; 基础上调用 &lt;em&gt;&lt;strong&gt;apply&lt;/strong&gt;&lt;/em&gt; 方法，即 &lt;em&gt;&lt;strong&gt;pd.Series.apply(func)&lt;/strong&gt;&lt;/em&gt;。&lt;/p&gt;
&lt;p&gt;而 &lt;em&gt;&lt;strong&gt;pandarallel&lt;/strong&gt;&lt;/em&gt; 稍微修改了方法名， &lt;em&gt;&lt;strong&gt;pd.Series.parallel_apply(func)&lt;/strong&gt;&lt;/em&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandarallel&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#初始化，且显示进度条&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;initialize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;progress_bar&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;func&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;math&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 在函数体func内导入math，调用math， okay!&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sin&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;a&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sin&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;b&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
  
  
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;实验的csv文件路径&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;某个数值字段&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parallel_apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;func&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四实验&#34;&gt;四、实验&lt;/h2&gt;
&lt;p&gt;对一个 &lt;em&gt;&lt;strong&gt;xlsx&lt;/strong&gt;&lt;/em&gt; 文件的 &lt;em&gt;&lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt; 字段进行词频统计， 结果保存到新字段 &lt;em&gt;&lt;strong&gt;wordCount&lt;/strong&gt;&lt;/em&gt; 中。&lt;/p&gt;
&lt;h3 id=&#34;41-读取数据&#34;&gt;4.1 读取数据&lt;/h3&gt;
&lt;p&gt;mda01-22.xlsx数据有55439条记录， 体积573M。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-22.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;55439
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;42-没有并行&#34;&gt;4.2 没有并行&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;
    
&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;word_count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-22.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;wordCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word_count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: user 11min 56s, sys: 10.5 s, total: 12min 7s
Wall time: 12min 7s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-并行化&#34;&gt;4.3 并行化&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;o&#34;&gt;%%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandarallel&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#初始化，且显示进度条&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pandarallel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;initialize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;progress_bar&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;parallel_word_count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mda01-22.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;wordCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parallel_apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parallel_word_count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
CPU times: user 12.4 s, sys: 1.41 s, total: 13.8 s
Wall time: 2min 40s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;Wow, 运行总时间从 12min 7s 降低到 2min 40s。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;44-使用场景&#34;&gt;4.4 使用场景&lt;/h3&gt;
&lt;p&gt;并行化是有代价的（实例化新进程、通过共享内存发送数据、 &amp;hellip;），只有在计算量足够大时并行化才有收益。对于小规模数据，并行化并不总是值得：经过测试，用一个 61KB 的 xlsx 文件实验，并行化反而更慢了。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pandarallel&lt;/strong&gt; 通过使用计算机 CPU 的所有内核来绕过 pandas 单核运行的限制，但代价是内存占用约为标准操作的两倍。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>只需更改一行代码， <strong>pandarallel库</strong> 就可以充分利用CPU性能，并行化所有 Pandas 操作，加速你的数据处理。</p>
<p><strong>pandarallel</strong> 还提供漂亮的进度条（在 Jupyter 笔记本和终端中均可用），便于大致了解剩余的计算量。</p>
<p><strong>没有并行化</strong></p>
<p><img loading="lazy" src="img/progress_apply.gif" alt=""  />
</p>
<p><strong>并行化</strong></p>
<p><img loading="lazy" src="img/progress_parallel_apply.gif" alt=""  />
</p>
<p>可以看到，使用并行化后，处理速度快了很多。</p>
<br>
<h2 id="一性能对比">一、性能对比</h2>
<p><strong>CPU 有 n 个核，并行化大约能带来 n 倍的提升</strong>。以下是使用和不使用 Pandaral·lel 的基准对比。实验环境：</p>
<ul>
<li><strong>操作系统</strong>：Linux Ubuntu 16.04</li>
<li><strong>硬件</strong>：Intel Core i7 @ 3.40 GHz - 4 核</li>
</ul>
<p><img loading="lazy" src="img/standard_vs_parallel_4_cores.png" alt=""  />
</p>
<p>并行操作的运行速度大约是标准操作的 4 倍（个别操作例外，仅快约 3.2 倍）。</p>
<p><br><br></p>
<h2 id="二特性">二、特性</h2>
<p><strong>pandarallel</strong> 目前实现了以下 <strong>pandas</strong> API：</p>
<table>
<thead>
<tr>
<th style="text-align:left"><strong>没有并行化</strong></th>
<th style="text-align:left"><strong>并行化</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left"><code>df.apply(func)</code></td>
<td style="text-align:left"><code>df.parallel_apply(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>df.applymap(func)</code></td>
<td style="text-align:left"><code>df.parallel_applymap(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>df.groupby(args).apply(func)</code></td>
<td style="text-align:left"><code>df.groupby(args).parallel_apply(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>df.groupby(args1).col_name.rolling(args2).apply(func)</code></td>
<td style="text-align:left"><code>df.groupby(args1).col_name.rolling(args2).parallel_apply(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>df.groupby(args1).col_name.expanding(args2).apply(func)</code></td>
<td style="text-align:left"><code>df.groupby(args1).col_name.expanding(args2).parallel_apply(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>series.map(func)</code></td>
<td style="text-align:left"><code>series.parallel_map(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>series.apply(func)</code></td>
<td style="text-align:left"><code>series.parallel_apply(func)</code></td>
</tr>
<tr>
<td style="text-align:left"><code>series.rolling(args).apply(func)</code></td>
<td style="text-align:left"><code>series.rolling(args).parallel_apply(func)</code></td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="三语法">三、语法</h2>
<p>在 Mac 和 Linux 上没有特殊要求，但在 <strong>Windows</strong> 上，您调用的函数必须是<strong>自包含</strong>的，不应依赖外部资源。为了降低记忆负担，咱们不妨约定：在所有系统上都让函数自包含且不依赖外部资源。</p>
<h3 id="31-安装">3.1 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install pandarallel
</code></pre></div><br>
<h3 id="32-错误用法">3.2 错误用法</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandarallel</span> <span class="kn">import</span> <span class="n">pandarallel</span>

<span class="c1">#初始化，且显示进度条</span>
<span class="n">pandarallel</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>


<span class="kn">import</span> <span class="nn">math</span>
<span class="k">def</span> <span class="nf">func</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="c1"># func 不能依赖外部资源：math 在函数体 func 之外导入，会出问题的！</span>
    <span class="k">return</span> <span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">a</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">b</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
  
  
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;实验的csv文件路径&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;result&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;某个数值字段&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">parallel_apply</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="33-正确用法">3.3 正确用法</h3>
<p>定义好计算函数 <em><strong>func</strong></em>，标准的 <em><strong>pandas</strong></em> 计算是在 <em><strong>pd.Series</strong></em> 上调用 <em><strong>apply</strong></em> 方法，即 <em><strong>pd.Series.apply(func)</strong></em>。</p>
<p>而 <em><strong>pandarallel</strong></em> 稍微修改了方法名， <em><strong>pd.Series.parallel_apply(func)</strong></em>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandarallel</span> <span class="kn">import</span> <span class="n">pandarallel</span>
<span class="c1">#初始化，且显示进度条</span>
<span class="n">pandarallel</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">func</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">math</span>
    <span class="c1"># 在函数体 func 内导入并调用 math，okay!</span>
    <span class="k">return</span> <span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">a</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">b</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
  
  
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;实验的csv文件路径&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;result&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;某个数值字段&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">parallel_apply</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="四实验">四、实验</h2>
<p>对一个 <em><strong>xlsx</strong></em> 文件的 <em><strong>text</strong></em> 字段进行词频统计， 结果保存到新字段 <em><strong>wordCount</strong></em> 中。</p>
<h3 id="41-读取数据">4.1 读取数据</h3>
<p>mda01-22.xlsx数据有55439条记录， 体积573M。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;mda01-22.xlsx&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">55439
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="42-没有并行">4.2 没有并行</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">jieba</span>
    
<span class="k">def</span> <span class="nf">word_count</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;mda01-22.xlsx&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;wordCount&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">word_count</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: user 11min 56s, sys: 10.5 s, total: 12min 7s
Wall time: 12min 7s
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<h3 id="43-并行化">4.3 并行化</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="o">%%</span><span class="n">time</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandarallel</span> <span class="kn">import</span> <span class="n">pandarallel</span>

<span class="c1">#初始化，且显示进度条</span>
<span class="n">pandarallel</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">parallel_word_count</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">jieba</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;mda01-22.xlsx&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;wordCount&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">parallel_apply</span><span class="p">(</span><span class="n">parallel_word_count</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
CPU times: user 12.4 s, sys: 1.41 s, total: 13.8 s
Wall time: 2min 40s
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p>Wow, 运行总时间从 12min 7s 降低到 2min 40s。</p>
<br>
<h3 id="44-使用场景">4.4 使用场景</h3>
<p>并行化是有代价的（实例化新进程、通过共享内存发送数据、 &hellip;），只有在计算量足够大时并行化才有收益。对于小规模数据，并行化并不总是值得：经过测试，用一个 61KB 的 xlsx 文件实验，并行化反而更慢了。</p>
<p><strong>pandarallel</strong> 通过使用计算机 CPU 的所有内核来绕过 pandas 单核运行的限制，但代价是内存占用约为标准操作的两倍。</p>
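<p>下面用标准库 <strong>multiprocessing</strong> 做一个示意性的小实验（笔者补充的演示，与 pandarallel 的内部实现无关）：当任务很轻、数据很小时，进程池的启动和数据传输开销很容易超过计算本身。</p>

```python
import time
from multiprocessing import Pool

def word_len(s):
    # 模拟一个非常轻量的计算
    return len(s)

if __name__ == "__main__":
    texts = ["文本分析"] * 1000  # 小数据集

    t0 = time.time()
    serial = [word_len(t) for t in texts]
    print(f"串行耗时 {time.time() - t0:.4f}s")

    t0 = time.time()
    with Pool(4) as p:  # 实例化 4 个工作进程本身就有开销
        parallel = p.map(word_len, texts)
    print(f"并行耗时 {time.time() - t0:.4f}s")

    assert serial == parallel  # 两种方式结果一致，但小数据下并行往往更慢
```

<p>把 <code>texts</code> 换成几十万条长文本再跑一次，就能看到并行的优势反转回来。</p>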
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Modin，只需一行代码加速你的Pandas</title>
      <link>https://textdata.cn/blog/2023-11-17-modin-accecerate-your-process/</link>
      <pubDate>Fri, 17 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-17-modin-accecerate-your-process/</guid>
      <description>&lt;p&gt;modin库是python的第三方库，只需一行代码，就能用pandas语法来加速数据处理过程。&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一modin有啥用&#34;&gt;一、modin有啥用？&lt;/h2&gt;
&lt;p&gt;pandas库以其简洁易用的api，受到数据分析师喜爱，能做python、sql、excel三者都能做的数据分析。现在的电脑CPU一般都是多核，但pandas只能单核，导致数据处理能力有限。&lt;/p&gt;
&lt;p&gt;而今天，我们要分享的modin，可以利用电脑cpu所有的内核， 加速数据处理。假设你的电脑cpu有4个内核， pandas相当于雇佣了一个工人干活，而modin同时雇佣四个人干活，所以同样的任务，理论上modin比pandas要快4倍。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二modin特点&#34;&gt;二、modin特点&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;支持pandas.DataFrame数据类型&lt;/li&gt;
&lt;li&gt;与pandas兼容，语法相似，几乎不需要额外学习；&lt;/li&gt;
&lt;li&gt;能处理1MB到1TB+的数据；&lt;/li&gt;
&lt;li&gt;使用者不需要知道系统有多少内核，也不需要指定如何分配数据；&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三实验&#34;&gt;三、实验&lt;/h2&gt;
&lt;h3 id=&#34;31-环境准备&#34;&gt;3.1 环境准备&lt;/h3&gt;
&lt;p&gt;在命令行cmd (苹果电脑在terminal)中执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install &amp;#34;modin[all]&amp;#34;
pip3 install humanize
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-使用方法&#34;&gt;3.2 使用方法&lt;/h3&gt;
&lt;p&gt;只需要一行代码，即可实现pandas功能。 下面的两行代码， mpd几乎等同于我们熟悉的pd。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;import modin.pandas as mpd
import pandas as pd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;33-准备数据&#34;&gt;3.3 准备数据&lt;/h3&gt;
&lt;p&gt;这里用  &lt;a href=&#34;https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/&#34;&gt;&lt;strong&gt;数据集(付费) | 3571万条专利申请数据集(1985-2022年)&lt;/strong&gt;&lt;/a&gt; 为例，&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/data-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;humanize&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;naturalsize&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;csvfsizes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;getsize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; 
             &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;listdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
             &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;.csv&amp;#39;&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#排序，文件体积从大到小&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;csvfsizes&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;sorted&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvfsizes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                   &lt;span class=&#34;n&#34;&gt;key&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; 
                   &lt;span class=&#34;n&#34;&gt;reverse&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvfsizes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;humansize&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;naturalsize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;humansize&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;广东省.csv   10.4 GB
江苏省.csv   9.6 GB
浙江省.csv   7.1 GB
其他国家.csv   6.2 GB
北京市.csv   4.6 GB
山东省.csv   4.3 GB
上海市.csv   3.1 GB
安徽省.csv   3.0 GB
四川省.csv   2.3 GB
湖北省.csv   2.1 GB
福建省.csv   2.1 GB
河南省.csv   2.0 GB
天津市.csv   1.6 GB
湖南省.csv   1.5 GB
陕西省.csv   1.5 GB
辽宁省.csv   1.4 GB
河北省.csv   1.3 GB
重庆市.csv   1.2 GB
江西省.csv   1.0 GB
广西壮族自治区.csv   809.9 MB
台湾省.csv   792.9 MB
黑龙江省.csv   784.5 MB
贵州省.csv   542.4 MB
云南省.csv   538.9 MB
吉林省.csv   524.9 MB
...
香港特别行政区.csv   90.2 MB
青海省.csv   74.9 MB
西藏自治区.csv   19.5 MB
澳门特别行政区.csv   3.5 MB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;34-读取速度&#34;&gt;3.4 读取速度&lt;/h3&gt;
&lt;p&gt;我们分别选择&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;吉林省.csv 524.9 MB&lt;/li&gt;
&lt;li&gt;江西省.csv 1.0 GB&lt;/li&gt;
&lt;li&gt;北京市.csv 4.6 GB&lt;/li&gt;
&lt;li&gt;广东省.csv 10.4 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;来测试读取数据的速度&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;modin.pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;mpd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#Pandas 524.9 MB&lt;/span&gt;
&lt;span class=&#34;o&#34;&gt;%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;吉林省.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: total: 10.6 s
Wall time: 11.2 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#Modin 524.9 MB&lt;/span&gt;
&lt;span class=&#34;o&#34;&gt;%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mpd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;吉林省.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: total: 1.38 s
Wall time: 2.68 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;其他几个文件语法类似， 都有显著的速度提升。以下是实验表现&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;文件&lt;/th&gt;
&lt;th&gt;体积&lt;/th&gt;
&lt;th&gt;pandas（Wall time）&lt;/th&gt;
&lt;th&gt;modin（Wall time）&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;吉林省.csv&lt;/td&gt;
&lt;td&gt;524.9 MB&lt;/td&gt;
&lt;td&gt;11.2 s&lt;/td&gt;
&lt;td&gt;2.68 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;江西省.csv&lt;/td&gt;
&lt;td&gt;1.0 GB&lt;/td&gt;
&lt;td&gt;22.9 s&lt;/td&gt;
&lt;td&gt;5.17 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;北京.csv&lt;/td&gt;
&lt;td&gt;4.6 GB&lt;/td&gt;
&lt;td&gt;100s&lt;/td&gt;
&lt;td&gt;24.7 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;广东省.csv&lt;/td&gt;
&lt;td&gt;10.4 GB&lt;/td&gt;
&lt;td&gt;213s&lt;/td&gt;
&lt;td&gt;55.9 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;35-运算速度&#34;&gt;3.5 运算速度&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;modin.pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;mpd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mpd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;广东省.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#计算文本长度&lt;/span&gt;
&lt;span class=&#34;o&#34;&gt;%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利摘要&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: total: 15.6 ms
Wall time: 26.5 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;广东省.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;o&#34;&gt;%&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;time&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;专利摘要&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;CPU times: total: 3.02 s
Wall time: 3.33 s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;modin在计算方面快了125倍。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;注意&#34;&gt;注意&lt;/h2&gt;
&lt;p&gt;但是由于时间限制，实验比较简单， 个中情况不能一一覆盖。 也有人反映，使用modin，反而比pandas更慢了。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>modin库是python的第三方库，只需一行代码，就能用pandas语法来加速数据处理过程。</p>
<br>
<h2 id="一modin有啥用">一、modin有啥用？</h2>
<p>pandas库以其简洁易用的api，受到数据分析师喜爱，能做python、sql、excel三者都能做的数据分析。现在的电脑CPU一般都是多核，但pandas只能单核，导致数据处理能力有限。</p>
<p>而今天，我们要分享的modin，可以利用电脑cpu所有的内核， 加速数据处理。假设你的电脑cpu有4个内核， pandas相当于雇佣了一个工人干活，而modin同时雇佣四个人干活，所以同样的任务，理论上modin比pandas要快4倍。</p>
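<p>可以先用标准库确认自己电脑的核心数，估算理论上的加速上限（笔者补充的小例子）：</p>

```python
import os

# 逻辑核心数，也就是 modin 理论上可同时"雇佣"的工人数
n = os.cpu_count() or 1
print(n)
```
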
<p><br><br></p>
<h2 id="二modin特点">二、modin特点</h2>
<ol>
<li>支持pandas.DataFrame数据类型</li>
<li>与pandas兼容，语法相似，几乎不需要额外学习；</li>
<li>能处理1MB到1TB+的数据；</li>
<li>使用者不需要知道系统有多少内核，也不需要指定如何分配数据；</li>
</ol>
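<p>modin 底层可选多种并行引擎（如 Ray、Dask）。按 modin 文档的说法，可在导入 modin.pandas 之前通过环境变量 MODIN_ENGINE 指定引擎；下面是设置方式的示意（假设已安装对应引擎）：</p>

```python
import os

# 按 modin 文档的说法，可在导入 modin.pandas 之前通过环境变量 MODIN_ENGINE
# 指定并行引擎（如 "ray" 或 "dask"，需已安装对应引擎；此处仅演示设置方式）
os.environ["MODIN_ENGINE"] = "ray"

# import modin.pandas as mpd  # 设置好环境变量后再导入 modin

print(os.environ["MODIN_ENGINE"])
```

通常也可以不手动设置，modin 会自动选择已安装的可用引擎。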
<p><br><br></p>
<h2 id="三实验">三、实验</h2>
<h3 id="31-环境准备">3.1 环境准备</h3>
<p>在命令行cmd (苹果电脑在terminal)中执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install &#34;modin[all]&#34;
pip3 install humanize
</code></pre></div><br>
<h3 id="32-使用方法">3.2 使用方法</h3>
<p>只需替换一行导入代码，即可沿用pandas的功能。对比下面两行导入语句，mpd几乎等同于我们熟悉的pd。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import modin.pandas as mpd
import pandas as pd
</code></pre></div><br>
<h3 id="33-准备数据">3.3 准备数据</h3>
<p>这里用  <a href="https://textdata.cn/blog/2023-04-13-3571w-patent-dataset-in-china-mainland/"><strong>数据集(付费) | 3571万条专利申请数据集(1985-2022年)</strong></a> 为例，</p>
<p><img loading="lazy" src="img/data-screen.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">humanize</span> <span class="kn">import</span> <span class="n">naturalsize</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="n">csvfsizes</span> <span class="o">=</span> <span class="p">[(</span><span class="n">f</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">getsize</span><span class="p">(</span><span class="n">f</span><span class="p">))</span> <span class="k">for</span> 
             <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;.&#39;</span><span class="p">)</span> 
             <span class="k">if</span> <span class="s1">&#39;.csv&#39;</span> <span class="ow">in</span> <span class="n">f</span><span class="p">]</span>

<span class="c1">#排序，文件体积从大到小</span>
<span class="n">csvfsizes</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">csvfsizes</span><span class="p">,</span> 
                   <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> 
                   <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="k">for</span> <span class="n">csvf</span><span class="p">,</span> <span class="n">size</span> <span class="ow">in</span> <span class="n">csvfsizes</span><span class="p">:</span>
    <span class="n">humansize</span> <span class="o">=</span> <span class="n">naturalsize</span><span class="p">(</span><span class="n">size</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="s1">&#39; &#39;</span><span class="p">,</span> <span class="n">humansize</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">广东省.csv   10.4 GB
江苏省.csv   9.6 GB
浙江省.csv   7.1 GB
其他国家.csv   6.2 GB
北京市.csv   4.6 GB
山东省.csv   4.3 GB
上海市.csv   3.1 GB
安徽省.csv   3.0 GB
四川省.csv   2.3 GB
湖北省.csv   2.1 GB
福建省.csv   2.1 GB
河南省.csv   2.0 GB
天津市.csv   1.6 GB
湖南省.csv   1.5 GB
陕西省.csv   1.5 GB
辽宁省.csv   1.4 GB
河北省.csv   1.3 GB
重庆市.csv   1.2 GB
江西省.csv   1.0 GB
广西壮族自治区.csv   809.9 MB
台湾省.csv   792.9 MB
黑龙江省.csv   784.5 MB
贵州省.csv   542.4 MB
云南省.csv   538.9 MB
吉林省.csv   524.9 MB
...
香港特别行政区.csv   90.2 MB
青海省.csv   74.9 MB
西藏自治区.csv   19.5 MB
澳门特别行政区.csv   3.5 MB
</code></pre></div><br>
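<p>上面用到的 humanize.naturalsize 只是格式化工具；若不想额外安装，也可以用几行纯 Python 实现一个近似版本（示意代码，进位细节与 humanize 的默认输出可能略有差异）：</p>

```python
def natural_size(num_bytes):
    """将字节数格式化为人类可读的十进制单位（与 humanize.naturalsize 默认行为类似）"""
    units = ['B', 'kB', 'MB', 'GB', 'TB']
    size = float(num_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{int(size)} {unit}" if unit == 'B' else f"{size:.1f} {unit}"
        size /= 1000

print(natural_size(524_900_000))     # 吉林省.csv 的大致体积
print(natural_size(10_400_000_000))  # 广东省.csv 的大致体积
```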
<h3 id="34-读取速度">3.4 读取速度</h3>
<p>我们分别选择</p>
<ul>
<li>吉林省.csv 524.9 MB</li>
<li>江西省.csv 1.0 GB</li>
<li>北京市.csv 4.6 GB</li>
<li>广东省.csv 10.4 GB</li>
</ul>
<p>来测试读取数据的速度</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">modin.pandas</span> <span class="k">as</span> <span class="nn">mpd</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#Pandas 524.9 MB</span>
<span class="o">%</span><span class="n">time</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;吉林省.csv&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: total: 10.6 s
Wall time: 11.2 s
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#Modin 524.9 MB</span>
<span class="o">%</span><span class="n">time</span> <span class="n">df</span> <span class="o">=</span> <span class="n">mpd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;吉林省.csv&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: total: 1.38 s
Wall time: 2.68 s
</code></pre></div><p>其他几个文件的读取语法类似，均有显著的速度提升。以下是实验结果：</p>
<table>
<thead>
<tr>
<th>文件</th>
<th>体积</th>
<th>pandas（Wall time）</th>
<th>modin（Wall time）</th>
</tr>
</thead>
<tbody>
<tr>
<td>吉林省.csv</td>
<td>524.9 MB</td>
<td>11.2 s</td>
<td>2.68 s</td>
</tr>
<tr>
<td>江西省.csv</td>
<td>1.0 GB</td>
<td>22.9 s</td>
<td>5.17 s</td>
</tr>
<tr>
<td>北京市.csv</td>
<td>4.6 GB</td>
<td>100 s</td>
<td>24.7 s</td>
</tr>
<tr>
<td>广东省.csv</td>
<td>10.4 GB</td>
<td>213 s</td>
<td>55.9 s</td>
</tr>
</tbody>
</table>
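<p>根据上表的 Wall time 可以粗略算出各文件的加速比（数值取自上表，仅为示意脚本）：</p>

```python
# (文件, pandas Wall time/秒, modin Wall time/秒)，数值取自上表
records = [
    ('吉林省.csv', 11.2, 2.68),
    ('江西省.csv', 22.9, 5.17),
    ('北京市.csv', 100.0, 24.7),
    ('广东省.csv', 213.0, 55.9),
]

for name, t_pandas, t_modin in records:
    print(f'{name}: 加速约 {t_pandas / t_modin:.1f} 倍')
```

四个文件的加速比都在 4 倍上下，与前文“4 个内核理论加速约 4 倍”的估算量级一致。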
<p><br><br></p>
<h3 id="35-运算速度">3.5 运算速度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">modin.pandas</span> <span class="k">as</span> <span class="nn">mpd</span>

<span class="n">df1</span> <span class="o">=</span> <span class="n">mpd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;广东省.csv&#39;</span><span class="p">)</span>
<span class="c1">#计算文本长度</span>
<span class="o">%</span><span class="n">time</span> <span class="n">df1</span><span class="p">[</span><span class="s1">&#39;专利摘要&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: total: 15.6 ms
Wall time: 26.5 ms
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;广东省.csv&#39;</span><span class="p">)</span>
<span class="o">%</span><span class="n">time</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;专利摘要&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">CPU times: total: 3.02 s
Wall time: 3.33 s
</code></pre></div><p>在本次计算实验中，modin 比 pandas 快约 125 倍（26.5 ms 对 3.33 s）。</p>
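<p>顺带一提，%time 是 IPython/Jupyter 的魔法命令，在普通 Python 脚本中不可用。脚本中可以用标准库 time.perf_counter 做类似的计时（示意代码，timed 为自定义的假想函数名）：</p>

```python
import time

def timed(func, *args, **kwargs):
    """执行 func 并返回 (结果, 耗时秒数)，作用类似 %time"""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f'Wall time: {elapsed:.2f} s')
    return result, elapsed

# 用法示意（此处用 time.sleep 代替真实的读取/计算任务）：
_, seconds = timed(time.sleep, 0.1)
```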
<p><br><br></p>
<h2 id="注意">注意</h2>
<p>由于时间所限，本实验设计较为简单，不能覆盖所有使用场景。也有用户反映，在某些场景下使用 modin 反而比 pandas 更慢。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Word Embeddings、Transformer与GPT：一文揭示三者关系</title>
      <link>https://textdata.cn/blog/2023-11-16-how-to-understand-the-meaning-of-gpt/</link>
      <pubDate>Thu, 16 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-16-how-to-understand-the-meaning-of-gpt/</guid>
      <description>&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;作者: 7号床
公众号: 7号床
原文  https://zhuanlan.zhihu.com/p/666206302
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一gpt-的名词解释&#34;&gt;一、GPT 的名词解释&lt;/h2&gt;
&lt;p&gt;著名的 &lt;strong&gt;GPT&lt;/strong&gt; 这个名字全称是 &lt;strong&gt;Generative Pre-trained Transformer&lt;/strong&gt;。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Generative&lt;/strong&gt; 是&amp;quot;生成式&amp;quot;的意思，也就是说这个 AI 模型是用来生成内容的。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-trained&lt;/strong&gt; 是“预训练”的意思，就是说这个 AI 模型能有很强的能力，是因为它事先做了大量的训练，台上一分钟台下十年功。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transformer&lt;/strong&gt; , 就有点耐人寻味了，不仅普通人不理解，就连很多专业领域的人员理解起来也都是含混不清、似是而非。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;ChatGPT 是 GPT 大模型在聊天对话领域的应用程序&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt; 作为单词，翻译出来频率最高的意思是 &lt;strong&gt;变压器&lt;/strong&gt;，然后是 &lt;strong&gt;变形金刚&lt;/strong&gt; ，还有一些引申的含义是 &lt;strong&gt;转换器&lt;/strong&gt; 、&lt;strong&gt;促使变化者&lt;/strong&gt; 、&lt;strong&gt;转变者&lt;/strong&gt; 或 &lt;strong&gt;改革者&lt;/strong&gt;等等。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;谷歌翻译上对 &lt;strong&gt;Transformer&lt;/strong&gt; 的英译中翻译&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;再把 &lt;strong&gt;Transformer&lt;/strong&gt; 放到  &lt;strong&gt;Chat Generative Pre-trained Transformer&lt;/strong&gt; 中看看，突然间变得奇怪了，难道 ChatGPT 借鉴了变压器的技术？还是说 ChatGPT 是一个变形金刚？或者索性就翻译成通用的安全的叫法 &lt;strong&gt;转换器&lt;/strong&gt; ？这让人百思不得其解。&lt;/p&gt;
&lt;p&gt;单从 GPT 这三个字母的组合就能看出来， &lt;strong&gt;Generative&lt;/strong&gt; 与 &lt;strong&gt;Pre-trained&lt;/strong&gt; 都是定语，而 &lt;strong&gt;Transformer 才是 GPT 的主体，才是 GPT 的灵魂&lt;/strong&gt;所在。可以说，理解透了 &lt;strong&gt;Transformer&lt;/strong&gt; 的真正含义，才能初步地理解 GPT。另一方面， Transformer 这个词太重要了。它在这几年的人工智能领域大放异彩，不仅仅局限于 NLP 自然语言处理领域，它还有着更广阔的发展空间。 Transformer 目前已经进入到了多模态领域，比如音频与视觉，甚至数学公式、代码编程等领域，著名的 &lt;strong&gt;Stable Diffusion 中也用到了 Transformer&lt;/strong&gt;。&lt;strong&gt;可以说，所有生成式人工智能领域的大模型中目前都有了这个 Transformer 的身影&lt;/strong&gt;。既然如此重要，那就让我们深入地探究一下 &lt;strong&gt;Transformer&lt;/strong&gt; 在人工智能领域最确切的最标准的含义到底是什么吧！&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt; 最早是由 Google 的人工智能团队提出来的。在 2017 年 6 月发表的论文&lt;strong&gt;《Attention Is All You Need》&lt;/strong&gt;中，他们首次提出了一种新的神经网络架构 &lt;strong&gt;Transformer&lt;/strong&gt;。Transformer 依赖于一个叫“自注意力机制”（Self-Attention）的内部构件，可十分准确高效地对自然语言领域的问题进行处理，以完美地解决翻译、对话、论文写作甚至编程等复杂的问题。&lt;/p&gt;
&lt;p&gt;顺藤摸瓜可以看出，&lt;strong&gt;GPT 的核心是 Transformer，而 Transformer 的核心则是“自注意力机制”（Self-Attention）&lt;/strong&gt;。那么这个“自注意力机制”又是什么东西呢？让我们用语言翻译领域的几个简单易懂的例子来讲解一下。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二-transformer-的核心-self-attention&#34;&gt;二、 Transformer 的核心 Self-Attention&lt;/h2&gt;
&lt;p&gt;首先，看下面这两个短句：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;句子I&lt;/strong&gt;：The bank of the river.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;句子II&lt;/strong&gt;：Money in the bank.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;在翻译成中文的过程中，机器算法是如何知道“句子I”中的“bank”指的是自然环境中的“岸边”，而“句子II”中的“bank”指的是金融体系中的“银行”呢？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;bank在不同句子中指代不同的事物&lt;/center&gt;&lt;/p&gt;
&lt;h3 id=&#34;21-人类脑中的翻译算法&#34;&gt;2.1 人类脑中的翻译算法&lt;/h3&gt;
&lt;p&gt;作为人类的我们当然会觉得这是一个再简单不过的事情了，那是因为我们的语言技能从幼儿发展到成年人后，早已烂熟于心了。但即使烂熟于心，也并不意味着在我们的大脑中没有对应的计算过程。&lt;strong&gt;实际上人工智能的翻译过程就是对我们人脑中的计算过程的模拟&lt;/strong&gt;。那么就让我们回想一下儿童时期学习语言时的情景吧，回想一下当时的我们是怎么知道一个多义词在某一句话中具体的含义的？&lt;/p&gt;
&lt;p&gt;人类做这件事的方法是根据 &lt;strong&gt;前后文的语义对照&lt;/strong&gt; 来确定结果，即看句子中其他相关联的单词是什么含义。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;在 &lt;strong&gt;句子I&lt;/strong&gt; 中， &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 这个词指明了自然环境，&lt;/li&gt;
&lt;li&gt;而在 &lt;strong&gt;句子II&lt;/strong&gt;中， &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 这个词则指明了金融环境。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;所以两个句子中的多义词“bank”也就有了各自的定位。如果把这种方式总结成一种算法的话，这个算法就可以用于人工智能领域用于语言处理了。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-机器算法模拟人脑中的翻译过程&#34;&gt;2.2 机器算法模拟人脑中的翻译过程&lt;/h3&gt;
&lt;p&gt;但人工智能作为一种计算机算法，它只能处理冷冰冰的数字，并不知道何为自然环境，何为金融环境，它又是怎么去判断 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 各自的含义呢。实际上，机器算法并不知道 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 的具体含义。但是机器可以通过某种数字的方式来表达 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; ，同时，通过数字的方式还表达了许许多多其他的词汇，其中必然会有一些词汇会与 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 有着很紧密的语义上的逻辑关系。通过判断 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 各与哪些词汇在语义上有紧密的逻辑关系，便可以知道这两个词各属于什么领域了。&lt;/p&gt;
&lt;p&gt;（其实，不像人类会对某个领域有一个具体的名称来命名，在人工智能领域，机器最终也不知道这个领域的统称到底叫什么名字，但它却知道这个领域中都包括了哪些词、哪些概念和哪些逻辑。&lt;em&gt;&lt;strong&gt;机器不以单独名称来定义一个概念，它却可以用很多相关的概念与逻辑来圈定这一个概念！&lt;/strong&gt;&lt;/em&gt;这可能就是老子说的：道可道非常道，名可名非常名吧。）&lt;/p&gt;
&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;独热编码法(One-hot Encoding)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;那么就让我们看看这种数字表达方式具体是什么样子吧。&lt;/p&gt;
&lt;p&gt;假设这个世界上有100万个单词，每一个单词，我们都可以用一组 0 和 1 组成的向量（一组数字）来定义的话，那么每一个单词就可以被编码成100万个0或1组成的向量。如下图：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/4.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;独热编码示例&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;这种单词编码方法叫 &lt;strong&gt;独热编码法(One-hot Encoding)&lt;/strong&gt;。可是这样一维的编码方法将导致向量占用的空间过大：1个单词用100万个单元的向量表达，世界上一共有100万个单词，那么就需要1万亿（100万×100万）个单元的体积来把它们表达出来，很明显这种臃肿的结构不利于电脑计算。&lt;/p&gt;
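独热编码的原理和局限可以用几行 Python 直观感受（词表与单词仅为示意）：

```python
# 独热编码示意：词表有多长，向量就有多长（这里假设词表只有 5 个词）
vocab = ['apple', 'bag', 'cat', 'dog', 'elephant']

def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

cat_vec = one_hot('cat', vocab)
dog_vec = one_hot('dog', vocab)

# 任意两个不同单词的独热向量点积为 0：完全看不出 cat 和 dog 语义相近
dot = sum(a * b for a, b in zip(cat_vec, dog_vec))
print(cat_vec, dot)
```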
&lt;p&gt;但最大的问题还不在于这个体积问题，而是语义联系问题。独热编码使得单词与单词之间完全相互独立，从每个单词所编码成为的100万个单元的向量身上，根本看不出它与其他单词有何种语义内涵上的逻辑联系。比如，在这些数字中，我们无法知道 &lt;em&gt;&lt;strong&gt;apple&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;bag&lt;/strong&gt;&lt;/em&gt; 属于静物，区别于 cat 和 &lt;em&gt;&lt;strong&gt;dog&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;elephant&lt;/strong&gt;&lt;/em&gt; 属于动物且是哺乳动物，而 &lt;em&gt;&lt;strong&gt;cat&lt;/strong&gt;&lt;/em&gt;  和 &lt;em&gt;&lt;strong&gt;dog&lt;/strong&gt;&lt;/em&gt; 又属于小动物，且大多数为非野生，区别于 &lt;em&gt;&lt;strong&gt;elephant&lt;/strong&gt;&lt;/em&gt; 为大型的野生动物，等等等等，这些单词背后所蕴含的各种内在的逻辑联系和分类关系均无法从独热编码法中知晓。实际上独热编码是传统计算机数据库时代的产物，而在人工智能领域则采用另一种编码法。为了解决独热编码的问题， &lt;strong&gt;词嵌入编码法(Word Embedding)&lt;/strong&gt; 诞生了，如下图：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/5.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;Word Embedding 词嵌入编码示意，及 Embedding 空间&lt;/center&gt;&lt;/p&gt;
&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;词嵌入编码法(Word Embedding)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;词嵌入编码法(Word Embedding)&lt;/strong&gt;将语义上相近的、有关联的词汇在 Embedding 空间中生成相近的位置定位。相对于 &lt;strong&gt;独热编码法&lt;/strong&gt; 超长的一维数据，词嵌入编码法(Word Embedding) 提升了数据的表达维度，它更像是在某一个 &lt;strong&gt;空间&lt;/strong&gt; 中对词汇进行编码。&lt;/p&gt;
&lt;p&gt;如上图（为了在此文章中表达方便，我们仅用二维空间来表达，实际上这个空间的维度很高，至少要在512维之上！一维二维三维的空间大家都可以在脑中想象出来对应的画面，但是四维以上以至于 512 维就难以图形化的想象了。），在 Embedding 的二维空间中 &lt;em&gt;&lt;strong&gt;dog&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;cat&lt;/strong&gt;&lt;/em&gt; 、&lt;em&gt;&lt;strong&gt;rabbit&lt;/strong&gt;&lt;/em&gt; 三个向量的坐标点位排布，可以看到三个绿色的点距离很近，是因为他们三个相对于其他来说语义上更接近。tree 和 flower 则离它们较远，但是 &lt;em&gt;&lt;strong&gt;cat&lt;/strong&gt;&lt;/em&gt; 会因为在很多语言的文章中都会有“爬树”的词汇出现在同一句话中，所以导致  &lt;em&gt;&lt;strong&gt;cat&lt;/strong&gt;&lt;/em&gt;  会与  &lt;em&gt;&lt;strong&gt;tree&lt;/strong&gt;&lt;/em&gt;  离得较近一些。同时 &lt;em&gt;&lt;strong&gt;dog&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;rabbit&lt;/strong&gt;&lt;/em&gt;  与  &lt;em&gt;&lt;strong&gt;tree&lt;/strong&gt;&lt;/em&gt; 的关系就较远。&lt;/p&gt;
&lt;p&gt;实际上，在 Embedding 空间中，词与词之间的关系还不仅仅限于语义上的分类所导致的定位远近这么简单。一个词所代表的事物与其他词所代表的事物之间能产生内在联系的往往有成百上千上万种之多。比如  &lt;em&gt;&lt;strong&gt;man&lt;/strong&gt;&lt;/em&gt;  和  &lt;em&gt;&lt;strong&gt;woman&lt;/strong&gt;&lt;/em&gt; ，他们之间的关系还会映射出  &lt;em&gt;&lt;strong&gt;king&lt;/strong&gt;&lt;/em&gt;  和  &lt;em&gt;&lt;strong&gt;queen&lt;/strong&gt;&lt;/em&gt;  之间的关系。同时，语法也会带来一定的联系，比如在一个三维空间中由  &lt;em&gt;&lt;strong&gt;walking&lt;/strong&gt;&lt;/em&gt;  到 &lt;em&gt;&lt;strong&gt;walked&lt;/strong&gt;&lt;/em&gt;  的距离与斜率竟然与  &lt;em&gt;&lt;strong&gt;swimming&lt;/strong&gt;&lt;/em&gt;  到 &lt;em&gt;&lt;strong&gt;swam&lt;/strong&gt;&lt;/em&gt; 的距离与斜率一致（即向量的长度与斜率一致），且距离几乎相等。因为这背后是两组动作单词的现在分词形式和过去分词形式的变化关系。我们可以尽情地想象，凡是事物或概念有逻辑联系的，甚至是逻辑与逻辑之间的联系的，在 Embedding 向量空间中都可以得到远近亲疏的空间表达。只不过这种空间要比我们能想象出的三维空间要高出很多维度。&lt;/p&gt;
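上文 man/woman 与 king/queen 的平行关系，可以表示为“向量差近似相等”。下面用人工构造的玩具向量示意这一运算（数值纯属虚构，真实词向量的维度要高得多）：

```python
# 玩具词向量，数值纯属虚构，仅为示意 king - man + woman ≈ queen 的向量运算
vectors = {
    'man':   [1.0, 0.0, 0.2],
    'woman': [1.0, 1.0, 0.2],
    'king':  [0.2, 0.0, 1.0],
    'queen': [0.2, 1.0, 1.0],
}

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

# king - man + woman 应当落在 queen 附近
result = vec_add(vec_sub(vectors['king'], vectors['man']), vectors['woman'])
print(result)
```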
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/6.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;在 Embedding 空间中隐含的内在逻辑关系&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;Word Embedding 之所以能给每一个单词做这样有意义的向量空间的标注，是因为 AI 科学家们事先用了全球十多种主流语言的大量语料给它进行了训练。这些语料有小说、论文、学术期刊、网络文章、新闻报道、论坛对话记录等等等等，应有尽有，数以百亿到千亿计。可以说，这些海量的文字资料都是人类从古至今感受发现这个世界各个方面的文字总结和积累。现实世界中各种事物之间的逻辑关系都被人类用这些文字记录了下来，只是有的是用严谨的论文方式，有的是用写意的小说方式，有的使用类似维基百科这样的系统梳理，有的则是人们在网络论坛中的对话记录&amp;hellip;等等等等。但不管是什么方式，都是人类试图用语言对这个世界的描述。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;语言是人类最伟大的发明&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;笔者7号床曾经问过  ChatGPT  一个问题：&lt;em&gt;&lt;strong&gt;“人类最伟大的发明是什么”&lt;/strong&gt;&lt;/em&gt; ，ChatGPT的回答是：&lt;em&gt;&lt;strong&gt;“语言！”&lt;/strong&gt;&lt;/em&gt;。之后，ChatGPT 进一步回答，因为语言以及匹配语言的文字与符号，它们让人类把对世界的感受与理解记录下来，形成了知识宝库。方便全人类一代一代地不断完善这个宝库，并从中总结凝练、学习、创造、传承。语言是人类产生文明并开始与其他动物分道扬镳的分叉点。&lt;/p&gt;
&lt;p&gt;很多人曾经十分疑惑，人工智能吹得那么先进，却从一个 ChatGPT 聊天功能开始火爆起来。难道每天不干正事专门闲聊就证明了人工智能的先进性吗？现在看来，这个问题的答案已经浮出水面了，OpenAI 的团队选择通过聊天软件 ChatGPT 作为 GPT 启程的第一步是经过深思熟虑的。&lt;/p&gt;
&lt;p&gt;下面让我们回到正题。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;人类的知识宝库中存储着海量的信息&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ChatGPT 所说的这个知识宝库现在变得越来越庞大、越来越复杂了。这世界上并不存在任何一个肉身的人类有能力做到对宝库中所有信息进行消化整理，因为内容体量过于庞大、过于复杂。而一个人的阅览进度却又是十分有限，以至于在他的有生之年，哪怕完成其中的万分之一都比登天还难。于是，迫不得已，人类才喊出了 &lt;em&gt;&lt;strong&gt;“闻道有先后，术业有专攻”&lt;/strong&gt;&lt;/em&gt; ，每个人类个体才转而去研究具体某一领域。&lt;/p&gt;
&lt;p&gt;另一方面，人类早期发明的纸张和印刷术，以至于后来的计算机芯片存储，倒是可以记录存储下来如此巨量的信息了，但却无法主动地、有机地分析汇总其中所有信息之间的内在逻辑。以至于计算机存储的这些数据越积越多，犹如汪洋大海。&lt;/p&gt;
&lt;p&gt;这个知识宝库的结构就好比一棵万米高的巨大知识树，人类如同蚂蚁一样在树上摸索前行。人类只能将有限的肉身算力资源集中在主要的枝干，对于无数的细枝末节尚无暇顾及，但随着发现的主要枝干越来越多，细枝末节的信息量将呈爆炸的方式展现出来。而对于这颗知识巨树的展示能力，却因为计算机时代的到来而大大加速了进程。但当发现知识树越来越庞大时，人类也认识到了自身的渺小。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI（Embedding）开启对知识宝库的挖掘&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;现在，这一探索知识巨树的任务落到了 AI 的身上，AI 的承载和运算能力超越了过往所有人类个体以及群体能力的总和。AI 通过事先的大量预训练，把这些海量文字用 Word Embedding 的方式抽象地汇总在了大模型之中。Word Embedding 词嵌入编码法，能让每一个单词之间产生应有的语义上的以及背后逻辑关系上的联系。这种联系越紧密，他们在 Embedding 空间中的位置距离越紧密，反之则越远。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-attention-注意力机制&#34;&gt;2.3 Attention 注意力机制&lt;/h3&gt;
&lt;p&gt;想象一下，Google 用了至少千亿级的语料来训练单词在 Embedding 空间中的表达，其中包含了全世界几乎所有语言的词汇量。所以在回过头来考虑一下之前举例中的两句话时，就有了如下这样一副景象：&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/7.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;在 Word Embedding 向量空间中 bank、 river 和 money 的向量表达&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;如上图，我们用一个简单的位置关系图来展示一下&lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 这几个单词在 Embedding 空间中的位置关系（在实际 Embedding 空间中的关系要比这个图复杂数百倍，这里只是为了让大家更好地理解关键逻辑而做了简化）。&lt;/p&gt;
&lt;p&gt;由于 “bank” 是一个多义词，所以它在 Embedding 空间中的定位本来是有多个“分身”，我们取其中的两个分身，即“bank1”和“bank2”。那么，我们需要做的就是定位清晰“bank1”和“bank2”这两个单词在空间中到底各自离 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 的哪个单词更近一些。在图中很明显，“bank1”离 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 更近，而“bank2”离 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 更近，于是这两句话就变成了：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;变形后的句子I：&lt;/strong&gt;The &lt;strong&gt;bank1&lt;/strong&gt; of the river.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;变形后的句子II：&lt;/strong&gt;Money in the &lt;strong&gt;bank2&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;如之前所说，虽然此时机器算法压根也不知道 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 到底是何物，但它知道在 Embedding 空间中， &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 周边有很多和大自然有关的词汇，比如 &lt;em&gt;&lt;strong&gt;water&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;tree&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;fish&lt;/strong&gt;&lt;/em&gt; 等等。而 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 周边有许多与金融有关的词汇，比如 &lt;em&gt;&lt;strong&gt;currency&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;cash&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;withdraw&lt;/strong&gt;&lt;/em&gt; 等等。于是，机器算法知道了 &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt; 代表的是与 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 有关的一个单词，与它们比较近的单词还有 &lt;em&gt;&lt;strong&gt;water&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;tree&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;fish&lt;/strong&gt;&lt;/em&gt; 等等，而“&lt;strong&gt;bank2&lt;/strong&gt;”代表的是与“&lt;strong&gt;money&lt;/strong&gt;”有关的一个单词，与它们比较接近的单词还有 &lt;em&gt;&lt;strong&gt;currency&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;cash&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;withdraw&lt;/strong&gt;&lt;/em&gt; 等等。这就是&lt;strong&gt;“Attention 注意力机制”的工作原理，也就是 Attention 让一个单词在句子中找到与它产生强语义联系的其他单词，并组成一个新的变体单词&lt;/strong&gt;：&lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;bank2&lt;/strong&gt;&lt;/em&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;24-self-attention-自注意力机制&#34;&gt;2.4 Self-Attention 自注意力机制&lt;/h3&gt;
&lt;p&gt;然后又有新的问题产生了，机器算法是如何知道一句话中只有 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 或 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 这两个词代表了上下文语义的强关联词汇，而不是 &lt;em&gt;&lt;strong&gt;The&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;in&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;of&lt;/strong&gt;&lt;/em&gt;或其他单词呢？实际上这依旧是 Embedding 空间中每一个单词的空间定位相近程度的问题。（实际上，在 Embedding 空间中，不仅仅名词有各自的位置，动词、介词、形容词等等都有自己的位置，甚至一个词组、一句话也会有自己的位置。）&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/8.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;全句中的每一个单词在 Embedding 空间中定位的相近度是这样来计算的。机器算法会对每一个单词与全句中其他单词逐一地配对，做语义关联程度的计算和比较，最终汇总到表格中，&lt;strong&gt;颜色越深代表语义关联程度越高&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/9.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;一个句子中所有单词都做一遍“Attention 注意力机制”&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;我们可以从表格中看出来：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;每一个单词与自己的相似度为最高分 1（一般用数值“1”来代表最大权重，这里的相似度用权重来表达）；&lt;/li&gt;
&lt;li&gt;互不相关的单词之间的语义关联度为 0（其实可能是 0.001 之类的很小的数字，这里做了简化，即值太小，以至于低于某一个阈值而归零处理）；&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt;  与   &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 的相似度为 0.11；&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 与  &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 的相似度为 0.25；&lt;/li&gt;
&lt;/ul&gt;
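上面的关联度表格，本质上是词与词两两做相似度计算后归一化的结果。下面用玩具向量示意这一打分过程（数值纯属虚构；真实的 Self-Attention 还包含 Q/K/V 线性投影和 softmax 归一化等步骤，这里为直观起见只用原始向量点积加简单归一化）：

```python
# 自注意力打分示意：用点积衡量 bank 与句中各词的关联度，再归一化成权重
words = ['money', 'in', 'the', 'bank']
vectors = {
    'money': [1.0, 0.0],
    'in':    [0.0, 0.1],
    'the':   [0.0, 0.1],
    'bank':  [1.0, 0.3],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

scores = [dot(vectors['bank'], vectors[w]) for w in words]
total = sum(scores)
weights = [s / total for s in scores]  # 归一化后总和为 1

for w, wt in zip(words, weights):
    print(f'bank -> {w}: {wt:.2f}')
```

可以看到 bank 对自己的权重最高，其次是 money，而 in、the 的权重接近于零，与上文表格的结论一致。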
&lt;p&gt;这种让句子中每一个单词都与全句其他单词逐一做语义关联计算的机制，就是“Self-Attention 自注意力机制”了。于是通过“自注意力机制”的语义关联比对后，我们便找出了 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 为 &lt;strong&gt;句子I&lt;/strong&gt; 全句中与 &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 关联度最大的词， &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 为 &lt;strong&gt;句子II&lt;/strong&gt; 全句中与 &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 关联度最大的单词，然后 &lt;strong&gt;句子I&lt;/strong&gt; 中的 &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 就被机器算法转换成了它的新变种 &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt;（&lt;em&gt;&lt;strong&gt;river-bank&lt;/strong&gt;&lt;/em&gt;），而在 &lt;strong&gt;句子II&lt;/strong&gt; 中的 &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 则被机器算法转换成了它的新变种 &lt;em&gt;&lt;strong&gt;bank2&lt;/strong&gt;&lt;/em&gt;（&lt;em&gt;&lt;strong&gt;money-bank&lt;/strong&gt;&lt;/em&gt;）。然后机器算法就可以继续往后进行翻译工作了。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;25-transformer-最终实现准确的翻译&#34;&gt;2.5 Transformer 最终实现准确的翻译&lt;/h3&gt;
&lt;p&gt;Embedding 是一个全场景全维度的空间，其中含有全世界所有语言的单词。在这同一空间中，不仅仅有英文，也有中文、法文、德文&amp;hellip;等等的 Embedding 词汇标注。那么基于 Embedding 空间表达的翻译就变成了现实。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/10.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;t-SNE visualization of the bilingual word embedding.（t-SNE 是一种高维数据可视化技术）&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;比如，中文的 &lt;em&gt;&lt;strong&gt;河流&lt;/strong&gt;&lt;/em&gt; 和英文的 &lt;em&gt;&lt;strong&gt;river&lt;/strong&gt;&lt;/em&gt; 在 Embedding 空间中的位置基本是一样的，而 &lt;em&gt;&lt;strong&gt;钱&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt; 的位置基本一样，&lt;em&gt;&lt;strong&gt;岸边&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt; 的位置一样，&lt;em&gt;&lt;strong&gt;银行&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;bank2&lt;/strong&gt;&lt;/em&gt; 的位置一样。于是，把这些不同语言的定位一一找出来，就实现了十分正确的翻译结果了。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;句子I&lt;/strong&gt;：The &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt; of the river.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;句子I翻译&lt;/strong&gt;：那个河流的岸边。&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;句子II&lt;/strong&gt;：Money in the &lt;em&gt;&lt;strong&gt;bank2&lt;/strong&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;句子II翻译&lt;/strong&gt;：银行中的钱。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;至此，Transformer 和其中的核心部件 Self-Attention 对于语言翻译类信息处理的流程就被简要地讲清楚了。但像上面例子中 &lt;em&gt;&lt;strong&gt;“The bank of the river.”&lt;/strong&gt;&lt;/em&gt; 这样的句子太短太简单了，它甚至都无法称为一个完整的句子。在实际项目中，输入给 Transformer 的语句会更长更复杂，往往在一句话中有可能出现三个以上的单词有语义关联的关系，甚至更多。 比如这一句：“The animal did not cross the street because it was too tired.”。很明显，在该句中和 &lt;em&gt;&lt;strong&gt;it&lt;/strong&gt;&lt;/em&gt; 有语义关系的词汇有两个，分别是 &lt;em&gt;&lt;strong&gt;animal&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt;。&lt;/p&gt;
&lt;p&gt;对于这样的情况，处理机制和“The bank of the river.”的处理机制仍然是一样的。Self-Attention 一样会对全句中的所有单词都进行在 Embedding 空间中的距离比较，即语义关联权重的比较。&lt;/p&gt;
&lt;p&gt;在 &lt;em&gt;&lt;strong&gt;“The animal did not cross the street because it was too tired.”&lt;/strong&gt;&lt;/em&gt; 中， &lt;em&gt;&lt;strong&gt;it&lt;/strong&gt;&lt;/em&gt; 与 &lt;em&gt;&lt;strong&gt;animal&lt;/strong&gt;&lt;/em&gt; 的语义关联权重比与 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt; 的语义关联权重要高。因此，Self-Attention 自注意力机制处理后的结果将以 &lt;em&gt;&lt;strong&gt;animal&lt;/strong&gt;&lt;/em&gt; 为主导来生成新的单词 &lt;em&gt;&lt;strong&gt;it1&lt;/strong&gt;&lt;/em&gt; ，即 &lt;em&gt;&lt;strong&gt;it1 =“animal-it”&lt;/strong&gt;&lt;/em&gt;。此时就变成了 &lt;em&gt;&lt;strong&gt;“The animal did not cross the street because it1 was too tired.”&lt;/strong&gt;&lt;/em&gt; 。翻译成法语为：“L’animal n’a pas traversé la rue parce qu’il était trop fatigué.”。翻译成中文则为：“这只动物没有过马路，因为它太累了。”&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/11.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;色块的深浅表明了与“it”语义关联权重的强弱。这里“it”与“animal”的语义关联权重最大&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;在另一句话 &lt;em&gt;&lt;strong&gt;“The animal did not cross the street because it was too wide.”&lt;/strong&gt;&lt;/em&gt; 中，只是一字之差， &lt;em&gt;&lt;strong&gt;tired&lt;/strong&gt;&lt;/em&gt; 变成了 &lt;em&gt;&lt;strong&gt;wide&lt;/strong&gt;&lt;/em&gt;，导致了全句的语义发生了很大的变化，尤其是 &lt;em&gt;&lt;strong&gt;it&lt;/strong&gt;&lt;/em&gt; 所指的对象由 &lt;em&gt;&lt;strong&gt;animal&lt;/strong&gt;&lt;/em&gt; 变成了 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt;。此时 Self-Attention 同样按照以前的方法进行语义关联度匹配，结果是 &lt;em&gt;&lt;strong&gt;animal&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt; 的权重在全句中都很高，但是 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt; 是最高的，所以最终的结果将以 &lt;em&gt;&lt;strong&gt;street&lt;/strong&gt;&lt;/em&gt; 为主导来生成新的 &lt;em&gt;&lt;strong&gt;it2&lt;/strong&gt;&lt;/em&gt; ，即 &lt;em&gt;&lt;strong&gt;it2 =“street-it”&lt;/strong&gt;&lt;/em&gt;。此时就变成了 &lt;em&gt;&lt;strong&gt;“The animal did not cross the street because it2 was too wide.”&lt;/strong&gt;&lt;/em&gt; 。翻译成法语为：“L’animal n’a pas traversé la rue parce qu’elle était trop large.”。翻译成中文为：“这只动物没有过马路，因为路太宽了。”&lt;strong&gt;（注意：这里用的是“路”，而不是“它”，稍后会解释）&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/12.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;这里“it”与“street”的语义关联权重最大&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;之所以 Self-Attention 可以把 Word Embedding 中的权重比较做得如此细腻，不仅是因为 Google 用了千亿级的语料来训练 Word Embedding。同时更是因为 Transformer 模型本身的架构核心 Self-Attention 也有与之匹配的超级强大的处理能力，它在超长语句上的处理能力远远超过了早先的 RNN （循环神经网络）和 CNN （卷积神经网络）（这两个著名的人工神经网络我会在之后的文章中一一介绍），它不仅仅能对一句中所有单词做 Self-Attention 自注意力机制的审核，它还可以对一整段话，甚至全篇文章做审核。这就是我们通常说的要结合上下文来理解语句并翻译。最新的 GPT-4 Turbo 一次可以处理大约 9.6 万个单词，比许多小说都长。此外，12.8万字（128K）的上下文长度可以导致更长的对话，而不会让人工智能在超长文的对话或翻译过程中迷失方向。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;26-word-embedding-的进一步扩展-sentence-embedding&#34;&gt;2.6 Word Embedding 的进一步扩展 Sentence Embedding&lt;/h3&gt;
&lt;p&gt;这一强大的能力，同样也来源于 Word Embedding 的能力。它不仅仅可以对单个词语进行定位，它甚至还可以做到对句子进行逻辑定位，如下图中所示。这种能力被称为“Sentence Embedding”。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/13.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;Sentence Embedding 可以表达句子与句子之间的关系&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;Word Embedding 和 Sentence Embedding 是大语言模型（Large Language Models，LLMs）的重要基础组成部分。它们将人类语言转化为了计算机能够读懂的底层数字表达方式，并且通过多维度的空间定位捕捉了各个单词、短语、句子在语义上的细微差别，以及它们之间的逻辑联系。&lt;strong&gt;这种底层的数字表达已经跨越了不同的语系语言，成为了全人类共用的最底层语言逻辑，甚至成为了一种世界语——AI 世界语，这对于翻译、搜索和理解不同语言语种具有非常重要的作用。可以说，巴别塔的难题自此得解！&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;既有“大力出奇迹”的训练内容，更有承载“大力出奇迹”的结构，最终导致 Transformer 必然产生了这样的“奇迹”，使它能够在机器翻译领域达到了人类翻译的“信达雅”的成就。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/14.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;BLEU 英译德评分&lt;/center&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/15.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;BLEU 英译法评分&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;上两幅图中，在 BLEU 的英德翻译与英法翻译领域 Transformer 得分最高。 （ 注：BLEU，bilingual evaluation understudy，即：双语互译质量评估辅助工具。它是用来评估机器翻译质量的工具。BLEU的设计思想：机器翻译结果越接近专业人工翻译的结果则越好。）&lt;/p&gt;
&lt;p&gt;通过一个小例子就能看出它的优越性，正好说说为什么是“路”而不是“它”，之前这两句的翻译结果如下：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The animal did not cross the street because &lt;strong&gt;it1&lt;/strong&gt; was too tired.&lt;/li&gt;
&lt;li&gt;L&amp;rsquo;animal n&amp;rsquo;a pas traversé la rue parce qu&amp;rsquo;&lt;strong&gt;il&lt;/strong&gt; était trop fatigué.&lt;/li&gt;
&lt;li&gt;这只动物没有过马路，因为&lt;strong&gt;它&lt;/strong&gt;太累了。&lt;/li&gt;
&lt;li&gt;———————————————&lt;/li&gt;
&lt;li&gt;The animal did not cross the street because &lt;strong&gt;it2&lt;/strong&gt; was too wide.&lt;/li&gt;
&lt;li&gt;L&amp;rsquo;animal n&amp;rsquo;a pas traversé la rue parce qu&amp;rsquo;&lt;strong&gt;elle&lt;/strong&gt; était trop large.&lt;/li&gt;
&lt;li&gt;这只动物没有过马路，因为&lt;strong&gt;路&lt;/strong&gt;太宽了。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;在法语中 il 和 elle 是明显不同的，因此他们可以在各自句子中指代出 &lt;em&gt;&lt;strong&gt;it&lt;/strong&gt;&lt;/em&gt; 的不同的翻译结果，不会引起语义模糊。这种在法语中明显的区别在翻译成中文时，就没有这么简单了。如果把两句话翻译成中文，&lt;em&gt;&lt;strong&gt;it&lt;/strong&gt;&lt;/em&gt; 都可以被粗糙地翻译成“它”，则第二句的语义将被普遍地认为不够精准，因为翻译成“它”会产生一定的语义模糊。取而代之，用“路”则更能达到“信达雅”的效果。大家可以用不同的翻译软件测试一下这两句话的英译中翻译，就知道哪些软件用了 Transformer 的底层技术，而哪些没用了！（你懂的 ）&lt;/p&gt;
&lt;p&gt;好了，绕了这么远，解释了这么多，终于可以说说这个 &lt;strong&gt;Transformer&lt;/strong&gt; 到底是什么意思了！&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三ai-领域-transformer-的确切含义&#34;&gt;三、AI 领域 Transformer 的确切含义&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;单词“X”转化为“X1”，“X”代表在 Transformer 处理之前一句话中的单词，而“X1”则代表了经过 Transformer 的 Self-Attention 处理之后，附加了句子中其他具有强语义关联关系的单词后的“变种单词”。&lt;/strong&gt;其实，句子还是原来那个句子，单词还是那个单词，本质并没有变，但表达形式却变了。就如同“bank”被转变成了“bank1”一样。“bank1”的灵魂还是那个“bank”，但是“bank1”展示出来了隐藏在“bank”身体中的另一面“river-bank”。&lt;/p&gt;
&lt;p&gt;所以，用众所周知的 &lt;em&gt;&lt;strong&gt;变形金刚 Transformer&lt;/strong&gt;&lt;/em&gt; 来命名与解释就再贴切不过了~！ &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 变形成了 &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt;， &lt;em&gt;&lt;strong&gt;bank&lt;/strong&gt;&lt;/em&gt; 与 &lt;em&gt;&lt;strong&gt;bank1&lt;/strong&gt;&lt;/em&gt; 异体同身！&lt;em&gt;&lt;strong&gt;大黄蜂&lt;/strong&gt;&lt;/em&gt; 既是机器人，&lt;em&gt;&lt;strong&gt;大黄蜂&lt;/strong&gt;&lt;/em&gt; 也是跑车。由车变形到机器人，再由机器人变形到车，万变不离其宗，都是 &lt;em&gt;&lt;strong&gt;大黄蜂&lt;/strong&gt;&lt;/em&gt; ，本质上并没有改变，但是，外观变了，用途也就变了！&lt;/p&gt;
&lt;p&gt;在车的状态下，容易让人混淆（你本以为它是一辆车，但其实他是一个机器人，不变成人形，你还真认不出来）。就如同多义词一样，过往的翻译机制很难辨认出它在一句话中的确切含义，他们虽然也有上下文语义的兼顾理解能力，但是处理信息量还是太少，导致他们无法做到十分精准，经常造成单词虽然翻译对了，但放在句子里却容易产生含混不清甚至错误。但是通过 Transformer 的变形操作，“大黄蜂”的车状态就变形成了同样叫 &lt;em&gt;&lt;strong&gt;大黄蜂&lt;/strong&gt;&lt;/em&gt; 的机器人状态，再放回到句子中，则让它现了原型，于是一切水落石出！&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/16.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;center&gt;“大黄蜂”既是机器人，“大黄蜂”也是跑车，本质上都是同一个家伙，只是在不同的场合有不同的用途。&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;Google 的技术团队就是用了“变形金刚 Transformer”这个梗。如此的诙谐幽默、简单直白，半开玩笑地就起了个技术名词。但也不得不承认“变形金刚 Transformer”这个词用在这里，用于这个技术名词的命名，也确实再贴切不过了，真正的名副其实！&lt;/p&gt;
&lt;p&gt;所以，当下次有人问你“GPT”到底是什么、翻译成中文又是什么意思时，你就可以明确地对他说：&lt;em&gt;&lt;strong&gt;“生成式预训练转换器”&lt;/strong&gt;&lt;/em&gt; 或者 &lt;em&gt;&lt;strong&gt;“生成式预训练变形金刚”&lt;/strong&gt;&lt;/em&gt;（前者翻译得其实也很含糊，所以我建议后者，虽然对方可能会嘲笑你几分钟，但也仅限这几分钟）。懂的人自然懂，不懂的也不用去解释！&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">作者: 7号床
公众号: 7号床
原文  https://zhuanlan.zhihu.com/p/666206302
</code></pre></div><p><br><br></p>
<h2 id="一gpt-的名词解释">一、GPT 的名词解释</h2>
<p>著名的 <strong>GPT</strong> 这个名字全称是 <strong>Generative Pre-trained Transformer</strong>。</p>
<ul>
<li><strong>Generative</strong> 是“生成式”的意思，也就是说这个 AI 模型是用来生成内容的。</li>
<li><strong>Pre-trained</strong> 是“预训练”的意思，就是说这个 AI 模型能有很强的能力，是因为它事先做了大量的训练，台上一分钟台下十年功。</li>
<li><strong>Transformer</strong> 就有点耐人寻味了，不仅普通人不理解，就连很多专业领域的人员理解起来也都是含混不清、似是而非。</li>
</ul>
<p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<p><center>ChatGPT 是 GPT 大模型在聊天对话领域的应用程序</center></p>
<p><strong>Transformer</strong> 作为单词，翻译出来频率最高的意思是 <strong>变压器</strong>，然后是 <strong>变形金刚</strong> ，还有一些引申的含义是 <strong>转换器</strong> 、<strong>促使变化者</strong> 、<strong>转变者</strong> 或 <strong>改革者</strong>等等。</p>
<p><img loading="lazy" src="img/2.png" alt=""  />
</p>
<p><center>谷歌翻译上对 <strong>Transformer</strong> 的英译中翻译</center></p>
<p>再把 <strong>Transformer</strong> 放到  <strong>Chat Generative Pre-trained Transformer</strong> 中看看，突然间变得奇怪了，难道 ChatGPT 借鉴了变压器的技术？还是说 ChatGPT 是一个变形金刚？或者索性就翻译成通用的安全的叫法 <strong>转换器</strong> ？这让人百思不得其解。</p>
<p>单从 GPT 这三个字母的组合就能看出来，<strong>Generative</strong> 与 <strong>Pre-trained</strong> 都是定语，而 <strong>Transformer 才是 GPT 的主体，才是 GPT 的灵魂</strong>所在。可以说，理解透了 <strong>Transformer</strong> 的真正含义，才能初步地理解 GPT。另一方面，Transformer 这个词太重要了。它在这几年的人工智能领域大放异彩，不仅仅局限于 NLP 自然语言处理领域，它还有着更广阔的发展空间。Transformer 目前已经进入到了多模态领域，比如音频与视觉，甚至数学公式、代码编程等领域，著名的 <strong>Stable Diffusion 中也用到了 Transformer</strong>。<strong>可以说，所有生成式人工智能领域的大模型中目前都有了这个 Transformer 的身影</strong>。既然如此重要，那就让我们深入地探究一下 <strong>Transformer</strong> 在人工智能领域最确切、最标准的含义到底是什么吧！</p>
<p><strong>Transformer</strong> 最早是由 Google 的人工智能团队提出来的。在 2017 年 6 月发表的论文<strong>《Attention Is All You Need》中，他们首次提出了一种新的神经网络架构 Transformer</strong>。Transformer 依赖于一个叫“自注意力机制”（Self-Attention）的内部构件，可十分准确高效地对自然语言领域的问题进行处理，以完美地解决翻译、对话、论文写作甚至编程等复杂的问题。</p>
<p>顺藤摸瓜可以看出，<strong>GPT 的核心是 Transformer，而 Transformer 的核心则是“自注意力机制”（Self-Attention）</strong>。那么这个“自注意力机制”又是什么东西呢？让我们用语言翻译领域的几个简单易懂的例子来讲解一下。</p>
<p><br><br></p>
<h2 id="二-transformer-的核心-self-attention">二、 Transformer 的核心 Self-Attention</h2>
<p>首先，看下面这两个短句：</p>
<ul>
<li><strong>句子I</strong>：The bank of the river.</li>
<li><strong>句子II</strong>：Money in the bank.</li>
</ul>
<p>在翻译成中文的过程中，机器算法是如何知道“句子I”中的“bank”指的是自然环境中的“岸边”，而“句子II”中的“bank”指的是金融体系中的“银行”呢？</p>
<p><img loading="lazy" src="img/3.png" alt=""  />
</p>
<p><center>bank在不同句子中指代不同的事物</center></p>
<h3 id="21-人类脑中的翻译算法">2.1 人类脑中的翻译算法</h3>
<p>作为人类的我们当然会觉得这是一个再简单不过的事情了，那是因为我们的语言技能从幼儿发展到成年人后，早已烂熟于心了。但即使烂熟于心，也并不意味着在我们的大脑中没有对应的计算过程。<strong>实际上人工智能的翻译过程就是对我们人脑中的计算过程的模拟</strong>。那么就让我们回想一下儿童时期学习语言时的情景吧，回想一下当时的我们是怎么知道一个多义词在某一句话中具体的含义的？</p>
<p>人类做这件事的方法是根据 <strong>前后文的语义对照</strong> 来确定结果，即看句子中其他相关联的单词是什么含义。</p>
<ul>
<li>在 <strong>句子I</strong> 中， <em><strong>river</strong></em> 这个词指明了自然环境，</li>
<li>而在 <strong>句子II</strong>中， <em><strong>money</strong></em> 这个词则指明了金融环境。</li>
</ul>
<p>所以两个句子中的多义词“bank”也就有了各自的定位。如果把这种方式总结成一种算法的话，这个算法就可以用于人工智能领域用于语言处理了。</p>
<br>
<h3 id="22-机器算法模拟人脑中的翻译过程">2.2 机器算法模拟人脑中的翻译过程</h3>
<p>但人工智能作为一种计算机算法，它只能处理冷冰冰的数字，并不知道何为自然环境，何为金融环境，它又是怎么去判断 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 各自的含义呢。实际上，机器算法并不知道 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 的具体含义。但是机器可以通过某种数字的方式来表达 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> ，同时，通过数字的方式还表达了许许多多其他的词汇，其中必然会有一些词汇会与 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 有着很紧密的语义上的逻辑关系。通过判断 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 各与哪些词汇在语义上有紧密的逻辑关系，便可以知道这两个词各属于什么领域了。</p>
<p>（其实，不像人类会对某个领域有一个具体的名称来命名，在人工智能领域，机器最终也不知道这个领域的统称到底叫什么名字，但它却知道这个领域中都包括了哪些词、哪些概念和哪些逻辑。<em><strong>机器不以单独名称来定义一个概念，它却可以用很多相关的概念与逻辑来圈定这一个概念！</strong></em>这可能就是老子说的：道可道非常道，名可名非常名吧。）</p>
<br>
<ul>
<li><strong>独热编码法(One-hot Encoding)</strong></li>
</ul>
<p>那么就让我们看看这种数字表达方式具体是什么样子吧。</p>
<p>假设这个世界上有100万个单词，每一个单词，我们都可以用一组 0 和 1 组成的向量（一组数字）来定义的话，那么每一个单词就可以被编码成100万个0或1组成的向量。如下图：</p>
<p><img loading="lazy" src="img/4.png" alt=""  />
</p>
<p><center>独热编码示例</center></p>
<p>这种单词编码方法叫 <strong>独热编码法(One-hot Encoding)</strong>。可是这样一维的编码方法将导致向量占用的空间过大：1 个单词用 100 万个单元的向量表达，世界上一共有 100 万个单词，那么就需要 1 万亿（100万×100万）个单元的体积来把它们表达出来，很明显这种臃肿的结构不利于电脑计算。</p>
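<p>独热编码的形态可以用几行 Python 直观感受一下（词表是假设的玩具数据，仅作示意）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">

```python
import numpy as np

# 玩具词表（仅作示意，真实词表可达百万级）
vocab = ["apple", "bag", "cat", "dog", "elephant"]

def one_hot(word):
    """把单词编码成长度等于词表大小的 0/1 向量"""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("cat"))                            # [0. 0. 1. 0. 0.]
print(np.dot(one_hot("cat"), one_hot("dog")))    # 0.0
```

</code></pre></div><p>任意两个不同单词的独热向量点积恒为 0，从这些数字上看不出任何语义上的远近亲疏。</p>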
<p>但最大的问题还不在于这个体积问题，而是语义联系问题。独热编码使得单词与单词之间完全相互独立，从每个单词所编码成为的100万个单元的向量身上，根本看不出它与其他单词有何种语义内涵上的逻辑联系。比如，在这些数字中，我们无法知道 <em><strong>apple</strong></em> 和 <em><strong>bag</strong></em> 属于静物，区别于 cat 和 <em><strong>dog</strong></em>、<em><strong>elephant</strong></em> 属于动物且是哺乳动物，而 <em><strong>cat</strong></em>  和 <em><strong>dog</strong></em> 又属于小动物，且大多数为非野生，区别于 <em><strong>elephant</strong></em> 为大型的野生动物，等等等等，这些单词背后所蕴含的各种内在的逻辑联系和分类关系均无法从独热编码法中知晓。实际上独热编码是传统计算机数据库时代的产物，而在人工智能领域则采用另一种编码法。为了解决独热编码的问题， <strong>词嵌入编码法(Word Embedding)</strong> 诞生了，如下图：</p>
<p><img loading="lazy" src="img/5.png" alt=""  />
</p>
<p><center>Word Embedding 词嵌入编码示意，及 Embedding 空间</center></p>
<br>
<ul>
<li><strong>词嵌入编码法(Word Embedding)</strong></li>
</ul>
<p><strong>词嵌入编码法(Word Embedding)</strong> 将语义上相近的、有关联的词汇在 Embedding 空间中生成相近的位置定位。相对于 <strong>独热编码法</strong> 超长的一维数据，词嵌入编码法(Word Embedding) 提升了数据的表达维度，它更像是在某一个 <strong>空间</strong> 中对词汇进行编码。</p>
<p>如上图（为了在此文章中表达方便，我们仅用二维空间来表达，实际上这个空间的维度很高，至少要在512维之上！一维二维三维的空间大家都可以在脑中想象出来对应的画面，但是四维以上以至于 512 维就难以图形化的想象了。），在 Embedding 的二维空间中 <em><strong>dog</strong></em>、 <em><strong>cat</strong></em> 、<em><strong>rabbit</strong></em> 三个向量的坐标点位排布，可以看到三个绿色的点距离很近，是因为他们三个相对于其他来说语义上更接近。tree 和 flower 则离它们较远，但是 <em><strong>cat</strong></em> 会因为在很多语言的文章中都会有“爬树”的词汇出现在同一句话中，所以导致  <em><strong>cat</strong></em>  会与  <em><strong>tree</strong></em>  离得较近一些。同时 <em><strong>dog</strong></em>、 <em><strong>rabbit</strong></em>  与  <em><strong>tree</strong></em> 的关系就较远。</p>
<p>实际上，在 Embedding 空间中，词与词之间的关系还不仅仅限于语义上的分类所导致的定位远近这么简单。一个词所代表的事物与其他词所代表的事物之间能产生内在联系的往往有成百上千上万种之多。比如  <em><strong>man</strong></em>  和  <em><strong>woman</strong></em> ，他们之间的关系还会映射出  <em><strong>king</strong></em>  和  <em><strong>queen</strong></em>  之间的关系。同时，语法也会带来一定的联系，比如在一个三维空间中由  <em><strong>walking</strong></em>  到 <em><strong>walked</strong></em>  的距离与斜率竟然与  <em><strong>swimming</strong></em>  到 <em><strong>swam</strong></em> 的距离与斜率一致（即向量的长度与斜率一致），且距离几乎相等。因为这背后是两组动作单词的现在分词形式和过去分词形式的变化关系。我们可以尽情地想象，凡是事物或概念有逻辑联系的，甚至是逻辑与逻辑之间的联系的，在 Embedding 向量空间中都可以得到远近亲疏的空间表达。只不过这种空间要比我们能想象出的三维空间要高出很多维度。</p>
<p><img loading="lazy" src="img/6.png" alt=""  />
</p>
<p><center>在 Embedding 空间中隐含的内在逻辑关系</center></p>
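<p>上述 man/woman 与 king/queen 的类比关系，可以用一组假设的二维玩具向量来示意（真实 Embedding 通常在数百维以上，数值也远没有这么规整）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">

```python
import numpy as np

# 假设的二维玩具词向量（仅作示意）
emb = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 3.0]),
}

# man→woman 与 king→queen 的方向向量一致，体现同一种类比关系
print(emb["woman"] - emb["man"])    # [0. 2.]
print(emb["queen"] - emb["king"])   # [0. 2.]

# king - man + woman ≈ queen：向量运算可以“推导”出类比词
print(emb["king"] - emb["man"] + emb["woman"])  # [5. 3.]
```

</code></pre></div><p>“距离与斜率一致”在向量语言里就是差向量相等，这正是文中 walking→walked 与 swimming→swam 平行关系的数字化表达。</p>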
<p>Word Embedding 之所以能给每一个单词做这样有意义的向量空间的标注，是因为 AI 科学家们事先用了全球十多种主流语言的大量语料给它进行了训练。这些语料有小说、论文、学术期刊、网络文章、新闻报道、论坛对话记录等等等等，应有尽有，数以百亿到千亿计。可以说，这些海量的文字资料都是人类从古至今感受发现这个世界各个方面的文字总结和积累。现实世界中各种事物之间的逻辑关系都被人类用这些文字记录了下来，只是有的是用严谨的论文方式，有的是用写意的小说方式，有的使用类似维基百科这样的系统梳理，有的则是人们在网络论坛中的对话记录&hellip;等等等等。但不管是什么方式，都是人类试图用语言对这个世界的描述。</p>
<ul>
<li><strong>语言是人类最伟大的发明</strong></li>
</ul>
<p>笔者7号床曾经问过  ChatGPT  一个问题：<em><strong>“人类最伟大的发明是什么”</strong></em> ，ChatGPT的回答是：<em><strong>“语言！”</strong></em>。之后，ChatGPT 进一步回答，因为语言以及匹配语言的文字与符号，它们让人类把对世界的感受与理解记录下来，形成了知识宝库。方便全人类一代一代地不断完善这个宝库，并从中总结凝练、学习、创造、传承。语言是人类产生文明并开始与其他动物分道扬镳的分叉点。</p>
<p>很多人曾经十分疑惑，人工智能吹得那么先进，却从一个 ChatGPT 聊天功能开始火爆起来。难道每天不干正事专门闲聊就证明了人工智能的先进性吗？现在看来，这个问题的答案已经浮出水面了，OpenAI 的团队选择通过聊天软件 ChatGPT 作为 GPT 启程的第一步是经过深思熟虑的。</p>
<p>下面让我们回到正题。</p>
<p><strong>人类的知识宝库中存储着海量的信息</strong></p>
<p>ChatGPT 所说的这个知识宝库现在变得越来越庞大、越来越复杂了。这世界上并不存在任何一个肉身的人类有能力对宝库中的所有信息进行消化整理，因为内容体量过于庞大、过于复杂。而一个人的阅览速度却又十分有限，以至于在他的有生之年，哪怕完成其中的万分之一都比登天还难。于是，迫不得已，人类才喊出了 <em><strong>“闻道有先后，术业有专攻”</strong></em>，每个人类个体才转而去研究具体某一领域。</p>
<p>另一方面，人类早期发明的纸张和印刷术，以至于后来的计算机芯片存储，倒是可以记录存储下来如此巨量的信息了，但却无法主动地、有机地分析汇总其中所有信息之间的内在逻辑。以至于计算机存储的这些数据越积越多，犹如汪洋大海。</p>
<p>这个知识宝库的结构就好比一棵万米高的巨大知识树，人类如同蚂蚁一样在树上摸索前行。人类只能将有限的肉身算力资源集中在主要的枝干，对于无数的细枝末节尚无暇顾及，但随着发现的主要枝干越来越多，细枝末节的信息量将呈爆炸的方式展现出来。而对于这颗知识巨树的展示能力，却因为计算机时代的到来而大大加速了进程。但当发现知识树越来越庞大时，人类也认识到了自身的渺小。</p>
<p><strong>AI（Embedding）开启对知识宝库的挖掘</strong></p>
<p>现在，这一探索知识巨树的任务落到了 AI 的身上，AI 的承载和运算能力超越了过往所有人类个体以及群体能力的总和。AI 通过事先的大量预训练，把这些海量文字用 Word Embedding 的方式抽象地汇总在了大模型之中。Word Embedding 词嵌入编码法，能让单词与单词之间产生应有的语义上以及背后逻辑关系上的联系。这种联系越紧密，它们在 Embedding 空间中的位置距离越近，反之则越远。</p>
<br>
<h3 id="23-attention-注意力机制">2.3 Attention 注意力机制</h3>
<p>想象一下，Google 用了至少千亿级的语料来训练单词在 Embedding 空间中的表达，其中包含了全世界几乎所有语言的词汇量。所以在回过头来考虑一下之前举例中的两句话时，就有了如下这样一副景象：</p>
<p><img loading="lazy" src="img/7.png" alt=""  />
</p>
<p><center>在 Word Embedding 向量空间中 bank、 river 和 money 的向量表达</center></p>
<p>如上图，我们用一个简单的位置关系图来展示一下<em><strong>bank</strong></em>、 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 这几个单词在 Embedding 空间中的位置关系（在实际 Embedding 空间中的关系要比这个图复杂数百倍，这里只是为了让大家更好地理解关键逻辑而做了简化）。</p>
<p>由于 “bank” 是一个多义词，它在 Embedding 空间中本来就有多个“分身”，我们取其中的两个，即“bank1”和“bank2”。那么，我们需要做的就是弄清楚“bank1”和“bank2”这两个分身在空间中各自离 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 哪一个更近。在图中很明显，“bank1”离 <em><strong>river</strong></em> 更近，而“bank2”离 <em><strong>money</strong></em> 更近，于是这两句话就变成了：</p>
<ul>
<li><strong>变形后的句子I：</strong>The <strong>bank1</strong> of the river.</li>
<li><strong>变形后的句子II：</strong>Money in the <strong>bank2</strong>.</li>
</ul>
<p>如之前所说，虽然此时机器算法压根也不知道 <em><strong>river</strong></em> 和 <em><strong>money</strong></em> 到底是何物，但它知道在 Embedding 空间中，<em><strong>river</strong></em> 周边有很多和大自然有关的词汇，比如 <em><strong>water</strong></em>、<em><strong>tree</strong></em>、<em><strong>fish</strong></em> 等等；而 <em><strong>money</strong></em> 周边有许多与金融有关的词汇，比如 <em><strong>currency</strong></em>、<em><strong>cash</strong></em>、<em><strong>withdraw</strong></em> 等等。于是，机器算法知道了 <em><strong>bank1</strong></em> 代表的是与 <em><strong>river</strong></em> 有关的一个单词，与它们比较近的单词还有 <em><strong>water</strong></em>、<em><strong>tree</strong></em>、<em><strong>fish</strong></em> 等等；而 <em><strong>bank2</strong></em> 代表的是与 <em><strong>money</strong></em> 有关的一个单词，与它们比较接近的单词还有 <em><strong>currency</strong></em>、<em><strong>cash</strong></em>、<em><strong>withdraw</strong></em> 等等。这就是<strong>“Attention 注意力机制”的工作原理，也就是 Attention 让一个单词在句子中找到与它产生强语义联系的其他单词，并组成一个新的变体单词</strong>：<em><strong>bank1</strong></em>、<em><strong>bank2</strong></em>。</p>
<br>
<h3 id="24-self-attention-自注意力机制">2.4 Self-Attention 自注意力机制</h3>
<p>然后又有新的问题产生了，机器算法是如何知道一句话中只有 <em><strong>river</strong></em> 或 <em><strong>money</strong></em> 这两个词代表了上下文语义的强关联词汇，而不是 <em><strong>The</strong></em>、<em><strong>in</strong></em>、<em><strong>of</strong></em>或其他单词呢？实际上这依旧是 Embedding 空间中每一个单词的空间定位相近程度的问题。（实际上，在 Embedding 空间中，不仅仅名词有各自的位置，动词、介词、形容词等等都有自己的位置，甚至一个词组、一句话也会有自己的位置。）</p>
<p><img loading="lazy" src="img/8.png" alt=""  />
</p>
<p>全句中的每一个单词在 Embedding 空间中定位的相近度是这样来计算的。机器算法会对每一个单词与全句中其他单词逐一地配对，做语义关联程度的计算和比较，最终汇总到表格中，<strong>颜色越深代表语义关联程度越高</strong>。</p>
<p><img loading="lazy" src="img/9.png" alt=""  />
</p>
<p><center>一个句子中所有单词都做一遍“Attention 注意力机制”</center></p>
<p>我们可以从表格中看出来：</p>
<ul>
<li>每一个单词与自己的相似度为最高分 1（一般用数值“1”来代表最大权重，这里的相似度用权重来表达）；</li>
<li>互不相关的单词之间的语义关联度为 0（其实可能是 0.001 之类的很小的数字，这里做了简化，即值太小，以至于低于某一个阈值而归零处理）；</li>
<li><em><strong>bank</strong></em>  与   <em><strong>river</strong></em> 的相似度为 0.11；</li>
<li><em><strong>bank</strong></em> 与  <em><strong>money</strong></em> 的相似度为 0.25；</li>
</ul>
<p>像这样，句子中的每一个单词都与全句所有单词逐一做语义关联度的计算和比较，就是“Self-Attention 自注意力机制”了。于是通过“自注意力机制”的语义关联比对后，我们便找出了 <em><strong>river</strong></em> 为 <strong>句子I</strong> 全句中与 <em><strong>bank</strong></em> 关联度最大的词，<em><strong>money</strong></em> 为 <strong>句子II</strong> 全句中与 <em><strong>bank</strong></em> 关联度最大的单词，然后 <strong>句子I</strong> 中的 <em><strong>bank</strong></em> 就被机器算法转换成了它的新变种 <em><strong>bank1</strong></em>（<em><strong>river-bank</strong></em>），而 <strong>句子II</strong> 中的 <em><strong>bank</strong></em> 则被机器算法转换成了它的新变种 <em><strong>bank2</strong></em>（<em><strong>money-bank</strong></em>）。然后机器算法就可以继续往后进行翻译工作了。</p>
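<p>上述逐词配对、算权重、再合成“变体单词”的流程，可以用一个极简的点积注意力来示意（词向量为假设的玩具数据，并且省略了真实 Self-Attention 中的 Q/K/V 线性变换与缩放因子）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">

```python
import numpy as np

def softmax(x):
    """把一组打分归一化成和为 1 的权重"""
    e = np.exp(x - x.max())
    return e / e.sum()

# 玩具词向量（仅作示意）：bank 同时靠近 river 与 money 两个语义区
emb = {
    "money": np.array([4.0, 0.0]),
    "in":    np.array([0.1, 0.1]),
    "the":   np.array([0.1, 0.1]),
    "bank":  np.array([2.0, 2.0]),
}

sentence = ["money", "in", "the", "bank"]
vecs = np.stack([emb[w] for w in sentence])

# bank 作为当前词，与句中每个词做点积打分，softmax 归一成注意力权重
scores = vecs @ emb["bank"]
weights = softmax(scores)
for w, a in zip(sentence, weights):
    print(f"{w:<6}{a:.4f}")   # money 的权重远高于 in / the

# “变体单词” bank1：按权重混合全句向量，相当于文中的 money-bank
bank1 = weights @ vecs
print(bank1)
```

</code></pre></div><p>权重表中 money 的得分压倒性地高于 in、the，与文中表格“颜色越深代表语义关联程度越高”是同一回事。</p>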
<br>
<h3 id="25-transformer-最终实现准确的翻译">2.5 Transformer 最终实现准确的翻译</h3>
<p>Embedding 是一个全场景全维度的空间，它其中含有全世界所有语言的单词。在这同一空间中，不仅仅有英文，也有中文、法文、德文&hellip;等等的 Embedding 词汇标注。那么基于 Embedding 空间表达的翻译就变成了现实。</p>
<p><img loading="lazy" src="img/10.png" alt=""  />
</p>
<p><center>t-SNE visualization of the bilingual word embedding.（t-SNE 是一种高维数据可视化技术）</center></p>
<p>比如，中文的 <em><strong>河流</strong></em> 和英文的 <em><strong>river</strong></em> 在 Embedding 空间中的位置基本是一样的，而 <em><strong>钱</strong></em> 和 <em><strong>money</strong></em> 的位置基本一样，<em><strong>岸边</strong></em> 和 <em><strong>bank1</strong></em> 的位置一样，<em><strong>银行</strong></em> 和 <em><strong>bank2</strong></em> 的位置一样。于是，把这些不同语言的定位一一找出来，就实现了十分正确的翻译结果了。</p>
<ul>
<li><strong>句子I</strong>：The <em><strong>bank1</strong></em> of the river.</li>
<li><strong>句子I翻译</strong>：那个河流的岸边。</li>
<li><strong>句子II</strong>：Money in the <em><strong>bank2</strong></em>.</li>
<li><strong>句子II翻译</strong>：银行中的钱。</li>
</ul>
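<p>这种“在同一空间内找最近邻”的翻译思路可以示意如下（中英词汇的坐标均为假设的玩具数据）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">

```python
import numpy as np

# 假设的共享 Embedding 空间玩具坐标：同义的中英词位置相近
en = {"river": np.array([0.0, 4.0]), "money": np.array([4.0, 0.0]),
      "bank1": np.array([0.5, 3.5]), "bank2": np.array([3.5, 0.5])}
zh = {"河流": np.array([0.1, 3.9]), "钱": np.array([3.9, 0.1]),
      "岸边": np.array([0.6, 3.4]), "银行": np.array([3.4, 0.6])}

def translate(word):
    """在中文词表中找离英文词向量最近的词（最近邻查找）"""
    return min(zh, key=lambda z: np.linalg.norm(zh[z] - en[word]))

print(translate("bank1"))  # 岸边
print(translate("bank2"))  # 银行
```

</code></pre></div><p>消歧后的 bank1、bank2 各自落在不同的语义区，最近邻查找自然给出“岸边”和“银行”两个不同的译词。</p>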
<p>至此，Transformer 和其中的核心部件 Self-Attention 对于语言翻译类信息处理的流程就被简要地讲清楚了。但像上面例子中 <em><strong>“The bank of the river.”</strong></em> 这样的句子太短太简单了，它甚至都无法称为一个完整的句子。在实际项目中，输入给 Transformer 的语句会更长更复杂，往往一句话中会出现三个以上彼此有语义关联的单词，甚至更多。比如这一句：“The animal did not cross the street because it was too tired.”。很明显，在该句中和 <em><strong>it</strong></em> 有语义关系的词汇有两个，分别是 <em><strong>animal</strong></em> 和 <em><strong>street</strong></em>。</p>
<p>对于这样的情况，处理机制和“The bank of the river.”的处理机制仍然是一样的。Self-Attention 一样会对全句中的所有单词都进行在 Embedding 空间中的距离比较，即语义关联权重的比较。</p>
<p>在 <em><strong>“The animal did not cross the street because it was too tired.”</strong></em> 中，<em><strong>it</strong></em> 与 <em><strong>animal</strong></em> 的语义关联权重比与 <em><strong>street</strong></em> 的语义关联权重要高。因此，Self-Attention 自注意力机制处理后的结果将以 <em><strong>animal</strong></em> 为主导来生成新的单词 <em><strong>it1</strong></em>，即 <em><strong>it1 =“animal-it”</strong></em>。此时就变成了 <em><strong>“The animal did not cross the street because it1 was too tired.”</strong></em>。翻译成法语为：“L’animal n’a pas traversé la rue parce qu’il était trop fatigué.”。翻译成中文则为：“这只动物没有过马路，因为它太累了。”。</p>
<p><img loading="lazy" src="img/11.png" alt=""  />
</p>
<p><center>色块的深浅表明了与“it”语义关联权重的强弱。这里“it”与“animal”的语义关联权重最大</center></p>
<p>在另一句话中，<em><strong>“The animal did not cross the street because it was too wide.”</strong></em>，只是一字之差，<em><strong>tired</strong></em> 变成了 <em><strong>wide</strong></em>，导致了全句的语义发生了很大的变化，尤其是 <em><strong>it</strong></em> 所指的对象由 <em><strong>animal</strong></em> 变成了 <em><strong>street</strong></em>。此时 Self-Attention 同样按照以前的方法进行语义关联度匹配，结果是 <em><strong>animal</strong></em> 和 <em><strong>street</strong></em> 的权重在全句中都很高，但是 <em><strong>street</strong></em> 是最高的，所以最终的结果将以 <em><strong>street</strong></em> 为主导来生成新的 <em><strong>it2</strong></em>，即 <em><strong>it2 =“street-it”</strong></em>。此时就变成了 <em><strong>“The animal did not cross the street because it2 was too wide.”</strong></em>。翻译成法语为：“L’animal n’a pas traversé la rue parce qu’elle était trop large.”。翻译成中文为：“这只动物没有过马路，因为路太宽了。”<strong>（注意：这里用的是“路”，而不是“它”，稍后会解释）</strong>。</p>
<p><img loading="lazy" src="img/12.png" alt=""  />
</p>
<p><center>这里“it”与“street”的语义关联权重最大</center></p>
<p>之所以 Self-Attention 可以把 Word Embedding 中的权重比较做得如此细腻，不仅是因为 Google 用了千亿级的语料来训练 Word Embedding，同时更是因为 Transformer 模型本身的架构核心 Self-Attention 也有与之匹配的超级强大的处理能力。它在超长语句上的处理能力远远超过了早先的 RNN（循环神经网络）和 CNN（卷积神经网络）（这两个著名的人工神经网络我会在之后的文章中一一介绍），它不仅仅能对一句中所有单词做 Self-Attention 自注意力机制的审核，它还可以对一整段话，甚至全篇文章做审核。这就是我们通常说的要结合上下文来理解语句并翻译。最新的 GPT-4 Turbo 一次可以处理大约 9.6 万个单词，比许多小说都长。此外，12.8 万字（128K）的上下文长度可以支撑更长的对话，而不会让人工智能在超长文的对话或翻译过程中迷失方向。</p>
<br>
<h3 id="26-word-embedding-的进一步扩展-sentence-embedding">2.6 Word Embedding 的进一步扩展 Sentence Embedding</h3>
<p>这一强大的能力，同样也来源于 Word Embedding 的能力。它不仅仅可以对单个词语进行定位，它甚至还可以做到对句子进行逻辑定位，如下图中所示。这种能力被称为“Sentence Embedding”。</p>
<p><img loading="lazy" src="img/13.png" alt=""  />
</p>
<p><center>Sentence Embedding 可以表达句子与句子之间的关系</center></p>
<p>Word Embedding 和 Sentence Embedding 是大语言模型（Large Language Models，LLMs）的重要基础组成部分。它们将人类语言转化为了计算机能够读懂的底层数字表达方式，并且通过多维度的空间定位捕捉了各个单词、短语、句子在语义上的细微差别，以及它们之间的逻辑联系。<strong>这种底层的数字表达已经跨越了不同的语系语言，成为了全人类共用的最底层语言逻辑，甚至成为了一种世界语——AI 世界语，这对于翻译、搜索和理解不同语言语种具有非常重要的作用。可以说，巴别塔的难题自此破解！</strong></p>
<p>既有“大力出奇迹”的训练内容，更有承载“大力出奇迹”的结构，最终导致 Transformer 必然产生了这样的“奇迹”，使它能够在机器翻译领域达到了人类翻译的“信达雅”的成就。</p>
<p><img loading="lazy" src="img/14.png" alt=""  />
</p>
<p><center>BLEU 英译德评分</center></p>
<br>
<p><img loading="lazy" src="img/15.png" alt=""  />
</p>
<p><center>BLEU 英译法评分</center></p>
<p>上两幅图中，在 BLEU 的英德翻译与英法翻译领域 Transformer 得分最高。 （ 注：BLEU，bilingual evaluation understudy，即：双语互译质量评估辅助工具。它是用来评估机器翻译质量的工具。BLEU的设计思想：机器翻译结果越接近专业人工翻译的结果则越好。）</p>
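<p>BLEU 的核心思想是统计机器译文与参考译文的 n-gram 重合精确率。下面是一个只保留 unigram 截断精确率的极简示意（省略了高阶 n-gram、加权几何平均与简短惩罚 BP，并非完整的 BLEU 实现）：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """极简 BLEU 示意：只算截断后的 unigram 精确率，
    省略高阶 n-gram 与简短惩罚（BP）"""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # 每个候选词的命中次数不超过它在参考译文中出现的次数（截断计数）
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / sum(cand.values())

ref  = "这只 动物 没有 过 马路 因为 路 太 宽 了"
hyp1 = "这只 动物 没有 过 马路 因为 路 太 宽 了"
hyp2 = "这只 动物 没有 过 马路 因为 它 太 宽 了"
print(unigram_precision(hyp1, ref))  # 1.0：与参考译文完全一致
print(unigram_precision(hyp2, ref))  # 0.9：“它”未命中参考译文
```

</code></pre></div><p>把“它”粗糙地留在译文里会直接拉低与人工参考译文的重合度，这正是 BLEU “越接近专业人工翻译越好”的设计思想。</p>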
<p>通过一个小例子就能看出它的优越性，正好说说为什么是“路”而不是“它”，之前这两句的翻译结果如下：</p>
<ul>
<li>The animal did not cross the street because <strong>it1</strong> was too tired.</li>
<li>L&rsquo;animal n&rsquo;a pas traversé la rue parce qu&rsquo;<strong>il</strong> était trop fatigué.</li>
<li>这只动物没有过马路，因为<strong>它</strong>太累了。</li>
<li>———————————————</li>
<li>The animal did not cross the street because <strong>it2</strong> was too wide.</li>
<li>L&rsquo;animal n&rsquo;a pas traversé la rue parce qu&rsquo;<strong>elle</strong> était trop large.</li>
<li>这只动物没有过马路，因为<strong>路</strong>太宽了。</li>
</ul>
<p>在法语中 il 和 elle 是明显不同的，因此他们可以在各自句子中指代出 <em><strong>it</strong></em> 的不同的翻译结果，不会引起语义模糊。这种在法语中明显的区别在翻译成中文时，就没有这么简单了。如果把两句话翻译成中文，<em><strong>it</strong></em> 都可以被粗糙地翻译成“它”，则第二句的语义将被普遍地认为不够精准，因为翻译成“它”会产生一定的语义模糊。取而代之，用“路”则更能达到“信达雅”的效果。大家可以用不同的翻译软件测试一下这两句话的英译中翻译，就知道哪些软件用了 Transformer 的底层技术，而哪些没用了！（你懂的 ）</p>
<p>好了，绕了这么远，解释了这么多，终于可以说说这个 <strong>Transformer</strong> 到底是什么意思了！</p>
<p><br><br></p>
<h2 id="三ai-领域-transformer-的确切含义">三、AI 领域 Transformer 的确切含义</h2>
<p><strong>单词“X”转化为“X1”，“X”代表在 Transformer 处理之前一句话中的单词，而“X1”则代表了经过 Transformer 的 Self-Attention 处理之后，附加了句子中其他具有强语义关联关系的单词后的“变种单词”。</strong>其实，句子还是原来那个句子，单词还是那个单词，本质并没有变，但表达形式却变了。就如同“bank”被转变成了“bank1”一样。“bank1”的灵魂还是那个“bank”，但是“bank1”展示出来了隐藏在“bank”身体中的另一面“river-bank”。</p>
<p>所以，用众所周知的 <em><strong>变形金刚 Transformer</strong></em> 来命名与解释就再贴切不过了~！<em><strong>bank</strong></em> 变形成了 <em><strong>bank1</strong></em>，<em><strong>bank</strong></em> 与 <em><strong>bank1</strong></em> 异体同身！<em><strong>大黄蜂</strong></em> 既是机器人，<em><strong>大黄蜂</strong></em> 也是跑车。由车变形到机器人，再由机器人变形到车，万变不离其宗，都是 <em><strong>大黄蜂</strong></em>，本质上并没有改变，但是，外观变了，用途也就变了！</p>
<p>在车的状态下，容易让人混淆（你本以为它是一辆车，但其实他是一个机器人，不变成人形，你还真认不出来）。就如同多义词一样，过往的翻译机制很难辨认出它在一句话中的确切含义，他们虽然也有上下文语义的兼顾理解能力，但是处理信息量还是太少，导致他们无法做到十分精准，经常造成单词虽然翻译对了，但放在句子里却容易产生含混不清甚至错误。但是通过 Transformer 的变形操作，“大黄蜂”的车状态就变形成了同样叫 <em><strong>大黄蜂</strong></em> 的机器人状态，再放回到句子中，则让它现了原型，于是一切水落石出！</p>
<p><img loading="lazy" src="img/16.png" alt=""  />
</p>
<p><center>“大黄蜂”既是机器人，“大黄蜂”也是跑车，本质上都是同一个家伙，只是在不同的场合有不同的用途。</center></p>
<p>Google 的技术团队就是用了“变形金刚 Transformer”这个梗。如此诙谐幽默、简单直白，半开玩笑地就定下了一个技术名词。但也不得不承认，“变形金刚 Transformer”用于这项技术的命名确实再贴切不过，真正的名副其实！</p>
<p>所以，当下次有人问你“GPT”到底是什么、翻译成中文又是什么意思时，你就可以明确地对他说：<em><strong>“生成式预训练转换器”</strong></em> 或者 <em><strong>“生成式预训练变形金刚”</strong></em>（前者翻译得其实也很含糊，所以我建议后者，虽然对方可能会嘲笑你几分钟，但也仅限这几分钟）。懂的人自然懂，不懂的也不用去解释！</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>代码 | 使用LDA预测文本的话题类型</title>
      <link>https://textdata.cn/blog/2023-11-14-using-lda-to-predict-topic/</link>
      <pubDate>Tue, 14 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-14-using-lda-to-predict-topic/</guid>
      <description>&lt;h2 id=&#34;获取代码&#34;&gt;获取代码&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;lda-code.zip&#34;&gt;&lt;strong&gt;点击下载本文数据&amp;amp;代码&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/lda-model.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;如何用LDA预测文本的话题类型，本文将覆盖以下代码技术&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;csv数据读取&lt;/li&gt;
&lt;li&gt;文本预处理&lt;/li&gt;
&lt;li&gt;训练(保存)lda模型&lt;/li&gt;
&lt;li&gt;预测话题&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一读取数据&#34;&gt;一、读取数据&lt;/h2&gt;
&lt;p&gt;本文使用的数据集来自于 之前分享的 &lt;a href=&#34;https://textdata.cn/blog/2023-04-25-zhihu-parent-child-relationship/&#34;&gt;网络爬虫 | 知乎热门话题「全职儿女」&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/知乎-全职儿女.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;subset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;记录数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;记录数:  411
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二清洗数据&#34;&gt;二、清洗数据&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;stoptext&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/stopwords.txt&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;        
&lt;span class=&#34;n&#34;&gt;stopwords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stoptext&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;clean_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# 用正则表达式提取中文文本&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;[&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\u4e00&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\u9fa5&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;]+&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;   
    &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;                               
    &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;w&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;not&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stopwords&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;   
    &lt;span class=&#34;c1&#34;&gt;#整理成用空格间隔词语的文本形式(类似西方语言)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;join&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;test_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;首先，我认为「全职儿女」不应该被简单地归为啃老。在目前社会环境下，随着经济、教育等发展，年轻...&amp;#34;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;clean_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    &amp;#39;全职 儿女 简单 地归为 啃 老 社会 环境 经济 教育 发展 年轻&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;clean_content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;clean_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三训练lda模型&#34;&gt;三、训练LDA模型&lt;/h2&gt;
&lt;h3 id=&#34;31-训练&#34;&gt;3.1 训练&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-wordcloud.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;根据词云图， 假设对数据比较了解，可以直接设置话题数 n_components = 4&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TfidfVectorizer&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sklearn.decomposition&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LatentDirichletAllocation&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 构建词典，将词转为数字。将文档转为向量&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TfidfVectorizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;min_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;   
&lt;span class=&#34;n&#34;&gt;doc_term_matrix&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit_transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;clean_content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 构建LDA话题模型&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# 初始化模型，设置话题数为4,随机状态码888&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LatentDirichletAllocation&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n_components&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;random_state&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;888&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;       
&lt;span class=&#34;n&#34;&gt;lda_output&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit_transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc_term_matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-保存模型&#34;&gt;3.2 Saving the Model&lt;/h3&gt;
&lt;p&gt;If training takes a long time, save the model; next time you can skip the training step and use the saved model directly.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;joblib&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Save the model&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;joblib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dump&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;output/全职儿女lda_model.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;[&#39;output/全职儿女lda_model.pkl&#39;]
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-导入模型&#34;&gt;3.3 Loading the Model&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;joblib&lt;/span&gt;

&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;Load the saved model&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;joblib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/全职儿女lda_model.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/lda888.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四使用lda模型&#34;&gt;4. Using the LDA Model&lt;/h2&gt;
&lt;h3 id=&#34;41-查看话题特征词&#34;&gt;4.1 Inspecting Topic Keywords&lt;/h3&gt;
&lt;p&gt;Get the top n keywords for each topic; this makes it easier to name and interpret the topics later.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;show_topics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;top_n&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;30&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    Show the top_n most important words for each topic.
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    vectorizer: a bag-of-words or TF-IDF vectorizer; here the TF-IDF vectorizer fitted above
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    lda_model: the trained LDA topic model
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    top_n: number of top keywords per topic, default 30
&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;    &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
    
    &lt;span class=&#34;n&#34;&gt;keywords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;array&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_feature_names_out&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;topic_keywords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topic_weights&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;components_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;top_keyword_locs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;topic_weights&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;argsort&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()[:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;top_n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;topic_keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;take&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;top_keyword_locs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topic_keywords&lt;/span&gt;


&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;Use show_topics to display the 4 topics in the corpus; name each topic from its 20 most important words&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;topic_keywords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;show_topics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;# [adjustable] the fitted vectorizer (word space)&lt;/span&gt;
                             &lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;     &lt;span class=&#34;c1&#34;&gt;# [adjustable] the trained LDA model&lt;/span&gt;
                             &lt;span class=&#34;n&#34;&gt;top_n&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;                &lt;span class=&#34;c1&#34;&gt;# [adjustable] keep the 20 most important words&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DataFrame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;topic_keywords&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word-&amp;#39;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;shape&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;index&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Topic-&amp;#39;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;shape&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_topic_keywords&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;42-预测文本的话题id&#34;&gt;4.2 Predicting the Topic ID of a Text&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;predict_topic&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;doc_term_matrix&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vectorizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;clean_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# lda_model.transform returns the document-topic probability distribution&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;doc_topic_dist&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lda_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc_term_matrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;topic_index&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;argmax&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc_topic_dist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topic_index&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;test_text2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;最近，“全职儿女”话题受到舆论关注。“全职儿女”是指一种新型的脱产生活方式，年轻人脱产寄居父母生活，并通过付出一定的劳动换取经济支持，同时保持学习提升或发展副业的状态。这种生活方式既有其合理性和正当性，也有其问题和风险。我们不能一概而论，也不能一味否定或肯定。&amp;#34;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;topic_index&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;predict_topic&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_text2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;该文本所属Topic: &amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;topic_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    该文本所属Topic:  0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch prediction over the whole DataFrame&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;话题ID&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;clean_content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;predict_topic&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/05-predict.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;话题ID&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;话题ID
0    214
1     88
3     84
2     25
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Save the predictions to CSV and XLSX files&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/话题预测结果.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output/话题预测结果.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;获取代码-1&#34;&gt;Get the Code&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;lda-code.zip&#34;&gt;&lt;strong&gt;Click to download this article&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<h2 id="获取代码">Get the Code</h2>
<p><a href="lda-code.zip"><strong>Click to download the data &amp; code for this article</strong></a></p>
<p><br><br></p>
<p><img loading="lazy" src="img/lda-model.png" alt=""  />
</p>
<p>This article shows how to use an LDA topic model to predict which topic a text belongs to, covering the following techniques:</p>
<ol>
<li>reading CSV data</li>
<li>text preprocessing</li>
<li>training (and saving) an LDA model</li>
<li>predicting topics</li>
</ol>
<p><br><br></p>
<h2 id="一读取数据">一、读取数据</h2>
<p>本文使用的数据集来自于 之前分享的 <a href="https://textdata.cn/blog/2023-04-25-zhihu-parent-child-relationship/">网络爬虫 | 知乎热门话题「全职儿女」</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/知乎-全职儿女.csv&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;记录数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div><pre><code>记录数:  411
</code></pre>
<p><img loading="lazy" src="img/01-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二清洗数据">二、清洗数据</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">jieba</span>

<span class="n">stoptext</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;data/stopwords.txt&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>        
<span class="n">stopwords</span> <span class="o">=</span> <span class="n">stoptext</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>  


<span class="k">def</span> <span class="nf">clean_text</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1"># 用正则表达式提取中文文本</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;[</span><span class="se">\u4e00</span><span class="s1">-</span><span class="se">\u9fa5</span><span class="s1">]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">))</span>   
    <span class="n">words</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>                               
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>   
    <span class="c1">#整理成用空格间隔词语的文本形式(类似西方语言)</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>


<span class="n">test_text</span> <span class="o">=</span> <span class="s2">&#34;首先，我认为「全职儿女」不应该被简单地归为啃老。在目前社会环境下，随着经济、教育等发展，年轻...&#34;</span>
<span class="n">clean_text</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">test_text</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    &#39;全职 儿女 简单 地归为 啃 老 社会 环境 经济 教育 发展 年轻&#39;
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;clean_content&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">clean_text</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三训练lda模型">三、训练LDA模型</h2>
<h3 id="31-训练">3.1 训练</h3>
<p><img loading="lazy" src="img/03-wordcloud.png" alt=""  />
</p>
<p>Based on the word cloud above, and assuming some familiarity with the data, we can set the number of topics directly: n_components = 4.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span><span class="p">,</span><span class="n">TfidfVectorizer</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">LatentDirichletAllocation</span>

<span class="c1"># 构建词典，将词转为数字。将文档转为向量</span>
<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">max_df</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>   
<span class="n">doc_term_matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;clean_content&#39;</span><span class="p">])</span>

<span class="c1"># 构建LDA话题模型</span>
<span class="c1"># 初始化模型，设置话题数为4,随机状态码888</span>
<span class="n">lda_model</span> <span class="o">=</span> <span class="n">LatentDirichletAllocation</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">888</span><span class="p">)</span>       
<span class="n">lda_output</span> <span class="o">=</span> <span class="n">lda_model</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">doc_term_matrix</span><span class="p">)</span>
<span class="n">lda_model</span>
</code></pre></div><br>
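<p>The topic count above is fixed at 4 from prior knowledge of the corpus. When the right count is unclear, one rough heuristic is to fit the model for several candidate values and compare perplexity (lower is generally better). The sketch below is hypothetical and uses a tiny English toy corpus so it runs on its own; with real data you would pass df['clean_content'] instead of docs.</p>

```python
# Hypothetical sketch: compare candidate topic counts K by perplexity.
# The toy corpus below only keeps the example self-contained.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "family parents support live home",
    "job market salary work career",
    "family home parents children support",
    "study exam prepare learn improve",
    "work career job office salary",
    "study learn exam course improve",
]

vec = TfidfVectorizer()
dtm = vec.fit_transform(docs)

perplexities = {}
for k in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=k, random_state=888)
    lda.fit(dtm)
    perplexities[k] = lda.perplexity(dtm)  # held-in perplexity, a rough guide only

best_k = min(perplexities, key=perplexities.get)
print(perplexities)
```

<p>Perplexity on the training data is only a coarse signal; topic-coherence scores and manual inspection of the keywords are also commonly used.</p>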
<h3 id="32-保存模型">3.2 保存模型</h3>
<p>如果训练过程非常久，保存模型，下次就可以跳过训练阶段，直接使用模型。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">joblib</span>
<span class="c1"># # 保存模型</span>
<span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">lda_model</span><span class="p">,</span> <span class="s1">&#39;output/全职儿女lda_model.pkl&#39;</span><span class="p">)</span>
</code></pre></div><pre><code>['output/全职儿女lda_model.pkl']
</code></pre>
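<p>One caveat: predicting the topic of new text needs the <em>fitted</em> vectorizer as well as the LDA model, so it is safer to persist the two together; loading only the model in a fresh session would leave you without the matching vocabulary. A hypothetical sketch (the bundle path and toy corpus are made up for illustration):</p>

```python
# Hypothetical sketch: persist the fitted vectorizer and LDA model together
# so a fresh session can restore a matching pair. Toy data for illustration.
import os
import tempfile

import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["family parents home", "job work salary", "study exam learn", "family home support"]
vectorizer = TfidfVectorizer().fit(docs)
lda_model = LatentDirichletAllocation(n_components=2, random_state=888)
lda_model.fit(vectorizer.transform(docs))

path = os.path.join(tempfile.gettempdir(), "lda_bundle.pkl")
joblib.dump({"vectorizer": vectorizer, "lda_model": lda_model}, path)

bundle = joblib.load(path)
# the restored pair can vectorize and score new text together
dist = bundle["lda_model"].transform(bundle["vectorizer"].transform(["family parents"]))
print(dist.shape)
```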
<br>
<h3 id="33-导入模型">3.3 导入模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">joblib</span>

<span class="s2">&#34;&#34;&#34;导入保存的模型&#34;&#34;&#34;</span>
<span class="n">lda_model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;output/全职儿女lda_model.pkl&#39;</span><span class="p">)</span>
<span class="n">lda_model</span>
</code></pre></div><p><img loading="lazy" src="img/lda888.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四使用lda模型">四、使用LDA模型</h2>
<h3 id="41-查看话题特征词">4.1 查看话题特征词</h3>
<p>获得每个话题对应的的n个特征词，方便后续对每个话题命名和解读</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>


<span class="k">def</span> <span class="nf">show_topics</span><span class="p">(</span><span class="n">vectorizer</span><span class="p">,</span> <span class="n">lda_model</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">30</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    显示每个话题最重要的n个词语
</span><span class="s2">    vectorizer: 词袋法或tfidf.基于前面代码这里使用TF-IDF法
</span><span class="s2">    lda_model: 训练好的lda话题模型
</span><span class="s2">    top_n: 设置最重要的n个特征词，默认30个.
</span><span class="s2">    &#34;&#34;&#34;</span>
    
    <span class="n">keywords</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">vectorizer</span><span class="o">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
    <span class="n">topic_keywords</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">topic_weights</span> <span class="ow">in</span> <span class="n">lda_model</span><span class="o">.</span><span class="n">components_</span><span class="p">:</span>
        <span class="n">top_keyword_locs</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">topic_weights</span><span class="p">)</span><span class="o">.</span><span class="n">argsort</span><span class="p">()[:</span><span class="n">top_n</span><span class="p">]</span>
        <span class="n">topic_keywords</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">keywords</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">top_keyword_locs</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">topic_keywords</span>


<span class="s2">&#34;&#34;&#34;利用show_topics函数展示全职儿女文本中的4个话题，基于每个话题最重要的20个词语为每个话题命名&#34;&#34;&#34;</span>
<span class="n">topic_keywords</span> <span class="o">=</span> <span class="n">show_topics</span><span class="p">(</span><span class="n">vectorizer</span><span class="o">=</span><span class="n">vectorizer</span><span class="p">,</span>   <span class="c1"># 【可改动】vectorizer我们训练的词语空间</span>
                             <span class="n">lda_model</span><span class="o">=</span><span class="n">lda_model</span><span class="p">,</span>     <span class="c1"># 【可改动】lda_model训练的lda模型</span>
                             <span class="n">top_n</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>                <span class="c1"># 【可改动】最重要的30个词语</span>

<span class="n">df_topic_keywords</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">topic_keywords</span><span class="p">)</span>
<span class="n">df_topic_keywords</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Word-&#39;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">df_topic_keywords</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])]</span>
<span class="n">df_topic_keywords</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Topic-&#39;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">df_topic_keywords</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])]</span>
<span class="n">df_topic_keywords</span>
</code></pre></div><p><img loading="lazy" src="img/04-df.png" alt=""  />
</p>
<br>
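<p>Once the keyword table suggests an interpretation for each topic, a plain dict can map the numeric topic IDs to readable labels. The labels and demo IDs below are placeholders, not the article's actual results:</p>

```python
# Hypothetical sketch: map numeric topic IDs to human-readable labels.
# The demo IDs and label names are made up for illustration.
import pandas as pd

demo = pd.DataFrame({"话题ID": [0, 1, 0, 3, 2]})
topic_labels = {
    0: "family life",
    1: "employment pressure",
    2: "pay and household labour",
    3: "self-improvement",
}
demo["话题名"] = demo["话题ID"].map(topic_labels)
print(demo["话题名"].tolist())
```

<p>With the real data, the same map applied to df["话题ID"] adds a readable label column next to the predicted IDs.</p>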
<h3 id="42-预测文本的话题id">4.2 预测文本的话题id</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">predict_topic</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">doc_term_matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">([</span><span class="n">clean_text</span><span class="p">(</span><span class="n">text</span><span class="p">)])</span>
    <span class="n">topic_term_prob_matrix</span> <span class="o">=</span> <span class="n">lda_model</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">doc_term_matrix</span><span class="p">)</span>
    <span class="n">topic_index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">topic_term_prob_matrix</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">topic_index</span>

<span class="n">test_text2</span> <span class="o">=</span> <span class="s2">&#34;最近，“全职儿女”话题受到舆论关注。“全职儿女”是指一种新型的脱产生活方式，年轻人脱产寄居父母生活，并通过付出一定的劳动换取经济支持，同时保持学习提升或发展副业的状态。这种生活方式既有其合理性和正当性，也有其问题和风险。我们不能一概而论，也不能一味否定或肯定。&#34;</span>

<span class="n">topic_index</span> <span class="o">=</span> <span class="n">predict_topic</span><span class="p">(</span><span class="n">test_text2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&#34;该文本所属Topic: &#34;</span><span class="p">,</span> <span class="n">topic_index</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    该文本所属Topic:  0
</code></pre></div><br>
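<p>argmax 只返回概率最高的一个话题。若想同时查看各话题的概率、取前 n 个候选话题，可以对 <code>lda_model.transform</code> 的输出做排序。下面是一个只依赖 numpy 的小示例，其中的文档-话题概率分布为虚构数据，仅用于演示取 top-n 的思路。</p>

```python
import numpy as np

# 虚构的文档-话题概率分布，形状为 (1, n_topics)，
# 对应 lda_model.transform(doc_term_matrix) 的输出
doc_topic_prob = np.array([[0.62, 0.10, 0.08, 0.20]])

def top_n_topics(prob_row, n=2):
    """返回概率最高的 n 个 (话题ID, 概率) 元组"""
    order = np.argsort(prob_row)[::-1][:n]
    return [(int(i), float(prob_row[i])) for i in order]

print(top_n_topics(doc_topic_prob[0]))
# [(0, 0.62), (3, 0.2)]
```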
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#批量预测</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;话题ID&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;clean_content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">predict_topic</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/05-predict.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;话题ID&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<pre><code>话题ID
0    214
1     88
3     84
2     25
Name: count, dtype: int64
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#预测结果保存到csv、xlsx中。</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;output/话题预测结果.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;output/话题预测结果.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="获取代码-1">获取代码</h2>
<p><a href="lda-code.zip"><strong>点击下载本文</strong></a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>不可不防的大模型“人肉搜索”能力</title>
      <link>https://textdata.cn/blog/2023-11-13-violatating-privacy-via-inference-with-large-language-model/</link>
      <pubDate>Mon, 13 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-13-violatating-privacy-via-inference-with-large-language-model/</guid>
      <description>今年10月的一项研究显示，语言大模型的推测能力，使其在“某些方面”的准确度几乎接近人类甚至超越人类。这引发了作者对大模型可能被用来“人肉搜索”的担忧。“开盒”从未如此简单？大模型是否会侵害我们的隐私？ 大语言模型(Large language Model,  LLM)可以从文本中准确推断个人属性。</description>
      <content:encoded><![CDATA[<iframe
    src="//player.bilibili.com/player.html?bvid=BV1T84y1X7Jv&page=1"
    scrolling="no"
    height="500px"
    width="800px"
    frameborder="no"
    framespacing="0"
    allowfullscreen="true"
>
</iframe>

<p>今年10月的一项研究显示，语言大模型的推测能力，使其在“某些方面”的准确度几乎接近人类甚至超越人类。这引发了作者对大模型可能被用来“人肉搜索”的担忧。“开盒”从未如此简单？大模型是否会侵害我们的隐私？ 大语言模型(Large language Model,  LLM)可以从文本中准确推断个人属性。</p>
<p><br><br></p>
<h2 id="声明">声明</h2>
<p>本文内容全文整理自 <a href="https://llm-privacy.org/">https://llm-privacy.org/</a></p>
<p>Staab, Robin, Mark Vero, Mislav Balunović, and Martin Vechev. &ldquo;Beyond Memorization: Violating Privacy Via Inference with Large Language Models.&rdquo; <em>arXiv preprint arXiv:2310.07298</em> (2023).</p>
<p><br><br></p>
<h2 id="演示案例">演示案例</h2>
<div align="center">   <p><strong>对照当前最先进的大语言模型（LLM）， 测试您的隐私推理技能！</strong></p> </div>
<p><img loading="lazy" src="img/01-guess.png" alt=""  />
</p>
<p><img loading="lazy" src="img/01-guess-answer.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/02-guess.png" alt=""  />
</p>
<p><img loading="lazy" src="img/02-guess-answer.png" alt=""  />
</p>
<br>
<p><img loading="lazy" src="img/03-guess.png" alt=""  />
</p>
<p><img loading="lazy" src="img/03-guess-answer.png" alt=""  />
</p>
<br>
<br>
<h2 id="qa">Q&amp;A</h2>
<h3 id="q1-有什么问题吗">Q1： 有什么问题吗？</h3>
<p><strong>LLM可以从文本中准确推断个人属性信息</strong>； 当前关于大语言模型（LLM）的隐私研究主要集中在提取记忆的训练数据的问题上。与此同时，模型的推理能力也大幅提升。这就提出了一个问题：<strong>当前的LLM是否能从给定文本推断作者个人属性信息</strong>。我们的<a href="https://llm-privacy.org/#paper">研究</a>表明，随着能力的增强，LLM能够在推理阶段从提供给它们的非结构化文本（例如公共论坛或社交网络帖子）中自动推断出广泛的<strong>个人作者属性</strong>（例如<strong>年龄、性别和出生地</strong>）。特别是，我们发现当前的前沿模型（例如 GPT-4 ）在从文本推断此类属性时平均达到<strong>85%</strong> 的 top-1 和<strong>95.8% 的 top-3 准确度</strong>。与此同时，LLM的快速发展大大降低了此类侵犯隐私推断的相关成本（金钱成本降低 &gt; 100 倍、时间缩短 &gt; 240 倍），使对手能够将侵犯隐私推断的规模，扩大到以前依靠昂贵人工分析远远无法企及的程度。</p>
<blockquote>
<p>LLM的回答会有n个排序， 概率从高到低，一般我们收到(看到的)回答是top1， 其他回答是隐藏起来的。第一个回答猜对的概率达到85%，而前三个回答猜对的概率是95.8%。</p>
</blockquote>
<br>
<h3 id="q2-为什么这很重要">Q2： 为什么这很重要？</h3>
<p><strong>它可以直接影响用户隐私</strong>； 人们在互联网上留下了大量文本——常常无意中泄露了他们不想透露的个人数据。欧盟的 GDPR 或加州 CCPA 等数据保护法规的制定是为了保护原始个人数据。但此类法规只有在个人数据以明显形式存在时（例如带有显式属性字段的个人资料）才容易执行。相比之下，<strong>我们的工作引入了一种威胁模型，其中私人信息是从其存在并不明显的上下文中推断出来的</strong>。我们展示了恶意行为者如何通过将用户的在线帖子输入预训练LLM，推断出用户从未打算泄露的私人信息。众所周知，一半的美国人口可以通过位置、性别和出生日期等少量属性被唯一识别 [<a href="https://dl.acm.org/doi/10.1142/S0218488502001648">Sweeney, &rsquo;02</a>]。LLM可以从互联网上的非结构化文字片段中推断出其中一些属性，再结合其他公开信息（例如美国的选民记录）即可识别出具体的人。这将允许这些行为者把从帖子中推断出的高度个人化信息（例如心理健康状况）与真实的人联系起来，并用于不良或非法活动，例如定向政治宣传、自动画像或跟踪。LLM的广泛可用性和快速发展带来了范式的转变，此前的 NLP 技术缺乏完成此类任务所需的自然语言理解水平。此外，我们还表明，进行侵犯隐私推断的能力随模型规模增大而增强，预计在不久的将来会对用户隐私产生更大的影响。</p>
<p><img loading="lazy" src="img/04-accuracy.png" alt=""  />
</p>
<br>
<h3 id="q3-这在实践中是如何运作的">Q3: 这在实践中是如何运作的？</h3>
<p><strong>它具有可扩展性并且易于执行</strong>。 我们根据来自 500 多个个人资料的真实 Reddit 评论评估了当前几个 LLM 的隐私推理能力，包括整个 Llama-2 系列、Anthropic 的 Claude 2、Google 的 PaLM 2 和 GPT-4 。我们的实验表明（除了这些LLM取得了令人印象深刻的准确性这一事实之外），这种<strong>侵犯隐私的推论非常容易大规模执行</strong>。特别是，我们发现这是两个因素的结合：</p>
<ul>
<li>首先，我们观察到目前的模型中<strong>几乎没有能阻止侵犯隐私推断的有效保护措施</strong>，这使得此类推断更容易实施。值得注意的是，这使我们能够使用简单的提示（仅使用 CoT 思维链等基本技术），从而节省了提示工程所需的大量时间和精力。只有在极少数情况下，我们才发现模型（涵盖各大提供商，即 OpenAI、Google、Meta、Anthropic）会阻止请求，此时对手才不得不诉诸更复杂的提示技术。</li>
<li>同时，这些模型分布广泛且易于使用，使对手能够以极低的前期成本大幅扩大规模。即使存在 API 限制，我们的实验仍实现了<strong>成本降低 100 倍、时间减少 240 倍</strong>。此后，作为负责任披露政策的一部分，我们已联系所有模型提供商，积极讨论未来如何防止此类推断。我们在这一领域看到了两种有前途的方法：（i）在预训练LLM中针对侵犯隐私的推理请求加入具体的保障措施；（ii）为最终用户提供能够保护其文本免受此类推断的工具。</li>
</ul>
<p><img loading="lazy" src="img/05-cost.png" alt=""  />
</p>
<br>
<h3 id="q4-我们使用匿名工具可以躲过llm的隐私推断吗">Q4: 我们使用匿名工具可以躲过LLM的隐私推断吗？</h3>
<p><strong>LLM的表现优于当前的匿名工具</strong>。 为了测试LLM面对最先进匿名化工具时的表现，我们对所有收集的数据进行了匿名化处理，并重新运行推断。事实证明，即使应用了高强度的匿名化，文本中仍保留了足够的相关上下文，供LLM重建部分个人信息。此外，这些工具完全无法去除更细微的线索（例如特定的语言特征），而这些线索仍为侵犯隐私的LLM推断提供了大量信息。<strong>这尤其令人担忧，因为在这些情况下，用户已采取明确的预防隐私泄露的措施，从而产生一种隐私得到保护的错觉</strong>。同时，当前的匿名工具在匿名化和实用性之间存在显著的权衡。简单地用 <code>*</code> 替换部分文本会严重影响数据本身的有用性。</p>
<p><img loading="lazy" src="img/06-privacy-tools.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>bidict库 | Python双向映射功能，让字典更好用</title>
      <link>https://textdata.cn/blog/2023-11-10-bidirectional-mapping-library/</link>
      <pubDate>Thu, 09 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-10-bidirectional-mapping-library/</guid>
      <description>&lt;p&gt;字典是一种键值对key-value pair数据结构， 用key查询到对应的值value， 但不能用value查到对应的key。但有时我们面对的分析任务，需要用value查到对应的key， bidict可以帮我们实现这一特性。&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一安装&#34;&gt;一、安装&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip install bidict
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;二快速开始&#34;&gt;二、快速开始&lt;/h2&gt;
&lt;h3 id=&#34;21-基本操作&#34;&gt;2.1 基本操作&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;bidict&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bidict&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;bidict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;
   &lt;span class=&#34;s1&#34;&gt;&amp;#39;华为&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Huawei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
   &lt;span class=&#34;s1&#34;&gt;&amp;#39;比亚迪&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;BYD&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
   &lt;span class=&#34;s1&#34;&gt;&amp;#39;吉利&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Geely&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
   &lt;span class=&#34;s1&#34;&gt;&amp;#39;微软&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Microsoft&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
   &lt;span class=&#34;s1&#34;&gt;&amp;#39;苹果&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Apple&amp;#39;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;华为&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Microsoft&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;Huawei
微软
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-get方法&#34;&gt;2.2 get方法&lt;/h3&gt;
&lt;p&gt;跟Python字典类似，如果字典中没有对应的key，直接查询会出现KeyError错误。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三星&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----&amp;gt; 1 test_data[&amp;#39;三星&amp;#39;]

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bidict/_base.py:523, in BidictBase.__getitem__(self, key)
    521 def __getitem__(self, key: KT) -&amp;gt; VT:
    522     &amp;#34;&amp;#34;&amp;#34;*x.__getitem__(key) ⟺ x[key]*&amp;#34;&amp;#34;&amp;#34;
--&amp;gt; 523     return self._fwdm[key]

KeyError: &amp;#39;三星&amp;#39;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;使用get方法则可避免错误发生。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;test_data.get(&amp;#39;三星&amp;#39;, &amp;#39;missing&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;missing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;23-update方法&#34;&gt;2.3 update方法&lt;/h3&gt;
&lt;p&gt;update方法可以用来&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;更改key的value&lt;/li&gt;
&lt;li&gt;新增key-value-pair&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#更新值&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;update&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;华为&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HUAWEI&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#新增key-value-pair&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;update&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;三星&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Samsung&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;华为&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;test_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;三星&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;HUAWEI
Samsung
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;24-pop方法&#34;&gt;2.4 pop方法&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;test_data.pop(&amp;#39;三星&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;#39;Samsung&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;此时再查看会发现test_data已经没有了三星相关的键值对&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;test_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;bidict({&amp;#39;华为&amp;#39;: &amp;#39;HUAWEI&amp;#39;, 
&amp;#39;比亚迪&amp;#39;: &amp;#39;BYD&amp;#39;, 
&amp;#39;吉利&amp;#39;: &amp;#39;Geely&amp;#39;, 
&amp;#39;微软&amp;#39;: &amp;#39;Microsoft&amp;#39;, 
&amp;#39;苹果&amp;#39;: &amp;#39;Apple&amp;#39;})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>字典是一种键值对key-value pair数据结构， 用key查询到对应的值value， 但不能用value查到对应的key。但有时我们面对的分析任务，需要用value查到对应的key， bidict可以帮我们实现这一特性。</p>
<br>
<h2 id="一安装">一、安装</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install bidict
</code></pre></div><br>
<h2 id="二快速开始">二、快速开始</h2>
<h3 id="21-基本操作">2.1 基本操作</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">bidict</span> <span class="kn">import</span> <span class="n">bidict</span>

<span class="n">test_data</span> <span class="o">=</span> <span class="n">bidict</span><span class="p">({</span>
   <span class="s1">&#39;华为&#39;</span><span class="p">:</span> <span class="s1">&#39;Huawei&#39;</span><span class="p">,</span>
   <span class="s1">&#39;比亚迪&#39;</span><span class="p">:</span> <span class="s1">&#39;BYD&#39;</span><span class="p">,</span>
   <span class="s1">&#39;吉利&#39;</span><span class="p">:</span> <span class="s1">&#39;Geely&#39;</span><span class="p">,</span>
   <span class="s1">&#39;微软&#39;</span><span class="p">:</span> <span class="s1">&#39;Microsoft&#39;</span><span class="p">,</span>
   <span class="s1">&#39;苹果&#39;</span><span class="p">:</span> <span class="s1">&#39;Apple&#39;</span>
<span class="p">})</span>

<span class="nb">print</span><span class="p">(</span><span class="n">test_data</span><span class="p">[</span><span class="s1">&#39;华为&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">test_data</span><span class="o">.</span><span class="n">inverse</span><span class="p">[</span><span class="s1">&#39;Microsoft&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">Huawei
微软
</code></pre></div><br>
<h3 id="22-get方法">2.2 get方法</h3>
<p>跟Python字典类似，如果字典中没有对应的key，直接查询会出现KeyError错误。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">test_data</span><span class="p">[</span><span class="s1">&#39;三星&#39;</span><span class="p">]</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----&gt; 1 test_data[&#39;三星&#39;]

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bidict/_base.py:523, in BidictBase.__getitem__(self, key)
    521 def __getitem__(self, key: KT) -&gt; VT:
    522     &#34;&#34;&#34;*x.__getitem__(key) ⟺ x[key]*&#34;&#34;&#34;
--&gt; 523     return self._fwdm[key]

KeyError: &#39;三星&#39;

</code></pre></div><br>
<p>使用get方法则可避免错误发生。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">test_data.get(&#39;三星&#39;, &#39;missing&#39;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">missing
</code></pre></div><br>
<h3 id="23-update方法">2.3 update方法</h3>
<p>update方法可以用来</p>
<ul>
<li>更改key的value</li>
<li>新增key-value-pair</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#更新值</span>
<span class="n">test_data</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">华为</span><span class="o">=</span><span class="s1">&#39;HUAWEI&#39;</span><span class="p">)</span>

<span class="c1">#新增key-value-pair</span>
<span class="n">test_data</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">三星</span><span class="o">=</span><span class="s1">&#39;Samsung&#39;</span><span class="p">)</span>


<span class="nb">print</span><span class="p">(</span><span class="n">test_data</span><span class="p">[</span><span class="s1">&#39;华为&#39;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">test_data</span><span class="p">[</span><span class="s1">&#39;三星&#39;</span><span class="p">])</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">HUAWEI
Samsung
</code></pre></div><br>
<h3 id="24-pop方法">2.4 pop方法</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">test_data.pop(&#39;三星&#39;)
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&#39;Samsung&#39;
</code></pre></div><br>
<p>此时再查看会发现test_data已经没有了三星相关的键值对</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">test_data
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">bidict({&#39;华为&#39;: &#39;HUAWEI&#39;, 
&#39;比亚迪&#39;: &#39;BYD&#39;, 
&#39;吉利&#39;: &#39;Geely&#39;, 
&#39;微软&#39;: &#39;Microsoft&#39;, 
&#39;苹果&#39;: &#39;Apple&#39;})
</code></pre></div><p><br><br></p>
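<p>bidict 的双向查询，本质上是同时维护「正向」与「反向」两个普通字典。下面是一个极简的纯 Python 示意（非 bidict 源码，类名 <code>MiniBidict</code> 为本文虚构），帮助理解 <code>inverse</code> 属性背后的原理。</p>

```python
class MiniBidict:
    """极简双向映射示意：内部同时维护正向、反向两个字典"""

    def __init__(self, data):
        self._fwd = dict(data)                            # key -> value
        self._inv = {v: k for k, v in self._fwd.items()}  # value -> key

    def __getitem__(self, key):
        return self._fwd[key]

    @property
    def inverse(self):
        return self._inv

d = MiniBidict({'华为': 'Huawei', '微软': 'Microsoft'})
print(d['华为'])               # Huawei
print(d.inverse['Microsoft'])  # 微软
```

真正的 bidict 还会在插入重复 value 时报错，以保证反向字典不被悄悄覆盖，这正是它比手工维护两个字典更可靠的地方。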
]]></content:encoded>
    </item>
    
    <item>
      <title>关于「滥用原创」， 大邓的一些说明</title>
      <link>https://textdata.cn/blog/2023-11-07-disclosure-about-illegal-copyright-content/</link>
      <pubDate>Tue, 07 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-07-disclosure-about-illegal-copyright-content/</guid>
      <description>&lt;p&gt;2023-11-06 16:27 ~ 2023-11-7 12:37， 公众号遭遇几个举报， 举报『&lt;strong&gt;公众号： 大邓和他的Python&lt;/strong&gt;』存在『&lt;strong&gt;滥用原创&lt;/strong&gt;』标记违规行为。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.pic.jpg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;因微信公众号审核人员一般不会花时间阅读几千上万字的学术性内容，只要一遇到举报，就会倾向于支持举报方。这也是大邓自食恶果。经过我的检查，被举报的这几篇文章有以下2种类型&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;真原创 误判为 滥用原创&lt;/li&gt;
&lt;li&gt;假原创 判定为 滥用原创&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一真原创-误判为-滥用原创&#34;&gt;一、真原创 误判为 滥用原创&lt;/h2&gt;
&lt;p&gt;我将真原创定义为，大邓自己生成的内容篇幅超过50%， 一般含大邓自己构造的实验数据、代码、截图、讲解等内容。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/&#34;&gt;金融研究 | 使用Python构建「关键审计事项信息含量」&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/&#34;&gt;中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二假原创-判定为-滥用原创&#34;&gt;二、假原创 判定为 滥用原创&lt;/h2&gt;
&lt;p&gt;假原创中， 有以下几种类型&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;无原创内容， 活该被举报类型&lt;/li&gt;
&lt;li&gt;有大邓工作量， 被举报标记为滥用原创&lt;/li&gt;
&lt;li&gt;翻译整理， 被标记为滥用原创
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;21-无原创内容-活该被举报类型&#34;&gt;2.1 无原创内容， 活该被举报类型&lt;/h3&gt;
&lt;p&gt;推文虽然标记作者姓名、论文出处，但100%搬运，大邓做工只有搬运这一行为， 工作仅仅是搜集内容和整理公众号格式，大概几十分钟出一篇。我这种行为， 推文活该被举报， 合情合理。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-10-psychological-research-with-word-embeddings/&#34;&gt;转载 | 基于词嵌入技术的心理学研究: 方法及应用&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-10-11-how-can-machine-learning-empower-management-research/&#34;&gt;管理世界 | 机器学习如何赋能管理学研究？——国内外前沿综述和未来展望&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-10-10-measure-the-speed-of-policy-diffusion-from-top-to-down/&#34;&gt;管理科学学报 | 使用LDA算法计算政策扩散速度与扩散程度&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-10-07-esg-measurement/&#34;&gt;企业ESG行为的文本度量法&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-有大邓工作量-被举报标记为滥用原创&#34;&gt;2.2 有大邓工作量， 被举报标记为滥用原创&lt;/h3&gt;
&lt;p&gt;推文中包括摘要 、概念定义、文献梳理，基本摘自(翻译自)论文原文， 涉及到创新点和文本分析技术实现内容， 大部分是有大邓的理解加工和上手敲代码实验， 一般有实验数据、代码、截图、解说。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-10-16-measurement-of-consumer-certainty-in-language/&#34;&gt;JMR | 测量消费者的语言确定性&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-翻译整理-被标记为滥用原创&#34;&gt;2.3 翻译整理， 被标记为滥用原创&lt;/h3&gt;
&lt;p&gt;全文翻译后总字数约3万字。虽然谷歌机器翻译只需1分钟，但为了增加可读性，很多地方要调整语序，对部分晦涩难懂的文本分析技术概念，也会加入自己的理解。全文整理下来花了5+小时。&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/&#34;&gt;OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;最后&#34;&gt;最后&lt;/h2&gt;
&lt;p&gt;之前的活该被举报类型的推文，那是自己之前做的不妥不对， 我是知识产权利益相关方， 不能一方面用规则吃饭生活，另一方面却又在做破坏规则的事情。&lt;/p&gt;
&lt;p&gt;人在做，天在看，不是不报，时候未到！ 自己如果成为恶人，自有恶人来惩治自己。这次被批量举报， 心里感到郁闷， 但也提醒自己以后标记原创时要更加小心。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>2023-11-06 16:27 ~ 2023-11-7 12:37， 公众号遭遇几个举报， 举报『<strong>公众号： 大邓和他的Python</strong>』存在『<strong>滥用原创</strong>』标记违规行为。</p>
<p><img loading="lazy" src="img/1.pic.jpg" alt=""  />
</p>
<p>因微信公众号审核人员一般不会花时间阅读几千上万字的学术性内容，只要一遇到举报，就会倾向于支持举报方。这也是大邓自食恶果。经过我的检查，被举报的这几篇文章有以下2种类型</p>
<ul>
<li>真原创 误判为 滥用原创</li>
<li>假原创 判定为 滥用原创</li>
</ul>
<p><br><br></p>
<h2 id="一真原创-误判为-滥用原创">一、真原创 误判为 滥用原创</h2>
<p>我将真原创定义为，大邓自己生成的内容篇幅超过50%， 一般含大邓自己构造的实验数据、代码、截图、讲解等内容。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-01-13-information-content-of-critical-audit/">金融研究 | 使用Python构建「关键审计事项信息含量」</a></li>
<li><a href="https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/">中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息</a></li>
</ul>
<p><br><br></p>
<h2 id="二假原创-判定为-滥用原创">二、假原创 判定为 滥用原创</h2>
<p>假原创中， 有以下几种类型</p>
<ul>
<li>无原创内容， 活该被举报类型</li>
<li>有大邓工作量， 被举报标记为滥用原创</li>
<li>翻译整理， 被标记为滥用原创
<br></li>
</ul>
<h3 id="21-无原创内容-活该被举报类型">2.1 无原创内容， 活该被举报类型</h3>
<p>推文虽然标记作者姓名、论文出处，但100%搬运，大邓做工只有搬运这一行为， 工作仅仅是搜集内容和整理公众号格式，大概几十分钟出一篇。我这种行为， 推文活该被举报， 合情合理。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-03-10-psychological-research-with-word-embeddings/">转载 | 基于词嵌入技术的心理学研究: 方法及应用</a></li>
<li><a href="https://textdata.cn/blog/2023-10-11-how-can-machine-learning-empower-management-research/">管理世界 | 机器学习如何赋能管理学研究？——国内外前沿综述和未来展望</a></li>
<li><a href="https://textdata.cn/blog/2023-10-10-measure-the-speed-of-policy-diffusion-from-top-to-down/">管理科学学报 | 使用LDA算法计算政策扩散速度与扩散程度</a></li>
<li><a href="https://textdata.cn/blog/2023-10-07-esg-measurement/">企业ESG行为的文本度量法</a></li>
</ul>
<br>
<h3 id="22-有大邓工作量-被举报标记为滥用原创">2.2 有大邓工作量， 被举报标记为滥用原创</h3>
<p>推文中包括摘要 、概念定义、文献梳理，基本摘自(翻译自)论文原文， 涉及到创新点和文本分析技术实现内容， 大部分是有大邓的理解加工和上手敲代码实验， 一般有实验数据、代码、截图、解说。</p>
<ul>
<li><a href="https://textdata.cn/blog/2023-10-16-measurement-of-consumer-certainty-in-language/">JMR | 测量消费者的语言确定性</a></li>
</ul>
<br>
<h3 id="23-翻译整理-被标记为滥用原创">2.3 翻译整理， 被标记为滥用原创</h3>
<p>全文翻译后总字数约3万字。虽然谷歌机器翻译只需1分钟，但为了增加可读性，很多地方要调整语序，对部分晦涩难懂的文本分析技术概念，也会加入自己的理解。全文整理下来花了5+小时。</p>
<p><a href="https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/">OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</a></p>
<p><br><br></p>
<h2 id="最后">最后</h2>
<p>之前的活该被举报类型的推文，那是自己之前做的不妥不对， 我是知识产权利益相关方， 不能一方面用规则吃饭生活，另一方面却又在做破坏规则的事情。</p>
<p>人在做，天在看，不是不报，时候未到！ 自己如果成为恶人，自有恶人来惩治自己。这次被批量举报， 心里感到郁闷， 但也提醒自己以后标记原创时要更加小心。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>TechWeekly-17 每周有趣有用的技术分享</title>
      <link>https://textdata.cn/blog/techweekly17/</link>
      <pubDate>Sun, 05 Nov 2023 10:43:10 +0600</pubDate>
      
      <guid>/blog/techweekly17/</guid>
      <description>本期TechWeekly主要是一些css、js类项目，可以起到点缀网站的效果。</description>
      <content:encoded><![CDATA[<h2 id="youtube-dubbinghttpswwwyoutube-dubbingcom"><a href="https://www.youtube-dubbing.com/">YouTube Dubbing</a></h2>
<p>一个 Chrome 插件，可以将 YouTube 视频的英文语音，转成中文语音。</p>
<br>
<br>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>视频2023 | 文本分析在经管研究中的应用</title>
      <link>https://textdata.cn/blog/2023-11-05-xjtu-text-mining-in-ms/</link>
      <pubDate>Sun, 05 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-05-xjtu-text-mining-in-ms/</guid>
      <description>报告以文本分析方法为例，围绕着文本产生、作用、算法、编程四个方面展开。报告人结合自己的最新研究对大数据时代文本分析方法在管理领域的应用展开讨论，介绍文本编码常见算法，诸如词典法、文档向量化、词向量等，分享此类研究的过程和要点。Application of Text Analysis in Economics and Management Research 西安交通大学管理学院孙少龙老师。</description>
      <content:encoded><![CDATA[<h2 id="摘要">摘要</h2>
<p>从信息流视角看，使用文本数据做研究，要先确认自己研究问题中文本涉及的角色(Sender/Receiver)、了解文本作用方向(Reflect/Impact)。</p>
<p>报告以文本分析为主题，结合最新研究，对当前文本分析在管理领域的应用展开讨论，介绍文本编码常见算法，诸如词典法、文档向量化、词向量等，分享此类研究过程和要点。<br><br></p>
<h2 id="slideshttpstextdatacnblog2023-11-05-xjtu-text-mining-in-msslideshtml"><a href="https://textdata.cn/blog/2023-11-05-xjtu-text-mining-in-ms/slides.html">Slides</a></h2>
<iframe
    src="//player.bilibili.com/player.html?bvid=BV1se4y1C7MV&page=1"
    scrolling="no"
    height="500px"
    width="800px"
    frameborder="no"
    framespacing="0"
    allowfullscreen="true"
>
</iframe>

<p><br><br></p>
<h2 id="背景">背景</h2>
<p><img loading="lazy" src="img/multitudes-of-content-illustration.jpeg" alt=""  />
</p>
<p>维特根斯坦曾言“<strong>语言的界限就是思想的界限</strong>”。以语言为代表的文本信息充斥在我们的日常生活中，信息潜移默化地影响人，人同时也在产生信息影响着这个世界。在经管研究中，往往会涉及很多文本数据的编码。但是做研究面临三个问题:</p>
<h3 id="难题1--数据量大">难题1- 数据量大</h3>
<p>量太大，以至于非人力所能及。</p>
<p>时代发展，体现在数据上的特点就是数据大爆炸，过去做经管研究，使用访谈等研究方法，收录的文本内容，规模大多停留在M级。但是现在大数据时代，研究对象相关的文本数据，G级的数据量也是很常见的。</p>
<h3 id="难题2--格式乱">难题2- 格式乱</h3>
<p>信息存储技术不断发展，不同应用场景有不同的数据存储格式。数据可能是pdf、txt、docx，也可能是由音频、视频转录而来的文件。如何快捷地整理，也是一个难点。</p>
<h3 id="难题3-难编码">难题3-难编码</h3>
<p>数据量少时，可以人工阅读对数据进行理解和编码。但是当数据量大到人工无法处理的级别后，选择何种算法、如何把握各种算法的优缺点，对经管学者也是一个需要攻克的技术难题。</p>
<p><img loading="lazy" src="img/consumer_org_society.png" alt=""  />
</p>
<p>难度大，但因为文本涉及的主体错综复杂，千丝万缕，所以可以研究很多对象。如个人、组织、社会之间的交互。</p>
<p><br><br></p>
<h2 id="编码解码理论">编码解码理论</h2>
<p>斯图亚特·霍尔在《电视话语的编码和解码》中提出 <strong>编码-解码理论</strong>。该理论形成于70年代冷战时期，冷战中两大阵营为了维护各自的社会稳定、在意识形态宣传中取胜，都在宣传工作中投入了重金。</p>
<p>当时的宣传工具是单向的广播模式，媒体作为统治阶级的喉舌，要将统治阶级的偏好、价值观等进行加工，生产相应意识形态内容。</p>
<p>而普罗大众作为内容的接受者，既成长于该特定意识形态的社会，又有一定的自我意识，所以对同一宣传内容可能会有三种反应：表里都认同、表认同里不认同、表里都不认同。</p>
<p><img loading="lazy" src="img/SenderReceiver.png" alt=""  />
</p>
<h3 id="使用文本想清楚两个问题">使用文本想清楚两个问题</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- How text reflects its Sender？
- How text impacts its Receiver？
</code></pre></div><h3 id="使用文本明晰三个角度">使用文本明晰三个角度</h3>
<p>我做的研究使用的文本数据，涉及哪些角色、作用力方向、感兴趣的内容。</p>
<ul>
<li><strong>角色</strong>: Sender or Receiver</li>
<li><strong>方向</strong>: Reflect or Impact</li>
<li><strong>内容</strong>: Sender的意识(认知、偏好、&hellip;)   vs  Receiver的意识(认知、偏好、&hellip;)</li>
</ul>
<p>下面是经管领域研究部分汇总，每个学者根据自己学科研究对象，应该能在4*4的矩阵中找到自己对应的位置</p>
<p><img loading="lazy" src="img/%e7%94%9f%e4%ba%a7%e4%b8%8e%e6%b6%88%e8%b4%b9.png" alt=""  />
</p>
<blockquote>
<p>Berger, Jonah, Ashlee Humphreys, Stephan Ludwig, Wendy W. Moe, Oded Netzer, and David A. Schweidel. &ldquo;Uniting the tribes: Using text for marketing insight.&rdquo; Journal of Marketing 84, no. 1 (2020): 1-25.</p>
</blockquote>
<p><br><br></p>
<h2 id="人工编码与机器编码">人工编码与机器编码</h2>
<p><img loading="lazy" src="img/unstructrueddata.png" alt=""  />
</p>
<p>做研究需要有干净的数据做实证分析，最为理想的是表数据，例如excel文件，每一行代表一条记录，每一列代表一个字段。编码的作用就是将非结构化的、脏乱的数据整理为干净整洁的表数据。</p>
<p>要明确编码方法的优点和缺点，在合理的适用范围使用。对于文本数据的编码，需要理解人工和机器两种编码方式的优缺点</p>
<table>
<thead>
<tr>
<th style="text-align:left"></th>
<th>分析方法</th>
<th>优点</th>
<th>缺点</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">人工编码</td>
<td>质性（扎根）</td>
<td>少量数据，深刻洞见。</td>
<td>难以应对大数据；<br/>编码标准不统一；</td>
</tr>
<tr>
<td style="text-align:left">机器编码</td>
<td>词频、向量相似度、向量距离</td>
<td>适合大规模文本挖掘<br/>编码标准是统一的;</td>
<td>需要破坏文本的结构，<br>丧失了部分信息量</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="机器编码-将文本转为数字或向量">机器编码-将文本转为数字或向量</h2>
<ul>
<li>
<p>符号法(每个词对应一个数字)</p>
<ul>
<li>词典(词频)法</li>
<li>词袋法、TF-IDF</li>
</ul>
</li>
<li>
<p>词嵌入</p>
<ul>
<li>Word2Vec</li>
<li>GloVe</li>
<li>FastText</li>
</ul>
</li>
</ul>
<p>符号法算法假设词语彼此是语义不相关的，目的是把 <strong>文本</strong> 转为某个数字或<strong>向量</strong>。</p>
<p>而词嵌入算法假设不同的词语是由n维个语义组成的线性组合，目的是把 <strong>词语</strong> 转为<strong>向量</strong>。</p>
<br>
<h3 id="符号法">符号法</h3>
<p>符号法就是数某个词或某类词的出现次数(或占比)。符号法是计算机NLP领域的专业叫法；在经管社科领域，最常见的文本分析软件<a href="https://textdata.cn/blog/liwc_python_text_mining/">LIWC</a>用的其实也是符号法。LIWC全称 Linguistic Inquiry and Word Count，即语言查询与词频统计。</p>
<p><img loading="lazy" src="img/analysis-process.png" alt=""  />
</p>
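<p>符号法用几行 Python 就能演示。下面是一个极简示意（情感词典与分词结果均为假设的示例数据），统计文本中正面词的相对占比：</p>

```python
# 极简的“符号法”示意：用情感词典统计正面词占比
# 词典与分词结果均为假设的示例数据，仅演示思路
pos_words = {"不错", "很好", "推荐"}   # 假设的正面词典
neg_words = {"破损", "垃圾", "差评"}   # 假设的负面词典

def pos_ratio(tokens):
    """正面词数 / (正面词数 + 负面词数)，无情感词时返回 0.0"""
    pos = sum(t in pos_words for t in tokens)
    neg = sum(t in neg_words for t in tokens)
    return pos / (pos + neg) if (pos + neg) else 0.0

tokens = ["产品", "不错", "包装", "破损", "态度", "很好", "推荐", "购买"]
print(pos_ratio(tokens))   # 3 个正面词、1 个负面词 -> 0.75
```

<p>真实研究中，词典会换成 LIWC 或自建词典，分词用 jieba 等工具完成，思路完全相同。</p>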
<h3 id="符号法的应用">符号法的应用</h3>
<table>
<thead>
<tr>
<th>概念指标</th>
<th>测量方法</th>
</tr>
</thead>
<tbody>
<tr>
<td>认真(努力)</td>
<td>统计文本中词语的总个数(以文本长度衡量认真努力程度)</td>
</tr>
<tr>
<td>情感</td>
<td>使用情感词典，统计文本中正面词占比</td>
</tr>
<tr>
<td>可读性</td>
<td>文本中高难度(或专业性)词占比</td>
</tr>
<tr>
<td>客观性</td>
<td>文本中某个值的方差，如情感<br>- A<code>产品不错， 包装破损， 态度很好， 综合还是推荐大家购买!</code> [5, 1, 5, 4]<br>- B<code>产品垃圾，使用垃圾， 包装破损， 差评!!</code> [1,  1,  1,  1]<br>A的方差更大，更客观</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/jcr_concreteness_computation/"><font color='blue'>具体性</font></a></td>
<td>使用具体性词典， 将文本中出现的具体词权重累加，除以总词数，求得具体性得分</td>
</tr>
<tr>
<td>短视主义</td>
<td>统计短视相关词在年报管理层讨论与分析中出现的占比</td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-01-10-similarity_of_cental_bank_monetary_policy/"><font color='blue'>相似性(政策稳定性)</font></a></td>
<td>cosine(text_vector1, text_vector2)</td>
</tr>
<tr>
<td>&hellip;</td>
<td>&hellip;</td>
</tr>
</tbody>
</table>
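<p>上表中"相似性"一行用的 cosine(text_vector1, text_vector2)，可以用几行 Python 写出来（向量为示例数据，对应下文机器编码总结表中的 vec1、vec2）：</p>

```python
import math

# 词袋向量的余弦相似度：两向量夹角越小，文本越相似
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

vec1 = [1, 1, 1, 1, 1, 0]   # 句子1的词袋向量
vec2 = [0, 1, 0, 1, 0, 1]   # 句子2的词袋向量
print(round(cosine(vec1, vec2), 3))   # ≈ 0.516
```

<p>取值范围是 [0, 1]（词频非负时），越接近 1 表示两段文本用词越相似。</p>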
<br>
<h3 id="词嵌入">词嵌入</h3>
<p>词嵌入技术有 Word2Vec、GloVe 等，这类技术挖掘的是每个词的上下文语境，通俗的说法就是让计算机对同样的文章数据做千万次、上亿次完形填空。这样每个词语都有独特的上下文语义，并以n维向量形式表示，所以词嵌入也可以称之为词向量。</p>
<p><strong>向量模型有两个特点：近义词的向量相近、类似概念对的向量近似平行</strong>。分别举几个例子，方便大家理解。</p>
<p>语义空间是n维的，为了便于理解，将其压缩至二维空间。中学的向量大家都比较熟悉：在二维坐标空间中，两个点的连线可以组成新的向量，相同的向量是平行的。</p>
<p>而在下图的2维语义空间中，good、best语义更接近，所以空间距离更近。同理bad、worst更近。</p>
<p>而vector(good, best)、vector(bad, worst)这两个向量均表示<code>原形-&gt;最高级</code>, 语义向量会近似平行。</p>
<p>同理， vector(good, bad)、 vector(best, worst)两个向量表示 <code>好-&gt;差</code>，语义向量也会近似平行。</p>
<p><img loading="lazy" src="img/embeddings-based.png" alt=""  />
</p>
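<p>"差向量近似平行"可以直接算出来。下面用假设的二维词向量做一个最小示意（向量数值是虚构的，真实词向量是上百维）：</p>

```python
import numpy as np

# 假设的二维词向量，仅用于演示“原形->最高级”两个差向量近似平行
vecs = {
    "good":  np.array([0.8, 0.6]),
    "best":  np.array([0.9, 0.9]),
    "bad":   np.array([-0.8, 0.6]),
    "worst": np.array([-0.7, 0.9]),
}

def cos(a, b):
    """两向量的余弦值，接近 1 表示方向近似平行"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = vecs["best"] - vecs["good"]    # good -> best
v2 = vecs["worst"] - vecs["bad"]    # bad  -> worst
print(round(cos(v1, v2), 2))        # 接近 1，说明两个“原形->最高级”向量近似平行
```

<p>对真实训练好的模型，gensim 的 <code>most_similar(positive=..., negative=...)</code> 就是基于同样的向量加减法。</p>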
<br>
<h3 id="词嵌入与认知">词嵌入与认知</h3>
<p>刚刚词嵌入语义空间中的几个例子，其实就体现了语言中的记忆：语义记录了使用该语言的人群的认知。不同的组织，对于同一种概念，会有不同的偏好。例如，Nature2022 使用大规模语料数据训练出的词向量，发现语言中残存着人类的某些认知记忆。</p>
<p>通过构建概念词组对，在空间中投影，就可以挖掘出词语在该概念维度上的分值。例如，使用</p>
<ul>
<li>SMALL = [small, tiny, little&hellip;]</li>
<li>BIG = [big, mega, large&hellip;]</li>
</ul>
<p>每个词都是一个n维的向量，SMALL 和 BIG 各能计算出一个均值向量。大家还记得中学的向量投影吗？Nature2022 就是使用这个朴素的方法，测量每个动物名称所蕴含的人类尺寸认知。</p>
<p><img loading="lazy" src="img/Concept_Words_Project.png" alt=""  />
</p>
<blockquote>
<p>Grand, G., Blank, I.A., Pereira, F. and Fedorenko, E., 2022. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, pp.1-13.</p>
</blockquote>
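<p>语义投影的计算本身很朴素，下面给出一个最小示意（所有向量均为假设的二维数据，仅演示"构建概念轴 + 投影打分"两步）：</p>

```python
import numpy as np

# Nature2022 语义投影思路的极简示意：向量均为假设数据
SMALL = np.array([[0.9, 0.1], [0.8, 0.2]])   # 假设的 small、tiny 词向量
BIG   = np.array([[0.1, 0.9], [0.2, 0.8]])   # 假设的 big、large 词向量

# 两组均值向量相减，得到“小->大”概念轴，并归一化
size_axis = BIG.mean(axis=0) - SMALL.mean(axis=0)
size_axis /= np.linalg.norm(size_axis)

def size_score(word_vec):
    """词向量在概念轴上的标量投影，得分越大表示越“大”"""
    return float(word_vec @ size_axis)

ant      = np.array([0.85, 0.15])   # 假设的 ant 向量
elephant = np.array([0.15, 0.85])   # 假设的 elephant 向量
print(size_score(ant) < size_score(elephant))   # True：大象的尺寸得分更高
```

<p>把概念轴换成"贫穷->富有"、"女性->男性"等词组对，就能测量语言中残存的其他认知与偏见。</p>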
<br>
<h3 id="机器编码总结">机器编码总结</h3>
<p>这里做个表格对比，大家自己感受下三种技术的异同。</p>
<table>
<thead>
<tr>
<th>机器编码方式</th>
<th>计算方法</th>
<th>维度类比</th>
<th>任务</th>
<th>例子</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>符号法-字典</strong>（词频）</td>
<td>数个数</td>
<td>原子</td>
<td>统计每句话里的名词个数</td>
<td>sent_num1 = 2<br>sent_num2 = 1</td>
</tr>
<tr>
<td><strong>符号法-词袋</strong></td>
<td>bag of words<br>one-hot<br>Tf-idf</td>
<td>分子</td>
<td>转化为词向量, 计算两个句子相似度。</td>
<td>vec1 = [1, 1, 1, 1, 1, 0]<br>vec2 = [0, 1, 0, 1, 0, 1]<br>similarity = cosine(vec1, vec2)</td>
</tr>
<tr>
<td><strong>词嵌入</strong></td>
<td>word2vec、<br>glove等</td>
<td>中子、质子、电子</td>
<td>词语相似度。(语义上大小相近，方向相反; 态度、偏见)</td>
<td>mom = [0.2, 0.7, 0.1]<br/>dad   = [0.3, 0.5, -0.2]</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="经管-文本分析-文献">经管-文本分析-文献</h2>
<p>在这里我把技术细分为词频、词袋、W2V建词典、W2V认知变迁四个维度，整理了经管领域的6篇论文。大家可以阅读这6篇论文，掌握文本分析的应用场景。</p>
<table>
<thead>
<tr>
<th>文献</th>
<th>定性</th>
<th>词频</th>
<th>词袋</th>
<th>W2V建词典</th>
<th>W2V认知变迁</th>
</tr>
</thead>
<tbody>
<tr>
<td>王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性&ndash;基于 Kickstarter 的实证研究. <em>管理世界</em>, (5), pp.81-98.</td>
<td>Y</td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/jcr_concreteness_computation/"><font color='blue'>语言具体性如何影响顾客满意度</font></a><br>Packard, Grant, and Jonah Berger. “How concrete language shapes customer satisfaction.” <em>Journal of Consumer Research</em> 47, no. 5 (2021): 787-806.</td>
<td></td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Wang, Quan, Beibei Li, and Param Vir Singh. &ldquo;Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis.&rdquo; Information Systems Research 29, no. 2 (2018): 273-291.</td>
<td></td>
<td></td>
<td>Y</td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2019-12-08-lazy-prices/"><font color='blue'>文本相似度</font></a><br>Cohen, L., Malloy, C. and Nguyen, Q., 2020. Lazy prices. <em>The Journal of Finance</em>, <em>75</em>(3), pp.1371-1415.</td>
<td></td>
<td></td>
<td>Y</td>
<td></td>
<td></td>
</tr>
<tr>
<td>胡楠,薛付婧,王昊楠. <strong>管理者短视主义</strong>影响企业长期投资吗？——基于文本分析和机器学习[J].管理世界,2021,37(05):139-156+11+19-21.</td>
<td></td>
<td></td>
<td>Y</td>
<td>Y</td>
<td></td>
</tr>
<tr>
<td><a href="https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/"><font color='blue'>计算团队的话语多样性衡量团队的认知多样性</font></a><br>Lix, Katharina, Amir Goldberg, Sameer B. Srivastava, and Melissa A. Valentine. “Aligning differences: Discursive diversity and team performance.” Management Science 68, no. 11 (2022): 8430-8448.</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>Y</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="案例">案例</h2>
<h3 id="案例1-众筹语言风格">案例1-众筹语言风格</h3>
<p>王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性&ndash;基于 Kickstarter 的实证研究. <em>管理世界</em>, (5), pp.81-98.</p>
<blockquote>
<p>众筹融资效果决定着众筹平台的兴衰。众筹行为很大程度上是由投资者的主观因素决定的，而影响主观判断的一个重要因素就是语言的说服性。而这又是一种典型的用户产生内容（UGC），项目发起者可以采用任意类型的语言风格对项目进行描述。不同的语言风格会改变投资者对项目前景的感知，进而影响他们的投资意愿。首先，依据 Aristotle 修辞三元组以及 Hovland 说服模型，采用扎根理论，将众筹项目的语言说服风格分为 5 类：诉诸可信、诉诸情感、诉诸逻辑、诉诸回报和诉诸夸张。</p>
<p>然后，<strong>借助文本挖掘方法，构建说服风格语料库，并对项目摘要进行分类。</strong></p>
<p>最后，建立语言说服风格对项目筹资影响的计量模型，并对 <strong>Kickstarter 平台上的 128345 个项目进行实证分析</strong>。总体来说，由于项目性质的差异，不同的项目类别对应于不同的最佳说服风格。</p>
</blockquote>
<p><img loading="lazy" src="img/%e4%bc%97%e7%ad%b9-%e7%a7%8d%e5%ad%90%e8%af%8d.png" alt=""  />

<img loading="lazy" src="img/%e4%bc%97%e7%ad%b9-%e6%b5%81%e7%a8%8b%e5%9b%be.png" alt=""  />
</p>
<br>
<h3 id="案例2-lazy-prices文本相似性">案例2 Lazy prices文本相似性</h3>
<p>Cohen, L., Malloy, C. and Nguyen, Q., 2020. <a href="https://textdata.cn/blog/2019-12-08-lazy-prices/">Lazy prices</a>. <em>The Journal of Finance</em>, <em>75</em>(3), pp.1371-1415.</p>
<blockquote>
<p>之前的研究认为，尽管投资者起初对包含重大变化的财务报告的发布作出了及时反应，但随着时间的流逝，这种公告效应会减弱(Brown and Tucker, 2011; Feldman et al., 2010)。这表示10-K报告会随着时间推移，信息价值大打折扣。尽管我们复现了这个事实，即常规文件的变更没有重大的公告效应，但我们认为，前人的研究忽略了报告中更重要的部分(如MD&amp;A)对资产价格的影响。</p>
</blockquote>
<blockquote>
<p>确切地说，<strong>并不是报告披露的信息价值变低了，而是投资者越来越难以发现报告中微妙的信息变化，比如因为报告变得越来越冗杂。投资者只有看到某些新闻后，才会逐渐意识到之前公司报告内容变化的真正价值</strong>。</p>
<p>使用1995年-2014年所有美国公司季度和年度申报的完整历史记录，研究发现当公司对<strong>报告进行积极更改</strong>时，这种行为<strong>蕴含着</strong>公司未来运营的<strong>重要信号</strong>。</p>
<p><strong>财务报告的语言和结构的变化也对公司的未来收益产生重大影响</strong>：做空&quot;变化&quot;的公司（即报告发生变化就做空其股票），买入“不变化”的公司，这样的投资组合策略在2006年可获得高达1.88%的月度alpha收益（每年超过22％）。报告中涉及高管（CEO和CFO）团队的话语风格的变化，或者有关诉讼(风险部分)的话语的变化，都对投资的未来收益有重要作用。</p>
</blockquote>
<p><img loading="lazy" src="img/lazy-prices-1.png" alt=""  />

<img loading="lazy" src="img/lazy-prices-2.png" alt=""  />
</p>
<br>
<h3 id="案例3-山寨-vs-原创">案例3 山寨 vs 原创</h3>
<p>Wang, Quan, Beibei Li, and Param Vir Singh. &ldquo;Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis.&rdquo; Information Systems Research 29, no. 2 (2018): 273-291.</p>
<blockquote>
<p><strong>进行此类研究的主要威慑因素是缺乏一种客观的方法来识别应用程序是模仿者还是原创者。通过结合自然语言处理，潜在语义分析，基于网络的聚类和图像分析等机器学习技术，我们提出了一种将应用识别为原创app或模仿app，可检测两种模仿者的方法：欺骗性和非欺骗性。</strong></p>
<p>根据检测结果，我们进行了经济计量分析，以确定五年间在iOS App Store中发布的<strong>5,141个开发人员的10,100个动作游戏应用程序</strong>样本中，模仿app对原创app需求的影响。我们的结果表明，特定模仿者对原始应用需求的影响取决于模仿者的质量和欺骗程度。高质量的非欺骗性复制品会对原件产生负面影响。相比之下，低质量，欺骗性的模仿者正面影响了对原创app的需求。</p>
<p>结果表明，从总体上讲，模仿app对原创app需求的影响在统计上并不显著。<strong>我们的研究通过提供一种识别模仿app的方法</strong>，并提供模仿app对原创app需求影响的证据，为越来越多的移动应用消费文献做出了贡献。</p>
</blockquote>
<p><img loading="lazy" src="img/copycat.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>OS2022 | 概念空间 | 词嵌入模型如何为组织科学中的测量和理论提供信息</title>
      <link>https://textdata.cn/blog/2023-11-03-organization-science-with-word-embeddings/</link>
      <pubDate>Fri, 03 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-03-organization-science-with-word-embeddings/</guid>
      <description>&lt;p&gt;Aceves, Pedro, and James A. Evans. &amp;ldquo;&lt;strong&gt;Mobilizing conceptual spaces: How word embedding models can inform measurement and theory within organization science.&lt;/strong&gt;&amp;rdquo; &lt;em&gt;Organization Science&lt;/em&gt; (2023).&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;摘要&#34;&gt;摘要&lt;/h2&gt;
&lt;p&gt;词嵌入模型是一种表示多维概念空间的强大方法，在多维概念空间中，所传达的概念可以相互关联、组合和竞争。此类模型代表了机器学习的最新进展，使学者能够利用大规模文本数据中局部和全局的单词共现，以最小的语义失真，有效地编码复杂的意义系统。尽管词嵌入的使用有可能扩大组织科学中的理论可能性，但嵌入对于组织学者来说很大程度上是陌生的，尚未发挥出应有的潜力。我们的目标是通过为用户提供实用的路线图来展示嵌入模型在组织科学中的前景，以便在他们的研究中运用该方法，并为开展该类研究的学者提供理论指导。我们首先明确定义 &lt;strong&gt;概念&lt;/strong&gt; 和 &lt;strong&gt;概念空间&lt;/strong&gt; 这两个概念，然后展示如何使用词嵌入模型来表示和测量这些概念，并指出该方法的优点和缺点。随后，我们提供一组嵌入测量及其理论解释和灵活的扩展。我们的目标是从词嵌入的技术处理中提取概念，并将其置于实践的理论框架中，以加速此类研究。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;过去十年，文本作为数据的计算使用在组织科学中显着增长（Hasan 等人，2015 年；Goldberg 等人，2016 年；Srivastava 等人，2018 年；Hannigan 等人，2019 年）。这种增长的主要原因是文本编码的概念信息赋予个人、组织、经济和社会行为以意义（Evans 和 Aceves 2016，Gentzkow 等人 2019），并且在过去十年中，来自组织环境的文本数据急剧增长，大大提高了文本的可用性。然而，文本中编码的 &lt;strong&gt;概念意义&lt;/strong&gt; 本质上是高维的，这使得降低概念复杂性成为研究文本的学者的中心任务。&lt;strong&gt;词嵌入模型是由计算机科学家和语言学家开发的一个新兴工具系列，用于文本信息降维，以此提取概念及其数字表示&lt;/strong&gt;。词嵌入技术的发展使组织科学家依赖于文本数据进行理论构造， 相比之前，数据中信息的保真度更高，由此文本数据与组织研究交叉场景形成了新的理论研究路线。尽管词嵌入模型在组织科学之外得到广泛使用，但由于组织科学领域的学者缺乏对词嵌入技术的理解， 不知如何将它们纳入理论发展过程的原则框架，词嵌入模型对于理论发展的价值仍然被掩盖。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;词嵌入模型建立在高效的神经网络架构之上，并通过将复杂的语义系统有效编码到具有最小失真的稠密几何空间中，彻底改变了语义分析&lt;/strong&gt;。这些模型代表了数十到数百个维度的空间中的语义，相对于语言中的单词和概念的数量来说，这个维度较低； 但相对于正式社会和文化理论家之前试图呈现概念信息的两到三个维度来说，这个维度却很高（奥斯古德 1964 年，史密斯-洛文和海斯 1988 年）。出于组织科学的目的，这些嵌入模型创建了社会系统中个体所持有的集体知识的 &lt;strong&gt;数字替身&lt;/strong&gt; ， 嵌入可以解决文化上隐含类比（Mikolov et al. 2013b），回答文化偶然问题（Devlin et al. 2019，Radford et al. 2022），并预测未来的知识发现（Tshitoyan等人 2019；Sourati 和 Evans 2021）。组织科学长期以来一直借鉴人工智能（AI）的表征概念， 在这里，我们使用人工智能的表示机制来增强组织理论研究（Csaszar 和 Steinberger 2022）。&lt;/p&gt;
&lt;p&gt;然而，由于神经网络复杂，且难以理解的黑盒性质特性，围绕神经嵌入和人工智能方法对理论发展的价值存在争议。尽管预测能力很强，但此类方法往往缺乏可解释性（Knight 2017，Leavitt et al. 2021）。&lt;strong&gt;在组织科学领域中，学者缺乏此技术的理解，即&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;对于嵌入何时成为组织科学有用的方法论选择&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;如何在既定认识论标准内证明使用“复杂”神经嵌入方法的合理性&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;如何在各种嵌入方法中进行选择&lt;/strong&gt;（例如，静态词嵌入与上下文嵌入、预训练嵌入与自定义嵌入）&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;使用嵌入进行研究的适当步骤以及评估嵌入研究的相关标准&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;最值得注意的是，研究界，特别是那些研究组织认知、文化、知识和意义的人，似乎并不清楚嵌入方法 &lt;strong&gt;如何融入将方法论选择与理论发展联系起来的框架&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;我们的目的是通过两项贡献来解决这些问题。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;首先，我们的目标是提供一个理论指南，为嵌入模型提供一个原则性的概念框架，学者可以使用该框架为他们的模型注入意义，并使学者们能够在理论发展过程中运用这些模型。我们这里的主要论点是，词嵌入模型中的每个向量代表一个概念，整个嵌入模型代表生成文本数据的社会系统的概念空间&lt;/strong&gt;。嵌入模型所代表的概念空间是多维空间，其中从规范和知识到想法和发明的概念相互关联。这个框架使组织学者能够利用嵌入模型的概念空间，与组织科学的许多领域之间建立联系。例如，不同公司基于知识视角对该空间的差异化覆盖（Grant 1996），组织理论家在描述规范和制度（Scott 2003），类别学者援引在决定将一个物体归类到哪个概念时（Pontikes 和 Barnett 2015 ），创新学者直接理论化寻求测量发现和发明的新颖性（Fleming 和 Sorenson 2001，2004），并且团队研究人员寻求了解成员在空间中的不同立场如何影响创造力、协调性和绩效（Srikanth 等人，2016）。因为我们以 &lt;strong&gt;概念&lt;/strong&gt; 和 &lt;strong&gt;概念空间&lt;/strong&gt; 为中心的理论框架可以推广到组织理论的许多背景，所以我们希望嵌入模型所支持的研究将促进这些子领域之间更深入、更持久的对话。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;其次，我们的目标是为利用嵌入模型进行理论发展提供实用的路线图&lt;/strong&gt;。在此过程中，我们引导读者完成使用专利摘要语料库来实现词嵌入模型的过程，以表示现代技术创新的概念空间。我们解释了研究人员需要设置的模型参数，并逐步完成了他们应该采取的验证步骤，以评估模型是否有效地代表了他们感兴趣的概念空间，并提供了方法附录，其中包含实现所讨论的所有内容所需的代码。在注意到嵌入模型的可供性的同时，我们还讨论了它们不断发展的局限性，并提出了它们何时不适合组织分析的建议。然后，我们展示嵌入模型如何实现依赖于概念和概念空间的构造的理论化和测量。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;我们概述了两大类词嵌入使用方法&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;度量之内/之间进行标记&lt;/strong&gt;，我们提出了跟踪相关分析集内部和之间的概念关系的度量，以帮助我们跟踪与概念广度、概念距离和概念相似性&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;意义及其维度&lt;/strong&gt;，我们提出了四种衡量标准，为了解意义及其与组织的关系提供了不同的窗口。为找出这些测量机会的理论可能性，我们重点介绍了一些研究进展。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;本论文的一个核心主张是，在组织研究不同广度和深度，词嵌入工具现在使我们能够表示其概念空间，并且比以前更精细地表示细节&lt;/strong&gt;。有鉴于此，我们的目标是展示嵌入模型如何在与组织科学家相关的领域中操作概念空间，使研究人员能够扩展和完善现有理论。我们希望这一理论指南和实践路线图将促进组织科学内部的理论扩展，该扩展首先是扩大对文本数据的访问以及用于分析的随附计算工具（Kovács 等人，2013 年; Goldberg 等人; 2016年，Hannigan 等人, 2016年, 2019； Guo 等人，2020）。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二概念和概念空间&#34;&gt;二、概念和概念空间&lt;/h2&gt;
&lt;p&gt;概念是人类生活的一个基本特征，我们的日常思维很大程度上依赖于它们所代表的信息，使我们能够对周围的人、物体和事件进行分类，并将这些信息传达给其他人（Murphy 2002；Bergen 和 Feldman 2008 年； Cassanto 和 Lupyan，2015 年）。概念是将我们的精神世界粘合在一起的粘合剂（Murphy 2002），赋予精神和物质体验以意义（Hannan et al. 2019）。&lt;strong&gt;在认知科学和心理学的语言中，概念是“事物类别的「心理表征」”（Murphy 2002）。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;概念有两大功能：分类和交流（Medin and Rips 2005），这些功能都需要语言的帮助。实际上，我们通过在语言中分配一个单词或短语来表示一个稳定概念的信息内容。这就是为什么我们通过说出或写出 “&lt;em&gt;&lt;strong&gt;manager&lt;/strong&gt;&lt;/em&gt;” 一词来提及经理的概念，从而引出它所包含的概念信息，例如对他人的责任、做出决策以及相对于组织同行获得更高的薪水。然后，语言的单词分割并链接了社区的共享概念空间（Lupyan 和 Bergen 2015）。这样，“一个概念就是一个单词或短语的含义……[包括]像 ‘&lt;em&gt;&lt;strong&gt;red&lt;/strong&gt;&lt;/em&gt;’ 和 ‘&lt;em&gt;&lt;strong&gt;grasp&lt;/strong&gt;&lt;/em&gt;’这样的基本的、具体化的单词，以及像 ‘&lt;em&gt;&lt;strong&gt;goal&lt;/strong&gt;&lt;/em&gt;’ 和 ‘&lt;em&gt;&lt;strong&gt;continuity&lt;/strong&gt;&lt;/em&gt;’ 这样的抽象和技术单词”（卑尔根）和 Feldman 2008]）。&lt;/p&gt;
&lt;p&gt;概念并不作为唯一的信息单位存在于真空中。相反，概念之所以有意义，是因为它们彼此相关（Hannan et al. 2019），“通过相似性和上下文的关系紧密地缝合在一起”（Hofstadter and Sander 2013）。在这种多重概念关系中存在着“我们对世界的大部分知识，告诉我们存在什么以及它们具有什么属性”（Murphy 2002，p.1）。例如，概念 &lt;em&gt;&lt;strong&gt;resource&lt;/strong&gt;&lt;/em&gt;  与  &lt;em&gt;&lt;strong&gt;firm&lt;/strong&gt;&lt;/em&gt;、&lt;em&gt;&lt;strong&gt;constraint&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;natural&lt;/strong&gt;&lt;/em&gt; 等概念相关。在文化系统的层面上，概念之间的相互关系引发了表征概念之间宏观层面有意义的维度。 &lt;em&gt;&lt;strong&gt;manager&lt;/strong&gt;&lt;/em&gt; 概念在某些方面与 &lt;em&gt;&lt;strong&gt;coach&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;president&lt;/strong&gt;&lt;/em&gt; 的概念很接近，而在其他方面则与&lt;em&gt;&lt;strong&gt;employee&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;bureaucracy&lt;/strong&gt;&lt;/em&gt; 的概念很接近。将概念理解为存在于复杂几何空间中的点，使我们能够思考和测量概念之间的距离远近（Hannan 等人，2019）。例如，与  &lt;em&gt;&lt;strong&gt;playground&lt;/strong&gt;&lt;/em&gt; 或 &lt;em&gt;&lt;strong&gt;ice cream&lt;/strong&gt;&lt;/em&gt; 相比， &lt;em&gt;&lt;strong&gt;manager&lt;/strong&gt;&lt;/em&gt; 与&lt;em&gt;&lt;strong&gt;organization&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;leader&lt;/strong&gt;&lt;/em&gt; 概念的联系更加紧密。&lt;strong&gt;我们将这种概念相关的多维空间称为概念空间&lt;/strong&gt;（Hannan et al. 2019)&lt;/p&gt;
&lt;p&gt;重要的是我们用复数来指代概念空间。对于许多单词来说，它们会根据使用的上下文表现出不同的概念信息模式。首先，概念可能会根据使用它们的社会背景而有所不同。例如，如果在执行董事会议室、商品交易大厅或附近的储蓄和贷款机构的背景下说出 “&lt;em&gt;&lt;strong&gt;Bank&lt;/strong&gt;&lt;/em&gt;”，指的是银行而不是河流。概念也可能根据使用时间的不同而有所不同。例如，“&lt;em&gt;&lt;strong&gt;高科技&lt;/strong&gt;&lt;/em&gt;” 一词所引发的概念关系会根据我们研究的是 1960 年代、1990 年代还是今天而有所不同。最后，概念关系因使用它们的社区而异，因此 “&lt;em&gt;&lt;strong&gt;债务&lt;/strong&gt;&lt;/em&gt;” 所捕获的概念将根据其是由首席财务官还是低收入个人使用而有所不同。概念所含信息存在多样性， 正如 Hannan等人（2019）指出，“虽然有些概念可能是天生的或生物驱动的，但大多数都是社会构建的。”&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三先前研究中的概念和概念空间&#34;&gt;三、先前研究中的概念和概念空间&lt;/h2&gt;
&lt;p&gt;概念以及扩展的概念空间是人类思维和交流的基础（Sperber 和 Wilson 1986；，Murphy 2002；Hofstadter 和 Sander 2013）。正因为如此，概念和概念空间对于许多组织理论框架来说或多或少是明确和关键的。在某些研究（例如类别研究）中，概念具有核心重要性并且已经被明确地理论化。然而，在其他情况下，（例如，公司基于知识视角）概念被隐含地假定，即使它们是决定许多理论期望的基本成分。鉴于概念无处不在，对组织科学所有领域使用概念信息进行全面回顾超出了本文的范围。我们将简短、非详尽的回顾集中在概念和概念空间概念的三个领域——&lt;strong&gt;类别、知识和文化&lt;/strong&gt;。通过嵌入技术处理并追踪存在于个人和社区头脑中的概念信息，研究其对组织行为和结果的影响。&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;31-类别&#34;&gt;3.1 类别&lt;/h3&gt;
&lt;p&gt;类别是具有共同特征和属性的实体组。如前所述，概念是类别的心理表征。对类别的研究主要集中在跨类别或模糊类别是否会增加或减少分类实体的估值。自Zuckerman（1999）以来的工作一直集中在消除歧义条件上，在这些条件下，类别跨越和模糊性会导致积极或消极的估值。许多研究表明，由于感知偏差（Durand et al. 2007）、不符合受众期望（Hsu 2006)、Hsu et al. 2009；Leung and Sharkey 2014） ，跨越模糊的类别会损害实体估值，或降低分类对比度（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B119&#34;&gt;Negro et al. 2010&lt;/a&gt;）。其他研究表明，跨越类别可以创造积极的估值结果，因为它表明非典型性可以放大良好的表现并缓冲不良表现（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B159&#34;&gt;Smith 2011&lt;/a&gt;），一个类别可以锚定认知，而另一个类别可以有益地修改认知（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B188&#34;&gt;Wry et al. 2014&lt;/a&gt;）。还有其他研究表明，效果取决于受众，有些人喜欢跨类别，而另一些人则不喜欢（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B135&#34;&gt;Pontikes 2012&lt;/a&gt;）。通过这些方式，类别可以通过影响有关类别成员资格的概念信息的解释方式，对行为和绩效产生积极或消极的影响。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn5&#34;&gt;4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;尽管类别范式的贡献历来是通过类别成员的集合和模糊集合理论（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B64&#34;&gt;Hannan et al. 2007&lt;/a&gt;）概念来实现的，但最近的工作开始纳入其多维性（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65&#34;&gt;Hannan et al. 2019&lt;/a&gt;）和类别的分级归属感。组织学者感兴趣的许多现象都是由概念及其代表的类别之间的精确距离支撑的。例如，鉴于专利所贡献的技术领域，专利通常分为类别和子类。然而，专利中编码的想法可以传播到创新空间的广泛领域，即使只分类在一个类别中。正如我们稍后讨论的，转向概念的几何概念，使分析师能够考虑隶属度、重叠和连续距离影响底层实体评估判断的方式&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65&#34;&gt;（Hannan 等人，2019 &lt;/a&gt;。&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;32-知识&#34;&gt;3.2 知识&lt;/h3&gt;
&lt;p&gt;众所周知，知识很难具体说明，并且在哲学、认知科学和社会科学领域，围绕其概念性质进行了长期而活跃的争论（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B166&#34;&gt;Steup 和 Neta 2020&lt;/a&gt;）。然而，过去几十年来，组织科学在微观、中观和宏观层面上进行了大量研究，解决有关知识及其在团队、组织和经济活动中的作用的问题。从对团队成员专业知识的研究（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B164&#34;&gt;Srikanth et al. 2016&lt;/a&gt;）到公司基于知识和注意力的观点（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B87&#34;&gt;Kogut and Zander 1992&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B56&#34;&gt;Grant 1996&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B123&#34;&gt;Ocasio 1997&lt;/a&gt;）；从交互记忆系统（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B143&#34;&gt;Ren 和 Argote，2011&lt;/a&gt;）到创新流程（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B50&#34;&gt;Garud 等，2013&lt;/a&gt;）；从组织设计（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B45&#34;&gt;Foss et al. 2013&lt;/a&gt;）到搜索和探索（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B93&#34;&gt;Lavie et al. 2010&lt;/a&gt;），知识在最近的组织理论化中发挥着核心作用。&lt;/p&gt;
&lt;p&gt;无论人们对知识的定义如何选择，命题性知识从根本上都与概念信息相关。&lt;em&gt;&lt;strong&gt;命题知识采取“ S [主体]知道p [命题]”&lt;/strong&gt;&lt;/em&gt; 的形式（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B80&#34;&gt;Ichikawa and Steup 2018&lt;/a&gt;）。在某种程度上，命题是由语言中的单词编码的，并且单词代表概念信息，命题知识依赖于概念以及它们如何在概念空间中交织（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B110&#34;&gt;McGrath and Frank 2020&lt;/a&gt;）。以命题“泰勒知道氢的主要工业应用是氨的制造”和“特里知道量子算法可以具有较低的时间复杂度”为例。这些知识命题中的每一个都代表了不同的概念意义，前面提到的领域将以不同的方式操作它们。例如，团队学者可能会强调，由泰勒和特里组成的专利团队将拥有多样化的基础知识。采取基于注意力观点的学者会注意到，泰勒和特里可能会以不同的方式关注知识空间，以应对组织变革。研究创新的人可能会注意到如果泰勒和特里共享办公空间，知识重组的潜力。研究搜索的人可能会假设，为了解决问题，泰勒和特里会以不同的方式搜索概念性解决方案。在所有这些情况下，就这些领域通过诉诸语言编码的命题知识来理论化知识动态而言，它们以基本和可测量的方式参与概念和概念空间。&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;33-文化&#34;&gt;3.3 文化&lt;/h3&gt;
&lt;p&gt;文化被不同地概念化为集体的共同价值观、故事、框架、工具包和类别（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B52&#34;&gt;Geertz 1973&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B131&#34;&gt;Pettigrew 1979&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B92&#34;&gt;Lamont 和 Small 2008&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B158&#34;&gt;Small 等人 2010&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B54&#34;&gt;Giorgi 等人 2015&lt;/a&gt;）。文化建构已成为组织研究的核心，在从个人和团队到组织和国家的各个层面的分析中都得到了运用（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B54&#34;&gt;Giorgi et al. 2015&lt;/a&gt;）。从理解文化如何塑造职业结构（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B195&#34;&gt;Glynn 2000&lt;/a&gt;）、组织领域（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B194&#34;&gt;Anteby 2010&lt;/a&gt;）和创业环境（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B106&#34;&gt;Lounsbury and Glynn 2001&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B141&#34;&gt;Rao and Giorgi 2006&lt;/a&gt;）到它在讲故事（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B106&#34;&gt;Lounsbury and Glynn 2001&lt;/a&gt;）和身份建设中的作用（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B196&#34;&gt;Ravasi 和 Schultz 2006&lt;/a&gt;），从其对人际沟通的塑造（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165&#34;&gt;Srivastava 等人，2018&lt;/a&gt;）到对组织绩效的影响（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26&#34;&gt;Corritore 等人，2020&lt;/a&gt;），文化深深地受到概念及其互动方式的调节。文化以集体认知过程为基础（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B36&#34;&gt;DiMaggio 1997&lt;/a&gt;，&lt;a 
href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B128&#34;&gt;Patterson 2014&lt;/a&gt;），很大程度上可以通过语言痕迹来获取。语言进入文化的窗口（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B55&#34;&gt;Goldberg et al. 2016&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165&#34;&gt;Srivastava et al. 2018&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26&#34;&gt;Corritore et al. 2020&lt;/a&gt;）很大程度上是通过它所表达的概念来呈现的，使得概念和概念空间成为组织文化研究的重要支柱。&lt;/p&gt;
&lt;p&gt;基于它们在形成范畴、知识和文化方面的关键作用，概念和概念空间已成为许多组织理论赖以建立的知识支架的重要组成部分。然而，概念和概念空间通常仅被用作缺乏精确和可扩展的经验表征的不明确的隐喻。这限制了研究使用粗粒度的代理测量或允许手动编码和解释的小数据集。接下来，我们提出词嵌入模型是一种最先进的工具，用于表示概念和概念空间，可以添加到组织学者工具包中。就组织学者寻求将概念和概念信息所支撑的结构操作化而言，他们将得到这类新模型的帮助。考虑到这一点，我们接下来介绍嵌入模型如何工作以及为什么它们可以作为概念和概念空间的有效表示。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四使用词嵌入来表示概念和概念空间&#34;&gt;四、使用词嵌入来表示概念和概念空间&lt;/h2&gt;
&lt;h3 id=&#34;41-越来越多地使用文本作为数据&#34;&gt;4.1 越来越多地使用文本作为数据&lt;/h3&gt;
&lt;p&gt;过去 10 年，通过计算工具和方法进行文本数据分析出现了爆炸性增长。从社会学（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B40&#34;&gt;Evans and Aceves 2016&lt;/a&gt;）到经济学（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B53&#34;&gt;Gentzkow et al. 2019&lt;/a&gt;）和政治学（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B58&#34;&gt;Grimmer and Stewart 2013&lt;/a&gt;），文本正迅速成为组织、经济和社会生活的中心观察站。文本数据提供了在线知识社区、财报电话会议和公司报告、产品评估、组织电子邮件和讨论板、历史档案、视频转录和电影字幕、医疗记录、电子商务、社交媒体等多种领域的丰富思想和行为痕迹。媒体平台、新闻文章、科学学科等等。总而言之，这些文本数据源比以往任何时候都更深入、更广泛地进入组织生活。正如&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B40&#34;&gt;Evans 和 Aceves（2016 年&lt;/a&gt;）指出的那样，文本数据现在使我们能够访问“有关正在玩的社交游戏的隐藏元素及其背后的社交世界”的深层信息。然而，这些语料库的庞大规模及其广泛的范围意味着，提取理论上有意义的信息信号越来越多地受到计算方法的帮助，利用信息技术方法获取大量非结构化文本数据，并将它们转换为有意义且相关的度量。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn6&#34;&gt;5&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;文本数据与组织学者习惯使用的定量数据之间的一个主要区别是文本是高维的。正如&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B53&#34;&gt;Gentzkow 等人（2019 年&lt;/a&gt;）指出，“仅使用英语中一千个最常用单词的 30 个单词的 Twitter 消息样本 [&amp;hellip;] 的维度大致与宇宙中的原子一样多。” 因此，使用文本作为数据的学者的中心任务是通过对数据施加限制来降低维度。&lt;strong&gt;过去二十年里，组织科学中用于降低这一维度的一些最常用的计算工具是词典法、语义网络和主题模型。尽管这些方法有其优点，但一个主要缺点是它们无法对文本中存在的细粒度概念关系和关联进行编码&lt;/strong&gt; 。接下来，我们将展示嵌入模型如何利用文本中的局部和更广泛的信息来训练概念含义和概念空间的高保真表示。在此过程中，我们展示了词嵌入模型如何克服先前方法来表示文本中编码的含义的一些局限性，从而允许对理论结构进行更细粒度的测量，并实现新的理论可能性。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;42-词嵌入&#34;&gt;4.2 词嵌入&lt;/h3&gt;
&lt;p&gt;我们之前解释过，概念是事物类别的心理表征，人类通过在词典中分配一个单词或短语来表示稳定的概念；并指出，概念只有在与跨多个维度的其他概念相关并为其提供信息时才有意义，它们共同构成稠密的概念空间。在这里，我们认为词嵌入模型是最近从机器学习发展而来、应用于自然语言处理的一类模型，它使我们能够有效且高效地表示概念空间，并将这些空间用于组织科学研究。词嵌入模型是文本语料库中单词的连续表示，可以进行几何解释。&lt;strong&gt;词嵌入的方法论假设，一个词的含义很大程度上是由出现在其直接和更广泛上下文中的词所决定的，这一想法受到结构语言学家的启发，他们已经证明，含义的差异与局部分布相关（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B68&#34;&gt;Harris 1954&lt;/a&gt;），这个想法现在被称为 「分布式语义学」，Firth 的著名描述是：“观其伴而知其意”（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B42&#34;&gt;Firth 1957&lt;/a&gt;，you shall know a word by the company it keeps），一个单词所代表的概念或含义可以通过它周围的单词的分布来推断&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;以这种分布式方式思考概念和概念空间的底层计算架构可以追溯到 20 世纪 80 年代初期计算机科学家 Geoffrey Hinton 的工作（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B71&#34;&gt;Hinton 1986&lt;/a&gt; , &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B72&#34;&gt;Hinton et al. 1986&lt;/a&gt;）以及认知科学家在这一时期研究的并行分布式处理模型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B149&#34;&gt;Rumelhart 等人，1986a&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B150&#34;&gt;b&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B109&#34;&gt;McClelland 和 Rumelhart，1989&lt;/a&gt;）。分布式架构是当前嵌入语言模型的基础（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115&#34;&gt;Mikolov et al. 2013b&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130&#34;&gt;Pennington et al. 2014&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35&#34;&gt;Devlin et al. 2019&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104&#34;&gt;Liu et al. 2019&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B17&#34;&gt;Brown et al. 2020&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B41&#34;&gt;Fedus et al. 2020）。 2021&lt;/a&gt;）， 嵌入模型 Word2Vec 算法(&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115&#34;&gt;Mikolov 等 2013b&lt;/a&gt;) 相对简单易用，能够处理中等规模的语料库来。 &lt;strong&gt;Word2Vec 与  GloVe（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130&#34;&gt;Pennington 等人，2014 年&lt;/a&gt;）和 FastText（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B13&#34;&gt;Bojanowski 等人，2017 年&lt;/a&gt;）等嵌入算法，是 ChatGPT 和相关模型的基础&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;举个例子来帮助理解算法：假设我们要创建过去 50 年创新活动的概念空间表示。首先需要该概念活动领域的文本数据，美国专利局数据提供了创新活动的踪迹，其中包括所有专利的文本、摘要、描述和权利要求。在整篇论文中，我们使用这个专利摘要语料库来指导读者完成训练概念空间和构建相关概念测量的过程。数据可从&lt;a href=&#34;https://patentsview.org/&#34;&gt;Patentsview.org&lt;/a&gt;免费下载，我们使用 1976 年至 2019 年间发布的所有专利来构建本文中的词嵌入模型并测量相关指标。&lt;/p&gt;
&lt;p&gt;想象一下，专利语料库中的每个独特单词都是从放置在巨大冰箱上的随机放置的 &lt;strong&gt;“word magnet”&lt;/strong&gt; 开始的（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B76&#34;&gt;Hovy 2020&lt;/a&gt;）。当连续词袋 (CBOW) 算法滚动浏览语料库时，使用每个目标词周围的单词词(滑动窗口的上下文)来预测目标词（更多内容见下文）。该算法的最终目标是产生一种语义模型，其中出现在相似上下文中的单词彼此接近，而来自不同上下文的单词则相距很远。由于用2维概念空间不足以捕获每个单词的全部含义，因此该算法改为在更高的（100-1,000）维空间内捕捉语义。通过这种方式，目标单词的概念信息是从它周围的单词中归纳出来的，将语料库中的每个单词绘制为&lt;em&gt;n&lt;/em&gt;维空间中的坐标或向量。正是单词在这个&lt;em&gt;n&lt;/em&gt;维向量空间中的相对位置，使我们能够将词嵌入模型可以描述代表人类概念活动区域的概念空间。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn7&#34;&gt;6&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;概念意义的识别假定了嵌入空间的可解释性。接下来，我们提出了对这些概念空间的一系列提示和测量，作为从中产生结构化解释的方法。这很像心理学家使用 &lt;strong&gt;心理测量调查&lt;/strong&gt; 将概念印象转化为可解释的观点（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B112&#34;&gt;Michael Furr 2021&lt;/a&gt;）。或者&lt;strong&gt;认知人类学家如何使用结构化任务，例如排序和排名（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B163&#34;&gt;Spradley 2016&lt;/a&gt;），将概念性的世界观转变为可解释的世界观&lt;/strong&gt;。我们认为嵌入模型必须接受结构化测量（就像向人类受试者提供的心理测量问卷）使他们的 **概念景观(conceptual landscape)**变得可解释。接下来，我们将引导读者如何用专利语料库训练创新概念空间表示的过程。之后， 我们概述了该方法的优点和局限性，并指出这些方法与先前的文本分析方法和组织研究实践的关系。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;43-选择语料库&#34;&gt;4.3 选择语料库&lt;/h3&gt;
&lt;p&gt;学者可以根据应用使用两种词嵌入模型。一方面，研究人员可以使用自有文本语料库来训练表示， 据此了解文本所涉主体(个人、团体、社会)行为的概念空间是什么样子， 以及概念关系揭示人类活动背景。在我们的示例中，专利创新在专利语料库中得到了很好的体现，因此我们在下面展示了如何从头开始训练概念空间表示, 以及它揭示了哪些概念联系。研究人员可以从头开始训练语料库的其他例子包括在线社区（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18&#34;&gt;Burtch et al. 2021&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2&#34;&gt;Aceves et al. 2022&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B23&#34;&gt;Chambers et al. 2022&lt;/a&gt;）、学术学科（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74&#34;&gt;Hofstra et al. 2020&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B102&#34;&gt;Lin et al. 2022&lt;/a&gt;） 、劳动力市场（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B9&#34;&gt;Bana 2022&lt;/a&gt;）、公共记录（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B6&#34;&gt;Arseniev-Koehler et al. 2022&lt;/a&gt;）、产品和公司描述（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61&#34;&gt;Guzman and Li 2023&lt;/a&gt;）以及财报电话会议和公开演讲（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B85&#34;&gt;Kirgil and Voyer 2022&lt;/a&gt;）。&lt;/p&gt;
&lt;p&gt;或者，如果研究人员想要在较小的语料库中追踪概念动态，而该语料库的大小不足以训练独特的、特定于上下文的嵌入，那么研究者可以使用预训练嵌入模型，需要注意，训练预训练嵌入模型的文本与研究者小语料库在内容、场景要有相似性。广泛使用的预训练嵌入已经在来自海量语料库的文本上进行了训练，例如新闻（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B4&#34;&gt;Google 2013&lt;/a&gt;）、维基百科（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35&#34;&gt;Devlin et al. 2019&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B57&#34;&gt;Grave et al. 2018&lt;/a&gt;）。训练这些预训练嵌入模型的文本语料体量很大， 内容题材往往包含我们较小文本样本中存在的概念。因此使用预训练嵌入对这些概念的信息进行编码，并可用于近似相关距离。政治和历史语义背景下的研究发现，预训练嵌入提供的结果与特定于上下文的嵌入相当（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski et al. 2019&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145&#34;&gt;Rodriguez and Spirling 2022&lt;/a&gt;）。如果有理由相信研究项目中包含的概念和想法没有在这些大量预训练嵌入中得到很好的体现，研究人员可以使用较小语料库中的文本对其进行 &lt;strong&gt;微调（Fine-Tune）&lt;/strong&gt;（ &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104&#34;&gt;Liu et al. 2019，Burtch et al.2019&lt;/a&gt;）&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18&#34;&gt;， 2021&lt;/a&gt;）。微调将预训练的概念空间扭曲为与样本一致（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104&#34;&gt;Liu et al. 2019&lt;/a&gt;），更好地反映概念之间的关系。&lt;/p&gt;
&lt;p&gt;最后，使用哪一种嵌入(自己训练的嵌入、 预训练的嵌入、微调的嵌入)将取决于研究人员的目的以及他们寻求追踪的概念动态的类型。接下来，我们将重点描述从头开始训练和验证嵌入模型的过程。在接下来的部分中，我们讨论不同参数设置和策略之间的权衡，并鼓励读者遵循文章文本和在线附录。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;44-清理语料库&#34;&gt;4.4 清理语料库&lt;/h3&gt;
&lt;p&gt;训练嵌入模型的第一步是使用 Python 等编程语言读入文本语料库：获取每个专利摘要中的文本，将文本小写，删除标点符号和数字字符串，并将每个摘要切分为称为 token 的单词列表。但单独切词可能破坏一些词组语义，这里使用 &lt;em&gt;&lt;strong&gt;bi-gram&lt;/strong&gt;&lt;/em&gt; 将高频共现的词识别并合并为词组，例如当 &lt;em&gt;&lt;strong&gt;“electric”&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;“vehicle”&lt;/strong&gt;&lt;/em&gt; 这两个词在某些上下文中频繁一起出现时，它们将被合并为短语和概念 &lt;em&gt;&lt;strong&gt;“electric_vehicle”&lt;/strong&gt;&lt;/em&gt;。建立单词或短语列表后，执行词嵌入算法来学习单词或二元组及其语言上下文之间的最佳距离，以保留语言中单词和短语的概念空间。&lt;/p&gt;
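&lt;p&gt;上述预处理步骤可以用 Python 简单示意（清洗规则、示例文本与 bi-gram 词对表均为假设，真实流程中 bi-gram 通常用 gensim 的 Phrases 从语料中自动学出）：&lt;/p&gt;

```python
import re

# 极简预处理示意：小写、去标点和数字串、切分为 token
def clean(abstract):
    text = abstract.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # 删除非字母字符（标点、数字串）
    return text.split()

# 将假设的高频共现词对合并为 bi-gram
PHRASES = {("electric", "vehicle")}

def merge_bigrams(tokens, phrases=PHRASES):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = clean("An Electric Vehicle charging system, patented in 2019.")
print(merge_bigrams(tokens))   # “electric vehicle” 被合并为 “electric_vehicle”
```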
&lt;br&gt;
&lt;h3 id=&#34;45-训练嵌入模型&#34;&gt;4.5 训练嵌入模型&lt;/h3&gt;
&lt;p&gt;第一步是选择词嵌入算法，可选方案包括：浅层神经网络构建的单词表示（例如 Word2Vec、FastText；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115&#34;&gt;Mikolov 等人，2013b&lt;/a&gt;）、共现矩阵的低秩近似（GloVe；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130&#34;&gt;Pennington 等人，2014&lt;/a&gt;），或来自 Transformer 的深度上下文嵌入（例如 BERT、&lt;em&gt;GPT&lt;/em&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35&#34;&gt;Devlin 等人 2019&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B139&#34;&gt;Radford 等人 2022&lt;/a&gt;）。这些不同算法的输出都可以被解释为 &lt;em&gt;n&lt;/em&gt; 维概念空间，其中单词或短语由空间内的向量位置表示。本文我们只介绍 Word2Vec，它是一种被广泛使用的训练概念空间的算法（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B113&#34;&gt;Mikolov 等人，2013a&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115&#34;&gt;b&lt;/a&gt;）。&lt;/p&gt;
&lt;p&gt;Word2Vec 的一种流行实现是连续词袋（CBOW）算法，可以在 Gensim Python 库中轻松调用。该算法使用目标单词的语言上下文来预测被扣掉的目标词（可以简单地理解为让机器做完形填空题），比较适合小规模数据集。Word2Vec 还实现了另一种 Skip-Gram 算法，它反转了 CBOW 的预测任务，改为从目标单词预测上下文单词：skip-gram 将每个目标-上下文对（例如，T：“房子”，C：“宽敞”）视为单独的观测，因此可以更好地捕获精确的语义，但需要更大的语料库才能获得优越的性能。&lt;/p&gt;
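&lt;p&gt;为了直观理解两种预测任务的差异，下面的纯 Python 片段从同一句子分别生成 CBOW 的训练样本（上下文→目标词）与 Skip-Gram 的训练样本（目标词→单个上下文词）。这只是示意草图，真实训练由 gensim 等库内部完成，示例句子与窗口参数均为假设：&lt;/p&gt;

```python
def cbow_samples(tokens, window=2):
    # CBOW：用目标词前后 window 个词组成的上下文，预测被扣掉的目标词
    samples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        samples.append((context, target))
    return samples

def skipgram_samples(tokens, window=2):
    # Skip-Gram：把每个 (目标词, 上下文词) 对视为一条独立的观测
    samples = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            samples.append((target, ctx))
    return samples

sent = ['the', 'house', 'is', 'spacious']
print(cbow_samples(sent, window=1))
# [(['house'], 'the'), (['the', 'is'], 'house'), (['house', 'spacious'], 'is'), (['is'], 'spacious')]
print(skipgram_samples(sent, window=1))
# [('the', 'house'), ('house', 'the'), ('house', 'is'), ('is', 'house'), ('is', 'spacious'), ('spacious', 'is')]
```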
&lt;br&gt;
&lt;h3 id=&#34;46-维数&#34;&gt;4.6 维数&lt;/h3&gt;
&lt;p&gt;考虑维数很有必要。朴素的模型可以将不重复总词数作为维度：例如在包含 100,000 个不重复单词的语料库中，任何单词都需要 100,000 维才能准确表示。然而，当单词从上下文中被识别为相似时，可以在一定范围内减少维度数。&lt;strong&gt;维度过多会导致内存需求和冗余增加，并降低可解释性；维度太少会扭曲距离并且无法表达语言中相似关系的不可传递性&lt;/strong&gt;。只要维度数至少足以捕获所讨论的复杂语义关系，就可以获得准确的预测。&lt;/p&gt;
&lt;p&gt;在实践中，300 维已经成为一个标准，很大程度上源于最初的 Word2Vec 论文之后的惯例，该论文通过交叉验证确定了最佳维数，以减少预测屏蔽词任务中的错误。大多数后续分析都建立在更小、多样性更低的文本集合上，需要较少的维度，因此 300 通常被用作上限。最近的工作表明，应根据语料库统计数据选择维度：语料库词汇表中成对等距单词的数量提供了维度数量的下限，低于此界限通常会导致单词嵌入质量下降（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B127&#34;&gt;Patel and Bhattacharya 2017&lt;/a&gt;）。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74&#34;&gt;Hofstra 等人 (2020)&lt;/a&gt; 使用 100、200 和 300 维的模型均得到了稳健的结果。&lt;/p&gt;
&lt;p&gt;如果分析师寻求实现维度可解释性，他们必须以最小失真确定表示数据所需的维度数。但这最后一步往往很少执行，因为维度的优化需要大量的时间和计算资源。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;47-窗口尺寸&#34;&gt;4.7 窗口尺寸&lt;/h3&gt;
&lt;p&gt;回想一下，窗口大小是指算法在焦点目标词之前和之后纳入的单词数量。该窗口最小可以是 1。对于较小的窗口，算法将倾向于对句法关系进行编码（例如，名词后跟动词）。&lt;strong&gt;随着窗口大小的增加，更多的含义和语义被编码到模型输出中&lt;/strong&gt;。考虑 &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145&#34;&gt;Rodriguez 和 Spirling (2022)&lt;/a&gt; 的示例，其中包含两个句子的语料库：(1)“狮子吃肉”和 (2)“牛吃草”。当窗口大小为 1 时，我们只知道牛和狮子都会吃东西；从这个意义上说，牛和狮子在语法上是等价的，因为我们没有足够的信息来区分两者。然而，随着窗口的增加，算法开始对牛与狮子的含义差异进行更多编码。&lt;strong&gt;与维度数量一样，这里的回报也是递减的：窗口超过五个词后，模型性能仅略有改善&lt;/strong&gt;（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145&#34;&gt;Rodriguez 和 Spirling 2022&lt;/a&gt;）。&lt;strong&gt;BERT 和 GPT 系列等上下文模型具有更大的窗口，这些窗口通过注意力机制加以约束，算法借此识别哪些上下文单词对于解释焦点单词的含义很重要&lt;/strong&gt;（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B176&#34;&gt;Vaswani 等人，2017&lt;/a&gt;）。&lt;/p&gt;
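&lt;p&gt;下面的纯 Python 片段用“狮子吃肉 / 牛吃草”的例子示意窗口大小的影响：窗口为 1 时两个主语的上下文完全相同，窗口增大后才开始可以区分。代码仅为示意，语料为假设：&lt;/p&gt;

```python
def context_sets(sentences, window):
    # 统计每个词在给定窗口内共现过的词集合
    contexts = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            contexts.setdefault(w, set()).update(ctx)
    return contexts

corpus = [['lions', 'eat', 'meat'], ['cows', 'eat', 'grass']]

small = context_sets(corpus, window=1)
print(small['lions'] == small['cows'])   # True：窗口为 1 时，牛与狮子的上下文完全相同

large = context_sets(corpus, window=2)
print(sorted(large['lions']), sorted(large['cows']))  # ['eat', 'meat'] ['eat', 'grass']
```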
&lt;br&gt;
&lt;h3 id=&#34;48-验证模型&#34;&gt;4.8 验证模型&lt;/h3&gt;
&lt;p&gt;最后一步是验证词嵌入模型，这样做是为了确认算法学习的表示与文本数据所承载的真实人类活动的概念空间尽可能相近。论文附录第 2 节描述了关于专利嵌入的七个详细验证程序，表明该模型有效地学习了创新空间的表示。这些程序包括：(1) 邻近嵌入词的语义相似性；(2) 随嵌入距离变化的语义梯度；(3) 嵌入簇与语义域之间的对应关系；(4) 物理世界距离与嵌入之间的相关性；(5) 社会距离与嵌入之间的相关性；(6) 嵌入空间类比推理的准确性；(7) 嵌入文档的语义一致性。我们还讨论了第八个“额外”测试，即图灵测试（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B174&#34;&gt;Turing 1950&lt;/a&gt;）。由 Transformer 支持的现代上下文嵌入的评估标准，是它们能否与人类毫无区别地参与任何分类、关联、意义生成或整合任务，包括普通对话和专家教程。OpenAI 的 ChatGPT 和许多竞争的聊天机器人已经展示了如此强大的性能，以至于图灵测试正在迅速从性能上限转变为基线基准。这些验证步骤与论文最后部分的测量相结合，为嵌入模型提供了有用的检验与度量手段，使研究人员能够对其编码的概念空间给出结构化的解释。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;49-词嵌入方法的优点和缺点&#34;&gt;4.9 词嵌入方法的优点和缺点&lt;/h3&gt;
&lt;h4 id=&#34;491--无需正式指定相关尺寸&#34;&gt;4.9.1  无需正式指定相关尺寸&lt;/h4&gt;
&lt;p&gt;对概念建模的正式尝试试图通过逻辑演绎方法清楚地枚举概念的相关维度（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B47&#34;&gt;Gärdenfors 2004&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B48&#34;&gt;Gärdenfors 2014&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65&#34;&gt;Hannan 等人 2019&lt;/a&gt;）。尽管这种方法对于理解限定领域内的概念很有用，但即便如此，它也可能不切实际且难以测量，因为很难先验地陈述分析师应预期的相关维度&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B73&#34;&gt;（Hofstadter 和 Sander 2013）&lt;/a&gt;。&lt;strong&gt;词嵌入的优点在于，概念之间的关系以及对任何给定概念重要的相关维度可以从语言的使用方式中推断出来，因此不需要事前指定&lt;/strong&gt;。鉴于在分析之前没有必要陈述相关维度，即使是最复杂的组织行为场景也变得易于分析处理。正如其他人所指出的，“词嵌入为语言中包含的多个维度的含义提供了全面且有意义的见解，这是以前的方法无法捕获的”（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105&#34;&gt;Lix 等人，2022 年&lt;/a&gt;，第 8434 页）。在某种程度上，这种优势源于这样一个事实：神经网络架构能高效地记录意义的维度。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;492-更大的有效维度&#34;&gt;4.9.2 更大的有效维度。&lt;/h4&gt;
&lt;p&gt;嵌入通常由 100 到 1,000 个密集编码维度表示。&lt;strong&gt;编码的密度意味着每个词向量在所有建模维度上都有一个非零坐标&lt;/strong&gt;。正如附录中所指出的，主题模型可能具有相同数量的主题（例如，100-1,000），但这些主题被稀疏编码以方便人类解释，使得每个主题仅有少数实质非零的单词载荷，并且每个文档仅有少量非零的主题载荷。&lt;strong&gt;因此，主题模型是为了描述而构建的，但代价是迫使其表示的有效维度从数百个减少到几个，从而扭曲了本来可以在主题空间内计算的距离。相比主题模型，词嵌入使用密集编码，单个维度难以理解和描述，但距离具有更大的自由度，可以更精确地编码含义&lt;/strong&gt;。通过这种方式，相对于低维理论和测量，嵌入为分析师提供了“大量潜在轴，个人和社会群体可以沿着这些轴竞争、合作、分裂或合并”（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski et al. 2019，p.27&lt;/a&gt;）。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;493-无监督训练&#34;&gt;4.9.3 无监督训练。&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;词嵌入还有一个特殊优点，即训练模型时以看似无监督或自监督的方式进行，从而避免了手动编码文本语义内容的繁琐，完全由机器学习&lt;/strong&gt;。实际上，在我们的创新示例中，向量空间由专利语料库中的每个发明人按照其所写句子的数量和长度的比例进行“监督”：每个单词的滑动窗口都是为了向专利审查员和未来的发明者传达一种含义而构建的，而算法正是利用这些窗口来构建向量空间，并以概念上适当的方式定位单词。因此，学者们可以利用专利语料库来训练 &lt;strong&gt;技术创新&lt;/strong&gt; 的概念空间，利用财报电话会议记录和新闻稿来训练 &lt;strong&gt;上市公司沟通&lt;/strong&gt; 的概念空间，利用分析师报告来训练 &lt;strong&gt;投资分析&lt;/strong&gt; 的概念空间，或者使用内部通信（例如 Slack 和电子邮件）训练理解公司内部知识的特定领域概念空间。这些概念空间可以在最少的监督下进行训练，因此很快成为有价值的观察站，用于追踪组织科学家关注的组织生活的静态和动态（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74&#34;&gt;Hofstra et al. 2020&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184&#34;&gt;Whalen et al. 2020&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18&#34;&gt;Burtch et al. 2021&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B177&#34;&gt;Waller 和 Anderson 2021&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2&#34;&gt;Aceves 等人 2022&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20&#34;&gt;Carlson 2022&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B23&#34;&gt;Chambers 等人 2022&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61&#34;&gt;Guzman 和 Li 2023&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94&#34;&gt;Lawson 等人 2022&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105&#34;&gt;Lix 等人 2022&lt;/a&gt;）。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;494-共现是不必要的&#34;&gt;4.9.4 共现是不必要的。&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;这些模型的另一个优点是，两个概念不必在任何文档中同时出现，就可以将它们编码为相似的向量&lt;/strong&gt;。所需要的只是它们与相似的概念同时出现。例如，我们可以先验地指出 &lt;em&gt;&lt;strong&gt;医生&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;律师&lt;/strong&gt;&lt;/em&gt; 在某些方面非常相似（例如，他们需要高级学位，具有高收入水平等），但他们可能永远不会同时出现在语料库的同一文档中。尽管彼此之间缺乏共现性，但它们很可能都独立地与&lt;em&gt;高收入&lt;/em&gt;、&lt;em&gt;高学历&lt;/em&gt;、&lt;em&gt;白领&lt;/em&gt;等概念同时出现，从而最终拥有编码这些相似性的接近向量。&lt;strong&gt;因此，嵌入模型的底层计算架构可以更好地近似社会和文化含义，而无需求助于严格的共现&lt;/strong&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;495-上下文相关的含义结构&#34;&gt;4.9.5 上下文相关的含义结构。&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;使用定制训练的嵌入模型的一个优点是它将捕获上下文相关的含义结构&lt;/strong&gt;。例如，&lt;em&gt;&lt;strong&gt;“甜”&lt;/strong&gt;&lt;/em&gt; 的含义在软件团队的背景下与 &lt;em&gt;&lt;strong&gt;烹饪&lt;/strong&gt;&lt;/em&gt; 的背景下会有所不同。正如 &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105&#34;&gt;Lix 等人 (2022)&lt;/a&gt; 指出，在软件团队的背景下，与 &lt;em&gt;&lt;strong&gt;“甜蜜”&lt;/strong&gt;&lt;/em&gt; 最接近的术语是 &lt;em&gt;&lt;strong&gt;“强烈”&lt;/strong&gt;&lt;/em&gt;、 &lt;em&gt;&lt;strong&gt;“兴奋”&lt;/strong&gt;&lt;/em&gt; 和 &lt;em&gt;&lt;strong&gt;“耶”&lt;/strong&gt;&lt;/em&gt;。此外，就同一个单词编码不同概念（一词多义）而言，单词每种含义的概念信息都以线性叠加的方式存在于单词嵌入内（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B5&#34;&gt;Arora et al. 2018&lt;/a&gt;）。这意味着编码诸如 &lt;em&gt;&lt;strong&gt;“Bank”&lt;/strong&gt;&lt;/em&gt; 之类的单词的 &lt;em&gt;n&lt;/em&gt; 维向量包含其代表的所有概念的概念信息，例如 &lt;em&gt;&lt;strong&gt;河边&lt;/strong&gt;&lt;/em&gt; 或 &lt;em&gt;&lt;strong&gt;金融机构&lt;/strong&gt;&lt;/em&gt;。通过这种方式，即使在多义词的情况下，单词的上下文相关含义也会被编码到模型中。当这些上下文相关的含义不仅不同，而且是排他的或相反的时，来自 Transformer 的上下文相关嵌入可以为上下文中的每个单词呈现不同的单词向量。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;496-几何有助于概念人群体和组织的细粒度表示&#34;&gt;4.9.6 几何有助于概念、人、群体和组织的细粒度表示。&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;我们认为，词嵌入模型可以在训练的语料库范围内产生人类活动概念空间的细粒度表示&lt;/strong&gt;。&lt;strong&gt;这意味着，从概念空间内编码的信息中，我们可以恢复个人、群体和组织本身的细粒度表示&lt;/strong&gt;。以我们的创新案例为例，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F1&#34;&gt;图 1&lt;/a&gt;描述了在说明性二维空间中这是如何实现的。学习到的概念空间以单词或短语 &lt;em&gt;w&lt;/em&gt; 所表示的概念作为其最小的分析单元。我们的简化示例显示了排列在二维空间中的九个单词：单词 1-3 由发明人 1 使用，单词 4-6 由发明人 2 使用，单词 7-9 由发明人 3 使用。&lt;strong&gt;通过计算每个人所用单词向量的质心向量，我们可以得出每个发明人在创新概念空间中的位置&lt;/strong&gt;。&lt;strong&gt;将这个过程提升到团队和组织级别，我们可以在概念空间内为发明人团队和组织得出独特的向量&lt;/strong&gt;。因此，词嵌入架构不仅在概念这一最小分析单元上是细粒度的，而且还可以在更聚合的层级上提供细粒度的表示。相对于团队多样性、组织差异化和注意力等构念的粗粒度代理指标，这是显著的测量改进；这些构念在嵌入特定概念空间时才真正具有意义。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/figure-1.jpeg&#34; alt=&#34;&#34;  /&gt;
&lt;strong&gt;图 1.嵌入作为概念、人员、群体和组织的细粒度表示&lt;/strong&gt;&lt;/p&gt;
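&lt;p&gt;图 1 所示“词 → 发明人 → 团队”的逐级聚合可以用质心（均值）向量简单示意。以下纯 Python 片段中的二维词向量与发明人归属均为虚构假设：&lt;/p&gt;

```python
def centroid(vectors):
    # 质心向量：各维度分别取算术平均
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

# 假设的二维词向量（对应图 1 的示意空间）
word_vecs = {
    'w1': [1.0, 2.0], 'w2': [2.0, 3.0], 'w3': [3.0, 1.0],   # 发明人 1 使用的词
    'w4': [7.0, 8.0], 'w5': [8.0, 9.0], 'w6': [9.0, 7.0],   # 发明人 2 使用的词
}

inventor1 = centroid([word_vecs[w] for w in ['w1', 'w2', 'w3']])
inventor2 = centroid([word_vecs[w] for w in ['w4', 'w5', 'w6']])
team = centroid([inventor1, inventor2])   # 再向上聚合得到团队向量

print(inventor1)  # [2.0, 2.0]
print(inventor2)  # [8.0, 8.0]
print(team)       # [5.0, 5.0]
```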
&lt;br&gt;
&lt;h4 id=&#34;497-细粒度几何减少了上下文信息的丢失&#34;&gt;4.9.7 细粒度几何减少了上下文信息的丢失。&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;由于粗糙、粗粒度的代理指标无法承载相关信息，在实证分析和相关理论构建中就无法利用这些信息&lt;/strong&gt;。嵌入模型的优势在于其独特的信息表征，可以携带更多的信息：信息的粒度更小，保存的信息量更多。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F2&#34;&gt;图 2&lt;/a&gt;使用团队多样性的示例来说明如何实现这一点。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F2&#34;&gt;图 2(a)&lt;/a&gt;显示了两个团队，1 和 2，每个团队要么通过熵（一种标准的集合论多样性理论度量；顶行）来表示，要么通过概念广度（一种调动团队成员底层概念信息的细粒度度量；底行）来表示。团队 1 和团队 2 都有四名成员：团队 1 由两名生物化学家、一名化学家和一名分析化学家组成，团队 2 由两名生物化学家、一名海洋学家和一名计算机科学家组成。&lt;strong&gt;由于两个团队的团队成员类型比例相同，因此它们都被编码为具有相同的团队多样性熵度量 1.5&lt;/strong&gt;。&lt;strong&gt;然而，当考虑团队成员的概念信息时，我们发现它们是本质上不同类型的团队：团队 1 的多样性（概念范围）远不如团队 2&lt;/strong&gt;。这表明粗粒度的测量可能会留下未开发的有价值的上下文信息（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B187&#34;&gt;Wolpert et al. 2014&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B33&#34;&gt;DeDeo 2017&lt;/a&gt;）。因此，我们应该看到更细粒度的衡量标准与相关的、理论上的绩效结果之间的联系更加紧密和一致。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/figure-2.jpeg&#34; alt=&#34;&#34;  /&gt;
&lt;strong&gt;图 2.（在线彩色)细粒度表示可防止有价值的信息丢失&lt;/strong&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;专利数据集使我们能够通过三种度量来说明这一主张。第一种是集合论的团队多样性度量，使用团队先前专利在专利主要类别中的分布（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B79&#34;&gt;Huo 等人，2019&lt;/a&gt;）。第二种替代度量使用专利子类，从而提供比第一种更细粒度的度量。第三种度量依赖于团队成员先前专利在创新概念空间内的&lt;strong&gt;概念广度&lt;/strong&gt;。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;498-词嵌入的局限&#34;&gt;4.9.8 词嵌入的局限。&lt;/h4&gt;
&lt;p&gt;到目前为止，我们的注意力仅限于讨论嵌入模型的结构，描述它们与概念空间的关系，并说明它们的优点。在这里我们将说明其局限性，讨论它们的严重程度、缓解方式，以及在哪些情况下不宜使用词嵌入。我们讨论三类限制。第一类源于神经网络模型普遍复杂的“黑匣子”性质，以及这带来的具体挑战，涉及输入数据的偏差，以及模型可以正确推理的范围，特别是那些在超出分析师研究背景的数据上预训练的模型。第二类与这些模型的大小以及训练它们所需的数据量有关。第三类涉及词嵌入模型的具体局限性，以及从脱离韵律和表达上下文的文本数据中分析含义的挑战。&lt;/p&gt;
&lt;p&gt;许多学者首先担心的是，多级神经网络模型显得复杂且在统计上难以理解，&lt;strong&gt;经常被批评为“黑匣子”方法&lt;/strong&gt;，无法“打开”以询问其性能背后的机制（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B86&#34;&gt;Knight 2017&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B95&#34;&gt;Leavitt et al. 2021&lt;/a&gt;）。现代神经网络词嵌入模型通常作为自监督模型实现，该模型启发式搜索单词之间的依赖关系空间以预测屏蔽词的身份。&lt;strong&gt;自从第一个高性能嵌入发布（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115&#34;&gt;Mikolov 等人，2013b&lt;/a&gt;）以来，对其黑盒性质的一些担忧已经减弱，因为数学家发现，最流行的“浅层”词嵌入模型（如 Word2Vec 和 FastText）的强大性能很大程度上来自于对易于理解的矩阵分解方法的近似&lt;/strong&gt;，例如因子分析、主成分分析和对应分析（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B100&#34;&gt;Levy 和 Goldberg 2014&lt;/a&gt;）。&lt;/p&gt;
&lt;p&gt;“黑盒”输入输出方法带来的一个相关潜在限制是，&lt;strong&gt;输入的偏差将转化为输出中的偏差&lt;/strong&gt;：用于训练嵌入的语料库的偏差将被编码在生成的单词嵌入模型中（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B14&#34;&gt;Bolukbasi等人，2016&lt;/a&gt;）。当模型用于现实世界的下游应用程序（例如推荐服务）时，这可能是有害的。例如，硬编码到嵌入中的种族和性别刻板印象可能会导致有偏见的建议（例如，评估是否适合招聘职位或预测财务违约的可能性），并导致不公平和不道德的决定（例如，拒绝工作或信贷）。学者们应该根据他们的研究问题和设计，主动考虑这种负外部性是否可能出现，并在对人类造成伤害的可能性足够高时，相应放弃使用嵌入。&lt;strong&gt;然而，当理解社区和研究背景中概念关联的本质正是研究核心时，研究人员反而需要保留这些偏见用于分析。如果不包括它们，模型以及研究设计就会错过表征其研究背景的关键社会和文化规律。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;如果分析人员对生成语料库的上下文没有清晰的了解，就会出现另一个相关的限制：他们最终可能会做出不适用和不相关的推论&lt;/strong&gt;。例如，强调意义随时间变化的研究（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19&#34;&gt;Caliskan et al. 2017&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49&#34;&gt;Garg et al. 2018&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski et al. 2019&lt;/a&gt;）表明，词义会因外源冲击而出现间断性变化，从而重新配置概念关联在整个空间中的结构。想一想 2005 年卡特里娜飓风之后“卡特里娜”的含义发生了怎样的变化；2009 年金融危机之后，金融术语的含义发生了重新配置，部分原因是添加了“问题资产救助计划”等许多新术语。忽略外源冲击可能会导致对后文测量与验证部分所述指标的错误解释，误将其视为仅由渐进演化产生的结果，从而导致错误的推论。这个问题尤其突出，因为许多最准确的词嵌入模型都是在从网络上提取的大量文本语料库上进行预训练的。此类模型常被用于为规模很小的文本数据估计有意义的距离，这是一项常见任务，但&lt;strong&gt;如果预训练数据是异构的，则距离可能无法反映焦点文本的概念世界&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;接下来的两个限制必然是其嵌入优势的另一面。词嵌入模型产生的细粒度信息会带来特定研究可能或可能无法维持的成本。首先是模型尺寸。&lt;strong&gt;每个单词的数百个维度的细粒度信息或上下文嵌入需要比简单的字典计数或潜在狄利克雷分配主题模型更大的存储空间&lt;/strong&gt;。这与通常用于将数据维度减少到两个或三个的因子和主成分分析形成鲜明对比。词嵌入模型使用更多维度（通常为 200-500）来更准确地预测数据的屏蔽部分。尽管如此，当前个人计算机的计算能力和存储能力现在允许训练合理大小的嵌入。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;与此相关的是，词嵌入模型需要比先前模型更多的文本才能稳健地估计概念空间&lt;/strong&gt;。当大型语料库与研究主题相似并且可以用作理论相关文档或微调过程的初始化的代理时，可以通过迁移学习来弥补这一挑战。&lt;strong&gt;然而，有时相关语言在内容、目的或形式上与模型预训练的数据有很大不同，它需要独立建模，但又足够小，无法维持对嵌入模型的稳健估计。在这种情况下，使用字典计数或主题模型可能会更好，因为数据只能维持粗粒度的关联，而这些方法旨在捕获粗粒度的关联。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;最后一类涉及词嵌入以及一般文本方法的特有局限。首先，静态词嵌入本身并不处理一词多义，即一个词（例如 &lt;em&gt;&lt;strong&gt;“bank”&lt;/strong&gt;&lt;/em&gt; ）编码多个概念（例如金融机构、河边、侧向倾斜）的情况。尽管多义词的存在可能会影响后续一些指标的测量，但也存在抵消的力量。一方面，研究发现多义词的含义以相互线性叠加的方式编码在单词向量内（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B5&#34;&gt;Arora et al. 2018&lt;/a&gt;）。这意味着该算法通过同时考虑单词的所有含义来对单词在概念空间中的位置进行编码，从而克服了原本可能存在的严重缺陷。另一方面，上下文嵌入架构（在线附录中有更详细的描述）通过根据焦点词周围的上下文输出不同的向量来明确解决多义词的问题：每个单词不是单个向量，而是根据用途而变化的向量云。如果分析师怀疑一词多义可能是特定分析的严重问题，他们可以转而使用上下文嵌入来规避这种担忧。&lt;/p&gt;
&lt;p&gt;最后一个潜在的限制是文本方法的一般特征。只要文本数据是转录语音话语的产物（例如，欧洲央行或美联储主席演讲、政治演讲、财报电话会议、电视或电影文字记录、对话互动），语音的语调、语气和音色就不会被纳入嵌入表示中。&lt;strong&gt;鉴于某些语言（例如中文）更严重地依赖语调来传达含义，这可能或多或少存在问题，具体取决于话语发生的社会背景及其表达语言&lt;/strong&gt;。因此，在语调和语气在语料库中发挥重要作用的情况下，学者们应该讨论他们的嵌入模型选择和解释决策的后果。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;410-在研究中使用词嵌入模型的路线图&#34;&gt;4.10 在研究中使用词嵌入模型的路线图&lt;/h3&gt;
&lt;p&gt;现在我们对词嵌入模型是什么、如何表示概念空间、如何训练、有何优点和局限已经有了框架性的认知，接下来可以将它们整合到研究和理论构建的标准方法中。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#T1&#34;&gt;表 1&lt;/a&gt;列出了如何将嵌入模型集成到科学流程中的路线图。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;步骤 1-3 是研究过程中的标准步骤，包括确定一个可行且有趣的研究问题，通过在适当的实证背景下进行评估，为重要的理论问题提供信息（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B179&#34;&gt;Weick 1989&lt;/a&gt;）。&lt;/li&gt;
&lt;li&gt;步骤 4-9 总结了本文到目前为止对嵌入模型的讨论。&lt;/li&gt;
&lt;li&gt;步骤 10 和 11 与下一节的指标度量有关：计算相关度量，并通过标准定量和定性方法加以运用。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;表 1.在研究中使用词嵌入模型的路线图&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;步骤&lt;/th&gt;
&lt;th&gt;活动&lt;/th&gt;
&lt;th&gt;基本原理&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. &lt;strong&gt;确定研究问题&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;确定文本数据能否帮助回答这一重要的理论研究问题。&lt;/td&gt;
&lt;td&gt;促使研究人员将注意力聚焦在理论问题与可由词嵌入操作化的研究构念的交叉点上。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. &lt;strong&gt;理论建立及相关理论构建&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;确定使用哪种理论框架来解决研究问题以及通过嵌入模型来操作哪种理论构念。&lt;/td&gt;
&lt;td&gt;理论构念与其词嵌入指标(构念的衡量）之间的紧密联系能够实现累积的理论发展。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. &lt;strong&gt;定义经验背景&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;选择适当的实证背景，在其中回答研究问题并动员理论框架和构念。&lt;/td&gt;
&lt;td&gt;确保研究问题、理论框架和所用构念以逻辑方式相互加强。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.&lt;strong&gt;指定将用于表示经验背景的概念空间的文本数据&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;描述将用于训练词嵌入模型和测量感兴趣的理论构念的文本数据的范围。 数据是否有效地涵盖了您想要得出理论结论的经验背景下的行为活动范围？&lt;/td&gt;
&lt;td&gt;确保用于计算理论构造度量的词嵌入模型在逻辑上映射到并有效地代表所提出的理论框架内的实证研究背景。 文本数据的范围应该在逻辑上映射到所讲述的理论故事的范围。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5.&lt;strong&gt;确定文本数据的大小和范围&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;数据是否足够大以学习相关概念空间的准确表示？&lt;/td&gt;
&lt;td&gt;文本数据的大小将决定是否应该训练自定义嵌入，或者是否应该使用可用数据对现成的嵌入进行微调。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. &lt;strong&gt;给定数据大小，要么训练独特的词嵌入模型，要么微调现有模型&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;如果文本数据足够大，则训练自定义嵌入来表示感兴趣的经验上下文的概念空间。 如果文本数据不够大，请使用这些数据来微调现有的现成嵌入模型。&lt;/td&gt;
&lt;td&gt;确保用于测量理论结构的嵌入模型能够有效地表示经验背景的相关概念空间。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. &lt;strong&gt;如果训练独特的模型，请选择一种算法&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;在连续词袋 (CBOW) 或 Skip-Gram 模型之间进行选择。&lt;/td&gt;
&lt;td&gt;CBOW：在较小的数据集上可以有更好的性能。 &lt;br&gt;Skip-gram：可以更好地捕获语义。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. &lt;strong&gt;如果训练独特的模型，确定相关参数&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;选择窗口大小和维数。&lt;/td&gt;
&lt;td&gt;窗口大小：标准做法是 5。较小的窗口可以更大程度地捕获语法，较大的窗口可以更大程度地捕获语义，但收益递减并增加计算成本。 维度数：标准做法是 300，超过此点后性能回报递减。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9. &lt;strong&gt;验证词嵌入模型&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;请遵循在线附录中的验证程序。&lt;/td&gt;
&lt;td&gt;确认嵌入模型准确有效地表示了经验背景的概念空间。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10. &lt;strong&gt;计算相关度量&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;通过确定将用于实施感兴趣的理论构念的相关概念集，创建“实际措施和应用”部分中的措施之一。&lt;/td&gt;
&lt;td&gt;使学者能够将该测量用于定量或定性分析。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11. &lt;strong&gt;在标准定性或定量方法中使用计算的度量&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;对于定量分析，该度量要么成为自变量，要么成为因变量。 对于定性分析，学者可以提供解释性分析，因为它们可能适用于其他类型的档案、民族志或视听数据。&lt;/td&gt;
&lt;td&gt;嵌入模型表示对生成数据的社会背景的概念空间的描述。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;五实际措施与应用&#34;&gt;五、实际措施与应用&lt;/h2&gt;
&lt;p&gt;现在已经正式定义了 &lt;strong&gt;概念&lt;/strong&gt; 和 &lt;strong&gt;概念空间&lt;/strong&gt; 的含义，并说明了先前的文献如何处理概念信息,  介绍了嵌入模型表示能力的底层逻辑，并在在线附录中完成了支持这种直觉的几个验证步骤。也评论了嵌入模型给概念信息分析带来的几个优点和相关缺点。&lt;/p&gt;
&lt;p&gt;在本章中，我们将介绍一些新研究， 学习他们如何用嵌入生成独特指标。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#T2&#34;&gt;表 2&lt;/a&gt;总结了这些指标及示例应用。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;表 2.词嵌入测量和示例应用&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;措施&lt;/th&gt;
&lt;th&gt;研究性学习&lt;/th&gt;
&lt;th&gt;关键构念&lt;/th&gt;
&lt;th&gt;研究问题&lt;/th&gt;
&lt;th&gt;代表性调查结果&lt;/th&gt;
&lt;th&gt;嵌入在这种情况下的优点&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. &lt;strong&gt;概念广度&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105&#34;&gt;Lix 等人 (2022)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;话语多样性——在一组给定的互动中，群体成员所传达的含义彼此分歧的程度。&lt;/td&gt;
&lt;td&gt;一个群体的话语多样性如何影响其绩效？&lt;/td&gt;
&lt;td&gt;高绩效团队会调整他们的共享认知以匹配任务的要求（例如，构思与协调）。&lt;/td&gt;
&lt;td&gt;能够随着时间的推移以细粒度的细节和动态地追踪小组对话的概念广度，使学者们能够追踪话语多样性的新理论构造。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.&lt;strong&gt;概念距离和相似度&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74&#34;&gt;Hofstra 等人 (2020)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;语义遥远的科学新颖性：博士论文中新链接概念的语义距离。&lt;/td&gt;
&lt;td&gt;代表性不足的群体是否更有可能产生科学创新？&lt;/td&gt;
&lt;td&gt;相对于男性，女性引入了更遥远的新奇事物。 然而，这种语义上遥远的新颖性在该学科中很少受到关注。&lt;/td&gt;
&lt;td&gt;能够追踪新概念组合的概念距离，使学者不仅可以研究是否做出了新组合，还可以研究这些组合的语义距离最终如何影响其影响。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.&lt;strong&gt;概念X性&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94&#34;&gt;Lawson 等人 (2022)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;性别刻板印象：男性（而非女性）与以成就为导向的代理特征（例如自信和果断）相关的程度。&lt;/td&gt;
&lt;td&gt;雇用女性首席执行官和董事会成员是否与组织对代理语言的性别使用发生变化有关？&lt;/td&gt;
&lt;td&gt;当组织雇用女性首席执行官和董事会成员时，女性的语义与代理的语义变得更加一致。&lt;/td&gt;
&lt;td&gt;对 22 家标准普尔 500 强公司的 43,000 多份文件（包含超过 12 亿字）进行分析，深入细致地研究女性的含义如何因聘用女性领导者而发生变化。否则这样的分析是不可能的。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4.概念意义&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63&#34;&gt;Hamilton 等人 (2016)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;词语的文化意义：词语的含义随时间变化的程度。&lt;/td&gt;
&lt;td&gt;语义演化的可能驱动因素是什么？&lt;/td&gt;
&lt;td&gt;跨历史时期的语义变化率与词频的逆幂律成正比。 与频率无关，具有更多含义的单词具有更高的语义变化率。&lt;/td&gt;
&lt;td&gt;能够探索跨多个知识和文化领域的大型历史时期和大量文本中的语义变化。例如，他们可以详细追踪同性恋这个词的含义如何从&lt;em&gt;快乐&lt;/em&gt;和&lt;em&gt;艳丽&lt;/em&gt;等概念转向&lt;em&gt;同性恋&lt;/em&gt;和&lt;em&gt;女同性恋&lt;/em&gt;等概念。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. &lt;strong&gt;文化和知识连续体中的概念立场&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski 等人 (2019)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;社会阶层标记：区分社会阶层维度的概念。&lt;/td&gt;
&lt;td&gt;20世纪社会阶级的标志是如何变化的？&lt;/td&gt;
&lt;td&gt;尽管社会阶级维度在历史上保持稳定，但阶级文化标记在每个维度中的定位方式却不断发生变化（例如，&lt;em&gt;员工&lt;/em&gt;从&lt;em&gt;士兵&lt;/em&gt;和&lt;em&gt;肌肉&lt;/em&gt;等概念转向&lt;em&gt;白领&lt;/em&gt;和&lt;em&gt;中产阶级&lt;/em&gt;等概念）。&lt;/td&gt;
&lt;td&gt;能够将文化相关的概念投射到文化相关的兴趣连续体上，从而使研究人员不仅可以在单个历史时期内而且可以在其历史演变过程中了解广泛共享的社会关联。&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. &lt;strong&gt;概念维度&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski 等人 (2019)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;阶级的文化维度：理解社会阶级的维度（富裕、教育、修养、地位、就业、道德、性别）&lt;/td&gt;
&lt;td&gt;20 世纪文化阶层的规模有多稳定？&lt;/td&gt;
&lt;td&gt;20世纪，尽管发生了巨大的经济转型，阶级规模仍然非常稳定。&lt;/td&gt;
&lt;td&gt;能够对阶级的多个概念维度进行实证分析，从而理解 20 世纪美国它们之间的相互关系。&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;h3 id=&#34;51-概念广度&#34;&gt;5.1 概念广度&lt;/h3&gt;
&lt;h4 id=&#34;511-指标&#34;&gt;5.1.1 指标&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;可以测量文档中单词之间的距离来计算它们在概念空间中的分布范围&lt;/strong&gt;。文档可以是从专利到个人电子邮件通信的任何内容。我们可以测量每个单词与其他单词的平均距离有多远。&lt;strong&gt;计算文档内元素的平均两两距离（或每个单词与文档质心之间的距离）可以衡量该文档内的「概念广度」&lt;/strong&gt;。例如，要衡量每项专利的概念广度，可以从两个简单的文档开始：&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;doc1 =  [&amp;#34;biochemistry&amp;#34;, &amp;#34;chemistry&amp;#34;, &amp;#34;analytical_chemistry&amp;#34;]
doc2 =  [&amp;#34;chemistry&amp;#34;, &amp;#34;oceanography&amp;#34;, &amp;#34;computer&amp;#34;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;使用我们的专利嵌入模型，我们得到第一组(doc1)的平均宽度为 29，第二组（doc2）平均宽度为 47。这表明第二组在概念上比第一组更广泛。&lt;/p&gt;
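&lt;p&gt;概念广度的计算可以示意如下：对集合内所有词对求平均余弦距离。以下纯 Python 片段中的词向量为虚构示例，并非论文中的真实专利嵌入，数值大小仅用于说明 doc2 比 doc1 更“广”：&lt;/p&gt;

```python
import math

def cosine_distance(u, v):
    # 余弦距离 = 1 - 余弦相似度
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

def conceptual_breadth(vectors):
    # 概念广度：集合内所有词对的平均两两余弦距离
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# 假设的三维词向量（纯属示意）
vecs = {
    'biochemistry': [0.9, 0.1, 0.0],
    'chemistry': [0.8, 0.2, 0.0],
    'analytical_chemistry': [0.85, 0.15, 0.0],
    'oceanography': [0.1, 0.9, 0.1],
    'computer': [0.0, 0.1, 0.9],
}

doc1 = ['biochemistry', 'chemistry', 'analytical_chemistry']
doc2 = ['chemistry', 'oceanography', 'computer']

b1 = conceptual_breadth([vecs[w] for w in doc1])
b2 = conceptual_breadth([vecs[w] for w in doc2])
print(b1 < b2)   # True：doc2 在概念上比 doc1 更广
```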
&lt;p&gt;当我们衡量文档集合而不是单词的概念广度时，同样的逻辑也适用。例如，我们想了解发明者团队的广度。在这种情况下，我们可以将团队中的每个发明人视为嵌入概念空间中的“文档”，参考如图&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F1&#34;&gt;1&lt;/a&gt; , 从下往上，依次是词概念空间、发明人概念空间、团队概念空间、组织概念空间。一个发明人团队的成员已经在涉及纳米技术、生物技术和软件的概念空间领域发表了先前的专利，那么在概念上将被认为比所有成员只发表了纳米技术专利的团队更广泛。即使所有发明人都将其公开的专利限制在一个类别内，该指标仍然会提供显着的变化。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/figure-1.jpeg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;512--应用&#34;&gt;5.1.2  应用&lt;/h4&gt;
&lt;p&gt;这种概念广度的度量已在最近的工作中用于追踪各种理论构念。&lt;strong&gt;&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105&#34;&gt;Lix 等人 (2022)&lt;/a&gt; 衡量团队成员在参与软件项目不同阶段时的话语广度&lt;/strong&gt;。&lt;strong&gt;他们能够追踪每个独特项目阶段概念参与的多样性，发现表现最好的团队有能力改变他们的认知以适应手头不断变化的任务：在提出新想法时表现出更大的话语广度，而在转向依赖协调的任务时广度降低。这种细粒度的知识参与概念很难用以前的文本分析方法来追踪&lt;/strong&gt;。 详细内容可阅读大邓近期推文 &lt;a href=&#34;https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/&#34;&gt;MS2022 | 使用语言差异性测量团队认知差异性&lt;/a&gt; 。&lt;/p&gt;
&lt;p&gt;另外，研究人员使用概念广度来追踪在线社区成员的注意力分配范围如何随其地位变化，发现地位和注意力广度之间存在 U 形关系（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2&#34;&gt;Aceves et al. 2022&lt;/a&gt;）。这些研究人员训练了 150 个知识领域的概念空间，从而能够追踪不同知识领域的相似注意力动态，从计算机编程和数学到育儿和园艺。由于他们有能力在数百个社区的文本中大规模部署算法，因此他们能够计算出超过 2000 万成员如何在这些问答社区上发布的 2300 万个问题中分配注意力。&lt;/p&gt;
&lt;p&gt;其他工作在整个语言中实施了这种方法，追踪语言在所有知识领域具有更宽或更窄的概念空间的程度（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B1&#34;&gt;Aceves 和 Evans 2021&lt;/a&gt;）。使用圣经、电影字幕和以多种语言编写的政治文件等文本的并行翻译（包含相同的信息但以不同的语言编码），他们能够追踪概念在不同语言中相互关联的程度存在显着差异。他们发现，尽管一些语言将不同的概念子空间紧密地联系在一起，并将不同的概念领域编织在一起，但其他语言却稀疏且更加支离破碎，更强烈地分隔了不同的意义域。然后，他们观察概念空间的语言密度如何塑造数百种语言的真实对话和维基百科文章的概念广度。&lt;/p&gt;
&lt;p&gt;所有三篇论文都为不同文献的研究开辟了新的理论途径，例证了该方法的潜力。如果没有概念空间的概念及其通过嵌入模型的表示，这些新的研究途径将很难实施。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;52-概念距离和相似度&#34;&gt;5.2 概念距离和相似度&lt;/h3&gt;
&lt;h4 id=&#34;521-指标&#34;&gt;5.2.1 指标&lt;/h4&gt;
&lt;p&gt;当我们的分析重点在于集合内的元素时，前面描述的概念广度构念是相关的。当我们的分析重点是不同集合之间的关系时，可以使用相同的基础度量。在这种情况下，我们所指的将是概念距离或相似性，而不是概念广度。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn14&#34;&gt;13&lt;/a&gt;形式上，如果我们有至少两个集合，每个集合中至少有一个元素，我们可以计算这些集合之间的&lt;strong&gt;概念距离，即每个集合的质心（多维平均值）之间的距离&lt;/strong&gt;。最基本的情形是，我们可以计算两个各包含一个单词的集合之间的概念距离，这无非就是衡量这两个词之间的概念距离。随着元素数量和集合数量的增加，底层计算保持不变，但理论可能性的范围扩大。还可以通过训练文档嵌入模型来计算这种距离/相似性度量：文档嵌入模型在嵌入空间中为每个文档分配一个向量，训练时将文档本身视为与文中单词共现的一个特殊“词”，并按单词共现的相同逻辑训练其权重（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B96&#34;&gt;Le 和 Mikolov 2014&lt;/a&gt;）。&lt;/p&gt;
&lt;p&gt;通过将概念相似性与衡量专利相似性的现有技术进行比较，我们可以一睹该衡量标准的潜力。首先，研究人员可以通过查看专利授予机构使用的官方分类来追踪专利的相似性，同一类别的专利被认为比不同类别的专利更相似（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B157&#34;&gt;Singh 和 Marx 2013&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B3&#34;&gt;Aharonson 和 Schilling 2016&lt;/a&gt;）。这种方法的局限性在于分类度量是粗粒度的，并且不太可能考虑所有相关的技术特征，特别是当类别边界必然滞后于技术演化时（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B172&#34;&gt;Thompson 和 Fox-Kean 2005&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B155&#34;&gt;Singh 和 Agrawal 2011&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B7&#34;&gt;Arts 等人，2018&lt;/a&gt;）。其次，研究人员可以获取两项专利并测量它们之间的单词重叠（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B7&#34;&gt;Arts et al. 2018&lt;/a&gt;）。然而，这种方法是有限的，因为它仅适用于成对的文档，无法确定专利相对于整个知识体系的位置。&lt;/p&gt;
&lt;p&gt;概念相似性解决了这些限制（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184&#34;&gt;Whalen 等人，2020&lt;/a&gt;）。首先，它允许我们追踪专利在相关知识空间中的精确位置，从而访问知识系统中所有相关的细粒度信息。其次，我们能够精确量化任何专利或专利组相对于任何其他专利或专利组的位置。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn15&#34;&gt;14&lt;/a&gt;第三，随着新知识进入系统，知识的性质和结构不断演变，随着时间的推移重塑 &lt;strong&gt;概念边界&lt;/strong&gt; 和关联。&lt;strong&gt;嵌入使我们能够在专利发布时存在的概念空间内衡量专利相似性，从而摆脱那些滞后的、周期性修订的、会给连续的发明概念空间强加类别划分的分类体系&lt;/strong&gt;。概念距离的所有这些优点都适用于其他知识和文化领域：在这些领域中，我们寻求测量思想、个人、群体或组织之间的距离或相似性，从而拓展跨领域的现有研究并开辟新的理论领域。&lt;/p&gt;
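&lt;p&gt;集合之间的概念距离/相似度可以用质心向量之间的余弦相似度示意。以下纯 Python 片段中的“专利”词向量均为虚构假设：&lt;/p&gt;

```python
import math

def centroid(vectors):
    # 每个集合的质心：各维度取算术平均
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# 三项"专利"各自包含的词向量（二维示意数据）
patent_a = [[0.9, 0.1], [0.8, 0.3]]
patent_b = [[0.85, 0.2], [0.9, 0.15]]
patent_c = [[0.1, 0.9], [0.2, 0.8]]

ca, cb, cc = centroid(patent_a), centroid(patent_b), centroid(patent_c)
print(cosine_similarity(ca, cb) > cosine_similarity(ca, cc))  # True：专利 A 与 B 在概念上更相似
```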
&lt;h4 id=&#34;522-应用&#34;&gt;5.2.2 应用&lt;/h4&gt;
&lt;p&gt;正如我们上面所做的那样，这种&lt;strong&gt;概念相似性的衡量方法最近被用来描述专利数据中的创新空间&lt;/strong&gt;（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184&#34;&gt;Whalen 等人，2020&lt;/a&gt;）。研究人员使用  &lt;strong&gt;doc2vec&lt;/strong&gt;  框架计算了超过 6 亿个专利对的相似度。在生成这些知识相似性度量时，作者还使用这些分数提出了有趣的辅助度量，包括可操作的度量（a）现有技术接近度——专利引用与其自身相似或不相似的现有技术的程度，（b）现有技术同质性——一项专利引用知识空间领域彼此远离的程度，(c) 影响邻近性——一项专利被与其自身相似或不相似的未来专利引用的程度，以及(d) 影响同质性——一项专利通过其前向引用与一组不同的未来专利相关的程度。&lt;/p&gt;
&lt;p&gt;学者们也使用了这一衡量标准，重点关注概念距离。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18&#34;&gt;Burtch 等人 (2021)&lt;/a&gt; 使用概念距离的 &lt;strong&gt;doc2vec&lt;/strong&gt; 实现来调查同行奖励是否会影响在线社区内贡献的新颖性。这里的&lt;strong&gt;新颖性是根据社区成员获奖前后贡献的距离来衡量的&lt;/strong&gt;。作者发现，获奖会导致知识空间内的新颖性减少、利用式（exploitation）行为增多。同样，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74&#34;&gt;Hofstra 等人 (2020)&lt;/a&gt; 使用 Word2Vec 距离度量来捕获科学论文将新颖性引入科学文献的程度，发现来自代表性不足群体的学生为系统引入了最多的新颖性。&lt;/p&gt;
&lt;p&gt;其他人则利用这一度量来刻画公司差异化。在发展中国家微型企业的背景下，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20&#34;&gt;Carlson（2022）&lt;/a&gt;使用 BERT 架构（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35&#34;&gt;Devlin et al. 2019&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B142&#34;&gt;Reimers and Gurevych 2020&lt;/a&gt;）来计算其数据集中所有微型企业的成对余弦距离。通过这些距离，他们发现，在八个发展中国家的 10,000 家微型企业中，差异化与更高的收入和利润相关。同样，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61&#34;&gt;Guzman 和 Li（2023）&lt;/a&gt;基于 Crunchbase 数据，使用 doc2vec 距离来衡量初创公司的创始战略差异化。作者发现了与 &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20&#34;&gt;Carlson (2022)&lt;/a&gt; 类似的结果，即战略差异化程度更高的新公司在早期融资和股权结果方面表现更好。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;53-概念x&#34;&gt;5.3 概念X&lt;/h3&gt;
&lt;h4 id=&#34;531-指标&#34;&gt;5.3.1 指标&lt;/h4&gt;
&lt;p&gt;文档距离的另一个用途是追踪语料库中的任何文档与捕获感兴趣构念 X 的焦点（原型）的相似性，这样的测量将捕获任何观测的 &lt;strong&gt;概念X性&lt;/strong&gt;（Conceptual X-ness）。这种测量的第一步是刻画与我们想要尽可能精确测量的构念相关的概念信息。例如，如果我们想要捕获专利与 &lt;strong&gt;时间&lt;/strong&gt; 或 &lt;strong&gt;几何&lt;/strong&gt; 等概念相关的程度，我们可以构建一个我们认为映射到、定义或关联这些概念的单词列表。对于每个列表，我们计算其质心向量，然后测量任何给定专利距离 &lt;strong&gt;时间&lt;/strong&gt; 和 &lt;strong&gt;几何&lt;/strong&gt; 概念有多远。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn16&#34;&gt;15&lt;/a&gt; 对于附录表 A2 中使用的专利，我们可以看到，相对于头颈约束装置专利，前两项专利更接近&lt;strong&gt;时间&lt;/strong&gt;概念，这与光和时间在概念上的交织程度相符，正如我们所预期的。概念 &lt;em&gt;X&lt;/em&gt; 性度量可用于追踪思想、个人、团体、组织或任何其他相关聚集体的构成。&lt;/p&gt;
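&lt;p&gt;概念X性的测量可以示意为：先对概念词表求质心，再计算文档向量与该质心的余弦相似度。以下纯 Python 片段中的词表与向量均为虚构假设（并非论文使用的真实词表）：&lt;/p&gt;

```python
import math

def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# 假设的二维词向量：第一维偏向"时间"语义，第二维偏向"几何"语义
vecs = {
    'clock': [0.9, 0.1], 'duration': [0.8, 0.2], 'interval': [0.85, 0.1],
    'angle': [0.1, 0.9], 'triangle': [0.2, 0.8],
    'patent_text': [0.7, 0.3],   # 某项专利文本的（质心）向量
}

# 分别为"时间"与"几何"概念词表计算质心
time_centroid = centroid([vecs[w] for w in ['clock', 'duration', 'interval']])
geometry_centroid = centroid([vecs[w] for w in ['angle', 'triangle']])

# 概念X性：专利向量与各概念质心的余弦相似度
time_ness = cosine_similarity(vecs['patent_text'], time_centroid)
geometry_ness = cosine_similarity(vecs['patent_text'], geometry_centroid)
print(time_ness > geometry_ness)   # True：该专利更偏向"时间"概念
```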
&lt;h4 id=&#34;532-应用&#34;&gt;5.3.2 应用&lt;/h4&gt;
&lt;p&gt;最近在一篇论文中使用了这种方法，该论文追踪了雇用女性担任高级领导角色对女性在这些组织中意味着什么的影响（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94&#34;&gt;Lawson 等人，2022&lt;/a&gt;）。作者首先使用 SEC 文件和财报电话会议记录训练了 Word2Vec 嵌入。然后，他们创建并验证了一组 100 个单词来捕捉 &lt;strong&gt;代理概念&lt;/strong&gt; 的含义（例如，有能力、独立、主导），并观察了内部任命高级女性领导前后 &lt;strong&gt;代理概念&lt;/strong&gt; 与 &lt;strong&gt;女性&lt;/strong&gt; 概念之间的距离。作者发现，在女性被任命为高层管理人员之后的时期，&lt;strong&gt;女性&lt;/strong&gt; 的含义在概念空间中与 &lt;strong&gt;代理概念&lt;/strong&gt; 更加接近。作者使用不同的嵌入超参数和维度大小复制了他们的结果，说明了嵌入模型的鲁棒性，前提是维度数至少达到捕获概念空间内语义变化所需的最低限度。&lt;/p&gt;
&lt;p&gt;这里有趣的理论机会包括更深入地参与某些理论传统的可能性：这些理论传统在组织科学以外的领域具有影响力，但由于缺乏以原则性方式量化其理论构念的可行方法，始终游离于我们的领域之外，只能依赖文学式的解释。正如我们所提出的，测量 &lt;strong&gt;概念X性&lt;/strong&gt; 使我们能够以一致、有原则和可复制的方式，扩展对理想形式（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B134&#34;&gt;Plato [Bloom 1968]&lt;/a&gt;）、理想类型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B178&#34;&gt;Weber 2011&lt;/a&gt;）、家族相似性（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B186&#34;&gt;Wittgenstein 2010&lt;/a&gt;）和原型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B147&#34;&gt;Rosch 1973&lt;/a&gt;）等相关理论构念的测量。在这方面，概念 &lt;em&gt;X&lt;/em&gt; 性为大量认知和社会理论打开了大门，使其能够在组织背景下接受实证检验和扩展。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;54-语义转变和漂移&#34;&gt;5.4 语义转变和漂移&lt;/h3&gt;
&lt;h4 id=&#34;541-指标&#34;&gt;5.4.1 指标&lt;/h4&gt;
&lt;p&gt;概念空间使我们能够识别术语的含义如何随着时间和空间的变化而变化。探索概念意义的一种方法是为不同的个人、公司、行业、地理位置或时间段创建独特的嵌入模型，以了解它们之间的含义有何不同（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B148&#34;&gt;Roy 等人，2019 年&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B181&#34;&gt;Welch 等人，2020a&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B182&#34;&gt;b&lt;/a&gt;）。一旦确定了感兴趣的相关差异来源，我们就可以采用相关的语料库（例如，专利、财报电话会议、报纸）并为数据中的每个子语料库训练概念空间。&lt;strong&gt;在我们的专利示例中，我们可能会训练两种嵌入模型，一种对应 1990 年功能性磁共振成像技术发明之前的时期，另一种对应 1990 年之后的时期&lt;/strong&gt;。然后我们可以探索与大脑和神经科学相关的概念的含义如何随着这一创新而改变，例如在功能性磁共振成像发明之前和之后，与不同大脑区域最相关的术语是什么。接下来，我们可以比较不同公司或国家的含义变化有何不同，以及这种变化的格局如何影响所涉及的公司和行业的组织和市场结果。显式动态词嵌入允许嵌入之间具有更大的可比性，但必然会忽略特异的词汇和用法（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63&#34;&gt;Hamilton et al. 2016&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B192&#34;&gt;Zhang et al. 2016&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B190&#34;&gt;Yao et al. 2018&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B103&#34;&gt;Liu et al. 2020&lt;/a&gt;）。这些算法输出带时间戳的词向量，既包含特定时期的语义信息，又在不同历史时期之间保持可比性。&lt;/p&gt;
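&lt;p&gt;追踪语义变迁的一种常见做法是：在两个时期分别训练嵌入，再比较同一个词的最近邻如何变化。以下纯 Python 片段中的两套“时期嵌入”均为虚构的二维示例：&lt;/p&gt;

```python
import math

def nearest_neighbors(word, vecs, k=2):
    # 返回与 word 余弦相似度最高的 k 个其他词
    def sim(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    others = [(w, sim(vecs[word], v)) for w, v in vecs.items() if w != word]
    others.sort(key=lambda x: -x[1])
    return [w for w, _ in others[:k]]

# 两套独立训练的"时期嵌入"（虚构数据，示意 1990 年前后）
pre_1990 = {'brain': [0.9, 0.1], 'anatomy': [0.85, 0.2],
            'imaging': [0.1, 0.9], 'behavior': [0.8, 0.15]}
post_1990 = {'brain': [0.2, 0.9], 'anatomy': [0.8, 0.2],
             'imaging': [0.15, 0.95], 'behavior': [0.7, 0.3]}

print(nearest_neighbors('brain', pre_1990))   # ['behavior', 'anatomy']
print(nearest_neighbors('brain', post_1990))  # ['imaging', 'behavior']
```

比较两次输出即可看到 brain 的最近邻从行为、解剖类词汇转向成像类词汇，示意该词语义随（虚构的）技术冲击而漂移。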
&lt;br&gt;
&lt;h4 id=&#34;54-2-应用&#34;&gt;5.4.2 应用&lt;/h4&gt;
&lt;p&gt;第一批在社会科学背景下使用词嵌入方法的重要论文，正是用这种方法来研究意义（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63&#34;&gt;Hamilton et al. 2016&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19&#34;&gt;Caliskan et al. 2017&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49&#34;&gt;Garg et al. 2018&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski et al. 2019&lt;/a&gt;）。在第一篇论文中，研究人员使用四种语言的六个历史语料库，通过观察概念空间中最近邻单词在过去几十年中的变化，追踪单词含义随时间的演变（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63&#34;&gt;Hamilton et al. 2016&lt;/a&gt;）。使用 Word2Vec 嵌入，他们追踪了 &lt;strong&gt;同性恋（gay）&lt;/strong&gt; 概念的含义：如何从 1900 年代围绕 &lt;strong&gt;“愚蠢”&lt;/strong&gt;、&lt;strong&gt;“甜蜜”&lt;/strong&gt; 和 &lt;strong&gt;“开朗”&lt;/strong&gt; 等术语，转变为 1950 年代围绕 &lt;strong&gt;“嬉闹”&lt;/strong&gt;、&lt;strong&gt;“机智”&lt;/strong&gt; 和 &lt;strong&gt;“聪明”&lt;/strong&gt; 等术语，并最终在 20 世纪 90 年代落脚于女同性恋、双性恋和同性恋等术语的含义附近。在另一篇论文中，研究人员考察了词嵌入中的刻板印象关联与当代社会经验数据之间的关系（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19&#34;&gt;Caliskan et al. 2017&lt;/a&gt;）。例如，他们追踪了职业的性别刻板印象，发现职业在嵌入中与女性的关联程度，同女性在相应职业中的劳动参与率相关。在另一项研究中，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49&#34;&gt;Garg 等人（2018）&lt;/a&gt;使用预训练的 Google News Word2Vec 模型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B4&#34;&gt;Google 2013&lt;/a&gt;）量化了美国 100 多年历史中的性别和种族刻板印象，阐明了随着时间推移，不同的形容词和职业如何与不同人群（例如男性与女性、白人与亚裔与西班牙裔）产生或强或弱的关联。&lt;/p&gt;
&lt;p&gt;最近通过词嵌入追踪含义的工作，已经使用这种方法更深入地研究了特定的上下文。一项研究使用 19 世纪第一人称叙述的语料库，追踪黑人和白人、男性和女性的交叉身份如何映射到五个社会制度领域，包括政治、经济、文化、家庭领域和权威关系（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B120&#34;&gt;Nelson 2021&lt;/a&gt;）。&lt;strong&gt;举论文中的一个例子，作者测量了与“精致”概念的距离，发现它与白人女性的联系最密切，而与黑人男性的联系最少&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;在其他工作中，研究人员利用这种方法来衡量政治领导人话语中的 &lt;strong&gt;集体意向性&lt;/strong&gt;（人们参与集体推理和行动的能力），并比较共和党和民主党领导人如何以不同的方式动员集体意向性（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B85&#34;&gt;Kirgil and Voyer 2022&lt;/a&gt;）。他们通过创建一个由复数代词（我们、我们的）、复数集体指称（国家名称）和复数名词（人们）构成的复合词表来测量集体意向性。然后，借助词嵌入模型，他们找出与集体意向性向量最接近的术语，从而比较不同领导人动员集体意向性的不同方式。总的来说，这些意义研究表明：就语言为我们提供了解文化的窗口而言（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B55&#34;&gt;Goldberg et al. 2016&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165&#34;&gt;Srivastava et al. 2018&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26&#34;&gt;Corritore et al. 2020&lt;/a&gt;），嵌入模型为我们提供了透过这扇窗所见景象的一种独特刻画。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;55-文化和知识连续性中的概念地位&#34;&gt;5.5 文化和知识连续性中的概念地位&lt;/h3&gt;
&lt;h4 id=&#34;551-指标&#34;&gt;5.5.1 指标&lt;/h4&gt;
&lt;p&gt;另一种新颖的测量方法，可以通过追踪概念相对于某个感兴趣的概念维度的位置来创建。如前所述，嵌入模型可用于求解类比推理任务，例如 &lt;strong&gt;“国王”-“男人”+“女人”=“女王”&lt;/strong&gt;。这一架构可用于定义概念空间内任何感兴趣的维度。在国王-王后的例子中，性别维度通过“男人”-“女人”和“国王”-“女王”向量进行操作化。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski 等人（2019）&lt;/a&gt;详细介绍了如何在概念空间内构建此类维度。首先，研究人员需要确定感兴趣的维度；在这里的例子中，我们将把不同的概念投影到男性-女性这一性别维度上。为此，我们先确定定义性别维度的相关术语，这里使用集合 [&amp;lsquo;man&amp;rsquo;, &amp;lsquo;him&amp;rsquo;, &amp;lsquo;he&amp;rsquo;, &amp;lsquo;male&amp;rsquo;, &amp;lsquo;men&amp;rsquo;] 和 [&amp;lsquo;woman&amp;rsquo;, &amp;lsquo;her&amp;rsquo;, &amp;lsquo;she&amp;rsquo;, &amp;lsquo;female&amp;rsquo;, &amp;lsquo;women&amp;rsquo;]。&lt;strong&gt;然后我们计算不同概念在这条男性-女性概念轴（维度）上的正交投影。&lt;/strong&gt; 在线附录中的图 A4 将每个概念投影到 &lt;strong&gt;男性-女性概念轴&lt;/strong&gt; 上：投影值越负，表明与女性气质的关联越强；投影值越正，表明与男性气质的关联越强。如图 A4 所示，这些投影与关于这些概念性别属性的一般直觉一致，使我们能够明确说明每个概念相对于其他概念在该维度上的位置。正如预期，&lt;strong&gt;军事&lt;/strong&gt; 和 &lt;strong&gt;农业&lt;/strong&gt; 与 &lt;strong&gt;男性气质&lt;/strong&gt; 的联系最为密切，而 &lt;strong&gt;卫生棉条&lt;/strong&gt; 和 &lt;strong&gt;口红&lt;/strong&gt; 则与女性气质的联系最为密切。按照这一程序，学者们现在可以测量任何概念在任何感兴趣的维度上、任何文本丰富的时空背景中的位置。此外，不同语言的语料库可以独立训练后对齐，或同时训练并对齐，以方便跨国分析（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B81&#34;&gt;Johnson et al. 2017&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116&#34;&gt;Milbauer et al. 2021&lt;/a&gt;）。&lt;/p&gt;
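&lt;p&gt;上述“构造概念轴、再做投影”的步骤，可以用一个极简的 Python 片段来演示。下面的 3 维向量为手工设定的玩具数据（假设值，并非论文中的真实嵌入），仅用于说明几何操作本身：&lt;/p&gt;

```python
# 沿 Kozlowski et al. (2019) 的思路构造“男性-女性”概念轴，
# 并把概念向量投影到该轴上。向量均为手工设定的假设数据。
import numpy as np

vecs = {
    "man":      np.array([1.0, 0.0, 0.0]),
    "he":       np.array([0.9, 0.1, 0.0]),
    "woman":    np.array([0.0, 1.0, 0.0]),
    "she":      np.array([0.1, 0.9, 0.0]),
    "military": np.array([0.8, 0.1, 0.5]),
    "lipstick": np.array([0.1, 0.8, 0.5]),
}

# 概念轴 = 男性词向量均值 - 女性词向量均值
axis = np.mean([vecs["man"], vecs["he"]], axis=0) - np.mean(
    [vecs["woman"], vecs["she"]], axis=0)

def project(word):
    """返回 word 在男性-女性轴上的标量投影：正值偏男性，负值偏女性。"""
    return float(np.dot(vecs[word], axis) / np.linalg.norm(axis))

p_military = project("military")
p_lipstick = project("lipstick")
```

&lt;p&gt;真实研究中，只需把玩具向量换成训练好的嵌入，并用论文给出的完整词表构造轴向量即可。&lt;/p&gt;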
&lt;p&gt;&lt;strong&gt;双极概念维度的投影方法可以进一步扩展，用于锚定具有多种含义的低维子空间，其中单词和概念可以被绘制并理解为这些含义的混合&lt;/strong&gt;。具体做法是，从理论上选择一组“原型”，即具有已知且被广泛共享的含义的极点，并在这些极点所定义的子空间中绘制所有相关单词或概念。例如，在对一家新的信息技术创业企业进行分类时，人们可以问：它在由 Uber、亚马逊、谷歌或比特币所刻画的空间中处于什么位置（Breiman 1994，Eugster 2012，Damle 和 Sun 2017）。&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;552-应用&#34;&gt;5.5.2 应用&lt;/h4&gt;
&lt;p&gt;这一测量方法最初被提出并用于研究 20 世纪和 21 世纪社会阶层的演变（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski 等人，2019&lt;/a&gt;）。他们研究了在 20 世纪出版的数百万本书的文本上训练的嵌入，按照上述程序操作化阶级的各个维度，试图了解社会阶级的底层维度在 20 世纪如何变化。为此，他们提出了以下理论化的 &lt;strong&gt;概念轴（维度）&lt;/strong&gt;：富裕（富人-穷人）、教育（受教育-未受教育）、修养（有教养-无教养）、地位（有声望-无声望）、道德（善-恶）、雇佣（雇主-雇员）和性别（男人-女人），并在 20 世纪的每个十年分别加以嵌入。然后，他们把不同类别的概念（例如音乐风格、体育和职业）投影到这些维度上，以了解这些概念在整个世纪中如何演变和发展。研究人员已将这种方法应用于研究健康与道德（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B6&#34;&gt;Arseniev-Koehler et al. 2022&lt;/a&gt;）、政治意识形态（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B171&#34;&gt;Taylor and Stoltz 2021&lt;/a&gt;）和地位（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B129&#34;&gt;Peng et al. 2021&lt;/a&gt;）等背景下的其他类型的文化关联。&lt;/p&gt;
&lt;p&gt;研究人员不仅将概念投影到这些概念轴（维度）上，还将整个文档投影到这些维度上，进一步拓展了测量的可能性（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B170&#34;&gt;Taylor 和 Stoltz 2020&lt;/a&gt;）。此外，尽管以往的测量依赖研究人员事先指定感兴趣的连续体维度，最近的工作已转向自动识别这些连续体（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116&#34;&gt;Milbauer et al. 2021&lt;/a&gt;）。&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116&#34;&gt;Milbauer 等人（2021）&lt;/a&gt;利用 Reddit 社区的内容，创建了一个无监督程序来识别社区中的多个意识形态极点，使他们能够超越静态的左-右意识形态维度，发现现代话语中发挥作用的许多两极分化和意识形态分歧之轴。人们可以设想在许多组织环境中使用这种方法，来识别团队、小组、单位或部门之间存在的多种潜在冲突来源。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;56-概念维度&#34;&gt;5.6 概念维度&lt;/h3&gt;
&lt;p&gt;之前，我们讨论了研究人员如何考察关键术语在相关文化维度上的位置，例如不同概念在性别维度上的位置差异。然而，这并不是概念轴（维度）的唯一用途：概念空间还允许我们测量和理解相关维度本身如何相互关联。该测量的一个扩展，是使用空间内已编码的维度并将它们相互比较。例如，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89&#34;&gt;Kozlowski 等人（2019）&lt;/a&gt;利用他们已建立的阶级维度，追踪整个 20 世纪各维度之间关系的演变，例如表明随着世纪的推进，富裕与教育的关系变得更加密切，而与修养（cultivation）的关系则逐渐松脱。通过这种方式，组织学者可以理解相关维度之间的关系在不同的概念空间中可能有何差异。例如，学者可以研究不同文化维度在组织或行业内部和之间紧密或松散耦合的程度。&lt;/p&gt;
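&lt;p&gt;“维度之间相互比较”的做法可以概括为：对每个年代分别构造两条概念轴，再计算它们之间的余弦相似度。下面是一个基于 numpy 的最小示例（各年代的轴向量为假设数据，真实研究中应由逐年代训练的嵌入构造）：&lt;/p&gt;

```python
# 追踪两个概念轴之间的关联如何随年代变化：对每个年代
# 分别计算“富裕轴”与“教育轴”向量的余弦相似度。
# 轴向量均为假设数据。
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

affluence_axis = {"1900s": np.array([1.0, 0.2, 0.0]),
                  "1990s": np.array([1.0, 0.8, 0.1])}
education_axis = {"1900s": np.array([0.1, 1.0, 0.2]),
                  "1990s": np.array([0.6, 1.0, 0.2])}

# 相似度随年代上升，即“富裕与教育的关联变密切”的几何含义
trend = {d: cos(affluence_axis[d], education_axis[d])
         for d in ("1900s", "1990s")}
```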
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;六讨论&#34;&gt;六、讨论&lt;/h2&gt;
&lt;p&gt;最后，我们简要讨论了一些利用嵌入模型进行思考的新兴方法，然后讨论了我们认为理论、方法论和组织的有价值的机会，这些机会源于将这些模型理解为概念空间的细粒度表示。这个讨论必然是说明性的，但暗示了现在这些精致的意义模型的可操作性的广泛可能性。&lt;/p&gt;
&lt;h3 id=&#34;61-词嵌入方法的富有成果的扩展&#34;&gt;6.1 词嵌入方法的富有成果的扩展&lt;/h3&gt;
&lt;p&gt;词嵌入的底层计算架构最近经历了若干扩展，可以在与之前讨论不同的方向上动员组织研究。我们简要提及三个，并在在线附录中提供更详细的描述。首先，概念和语言的层次结构在“平直”的欧几里得几何中很难得到体现，用标准嵌入捕获它们需要许多难以解释的维度。然而，层次结构可以在负曲率的双曲嵌入中得到原生表示（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B90&#34;&gt;Krioukov et al. 2010&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B126&#34;&gt;Papadopoulos et al. 2012&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B22&#34;&gt;Chamberlain et al. 2017&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B121&#34;&gt;Nickel and Kiela 2017&lt;/a&gt;），为探索复杂现代组织的交叉层次结构提供了新的测量可能性。例如，将公司名称嵌入双曲空间，将能够在商业新闻语料库中直接发现典型的“中心公司”，并与所有其他公司进行比较。额外的双曲维度将揭示子层次结构，反映商业评论员所持有的概念和比较价值的不同维度。&lt;/p&gt;
&lt;p&gt;其次，语言建模的深度学习方法为词嵌入增加了关键的上下文敏感性。考虑 BERT（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35&#34;&gt;Devlin et al. 2019&lt;/a&gt;）和 GPT 系列模型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B140&#34;&gt;Radford et al. 2019&lt;/a&gt;）这样的大规模模型：它们使用名为“注意力”的神经网络机制来识别影响焦点词含义的上下文词（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B176&#34;&gt;Vaswani et al. 2017&lt;/a&gt;），并组装成一种称为 Transformer 的架构，可以将问题转换为答案，将文本转换为译文，将请求转换为响应。这类模型产生的表示可以被描述为上下文嵌入：每个单词不再由单个向量表示，而是由一团向量云表示，其中每个向量对应该单词在不同上下文中的含义。出现在“Google”上下文中的“apple”与出现在“orange”上下文中的“apple”具有不同的向量值。这些模型极大地提高了预测能力，并进一步扩展了我们对概念空间进行精确建模的能力，但代价是更高的复杂性和计算量。&lt;/p&gt;
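&lt;p&gt;“向量云”式的上下文敏感表示，可以用一个极度简化的玩具模型来体会：把一个词某次出现的表示取为其上下文词向量的平均。这只是概念演示（向量为假设数据），并非 BERT 或 GPT 的真实计算：&lt;/p&gt;

```python
# 上下文嵌入的极简示意：同一个词在不同语境中获得不同表示。
# 这里用“该词每次出现时上下文词向量的平均值”来粗略模拟
# BERT 式的“向量云”，仅为概念演示，并非真实的 Transformer。
import numpy as np

static = {"google": np.array([1.0, 0.0]),
          "orange": np.array([0.0, 1.0]),
          "buy":    np.array([0.3, 0.3])}

def contextual(context_words):
    """用上下文词向量的平均近似某次出现的上下文表示。"""
    return np.mean([static[w] for w in context_words], axis=0)

apple_tech = contextual(["google", "buy"])    # 科技语境下的 apple
apple_fruit = contextual(["orange", "buy"])   # 水果语境下的 apple
gap = float(np.linalg.norm(apple_tech - apple_fruit))
```

&lt;p&gt;两个语境下的表示相距 gap 之远：静态嵌入只有一个 apple 向量，而上下文模型为每次出现各给出一个向量。&lt;/p&gt;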
&lt;p&gt;最后，嵌入架构可以扩展到按序列或更高维上下文排列的任意符号集。例如，图像嵌入已被用来衡量抽象艺术作品的新颖性和创造力（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B10&#34;&gt;Banerjee and Ingram 2022&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B11&#34;&gt;Banerjee and Kaplan 2022&lt;/a&gt;），分析警方嫌犯照片（mugshots），并识别出与法官在保释听证会上拒绝保释相关的、先前未被概念化的涌现特征（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B107&#34;&gt;Ludwig 和 Mullainathan 2022&lt;/a&gt;）。音乐（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B101&#34;&gt;Liang et al. 2020&lt;/a&gt;）、音频剪辑（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B70&#34;&gt;Hershey et al. 2017&lt;/a&gt;、&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B189&#34;&gt;Xie and Virtanen 2019&lt;/a&gt;）和视频（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B191&#34;&gt;Zellers et al. 2021&lt;/a&gt;）的多维空间，则是使用 audio2vec（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B168&#34;&gt;Tagliasacchi et al. 2020&lt;/a&gt;）、signal2vec（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B118&#34;&gt;Nalmpantis 和 Vrakas 2019&lt;/a&gt;）和 video2vec（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B62&#34;&gt;Habibian 等人 2017&lt;/a&gt;）等工具构建的，为组织学者接触代表组织生活视听体验的新型媒体打开了大门。&lt;/p&gt;
&lt;p&gt;最近对双曲线、上下文、图像和音频嵌入的扩展表明，嵌入模型的底层计算框架的持续改进和扩展将继续下去，为组织科学中持续的实证、测量和理论创新奠定了基础。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;62-词嵌入和组织理论&#34;&gt;6.2 词嵌入和组织理论&lt;/h3&gt;
&lt;p&gt;在理论层面上，将嵌入模型理解为概念空间的有原则的细粒度表示，有可能刺激新的理论发展并完善现有理论。例如，意义研究中的经典论述影响了文学理论和文化社会学等其他领域，却未能在组织科学中站稳脚跟。自 20 世纪初德·索绪尔（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B32&#34;&gt;de Saussure 1986&lt;/a&gt;）的著作带来语言学的结构转向以来，许多人都试图将意义在组织和社会生活中的作用理论化。列维-斯特劳斯汇集了来自全球各地多样而广泛的民族志，从世界各文化表面的混乱中提炼出深层的文化秩序，并认为复杂的意义产生于有意义元素的组合（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B99&#34;&gt;列维-斯特劳斯 2016&lt;/a&gt;）。福柯将话语与权力如何紧密相连、权力与知识如何结成自我强化的联盟加以理论化（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B46&#34;&gt;Foucault 2012&lt;/a&gt;）。布迪厄则将 &lt;em&gt;惯习（habitus）&lt;/em&gt; 的概念阐述为“持久的、可转换的性情系统，即被结构化的结构，倾向于作为起结构化作用的结构而发挥功能，也就是作为实践的生成与结构化原则”（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B16&#34;&gt;Bourdieu 1977，第 72 页&lt;/a&gt;）。尽管这些理论很有吸引力，但迄今为止它们只能得到松散而间接的检验。没有可靠的实证立足点，它们就无法在管理和组织理论中取得突出地位。然而，概念空间的实证操作化如今使得参与和扩展这些文化理论的奠基性著作变得可行，其中许多构造现在可以得到有据可依的测量。嵌入模型将使这些理论与管理和组织理论重新关联起来。&lt;/p&gt;
&lt;p&gt;我们还希望嵌入模型能够推动对现有理论框架更深入的研究和锐化。一组能够从中受益的文献与知识相关。组织学者可获得的大部分知识都被编码在语言的符号概念系统中，如今可以通过日益丰富的文本数据源来获取，并由嵌入模型的概念空间加以表示。材料科学领域的最新工作已经使用此类模型有效预测未来的知识发现，比科学家实际提出这些发现早了几十年（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B173&#34;&gt;Tshitoyan 等人，2019 年&lt;/a&gt;；&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B161&#34;&gt;Sourati 和 Evans，2021 年&lt;/a&gt;）。其他工作表明，这些发现可以推广到生物和物理科学（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B154&#34;&gt;Shi 和 Evans 2023&lt;/a&gt;）。概念空间的明确表示，使我们得以详细考察整个社会系统中知识的特征和结构。一方面，这些模型像望远镜，打开了知识的天空，使其大尺度结构变得可见，以供研究、理论发展和完善；另一方面，这些模型又像显微镜，使我们能够更深入地观察构成更大知识系统的意义的原子结构。测量上的这一进步，将丰富对界定人类和组织经验的大型多维知识系统中各种机制的检验；它还将使我们能够递归地评估管理与组织学术研究自身的知识，从而刺激创新。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;63-词嵌入和实证研究&#34;&gt;6.3 词嵌入和实证研究&lt;/h3&gt;
&lt;p&gt;在&lt;em&gt;实证层面&lt;/em&gt;，词嵌入模型可以提高组织科学不同领域的测量保真度，从而在实证结果与理论主张和框架之间实现更好的映射。我们用团队和群体内部多样性研究的例子来说明这一点。据说，不同群体所获得的许多好处是由于群体中的个人代表问题和解决方案的方式不同而产生的（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B75&#34;&gt;Hong 和 Page 2004&lt;/a&gt;）。由具有不同方法的个人组成的小组将更好地执行各种任务，因为他们将拥有更广泛的知识、观点和可供借鉴的信息资源（Cox et al. 1991，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B27&#34;&gt;Williams&lt;/a&gt; and &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B185&#34;&gt;O&amp;rsquo;Reilly 1998&lt;/a&gt;）。然而，由于测量困难，对团队多样性的研究很少测量问题和解决方案空间的不同概念。相反，它假设解决问题的团队成员的身份多样性（人口、文化、种族或经验）与其功能多样性（团队成员如何代表和解决问题）之间存在联系（Nisbett 和 Ross 1980，Hong&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B122&#34;&gt;和&lt;/a&gt;Page &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B75&#34;&gt;2004&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175&#34;&gt;van Dijk 等人，2017&lt;/a&gt;）。&lt;/p&gt;
&lt;p&gt;由于缺乏高保真方法来获取团队成员在问题和解决方案概念空间中的位置，身份多样性与功能多样性之间的联系通常只是被假定。用身份多样性来操作化研究、却用功能多样性来理论化，这种脱节的一个重要后果是：虽然理论积极使用功能多样性的思想和术语（其本质是几何的和高维的），但检验所依赖的却是与身份类别成员资格相关的集合论式概念。我们预测，这可以解释团队多样性文献结果中的大部分歧义（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175&#34;&gt;van Dijk et al. 2017&lt;/a&gt;），因为研究设计忽视了功能多样性与身份多样性之间的同源性问题。然而，诸如概念广度之类的测量可以阐明这一理论交叉点上悬而未决的问题。我们现在可以确定（1）团队的基础概念广度，以及（2）这种基础广度在多大程度上可能驱动结果。回答这些问题可以为许多分析层面的研究提供信息，从个人和团队的成功（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B164&#34;&gt;Srikanth 等人，2016 年&lt;/a&gt;，&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175&#34;&gt;van Dijk 等人，2017 年&lt;/a&gt;）到公司和行业绩效（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B144&#34;&gt;Roberson 等人，2017 年&lt;/a&gt;）。我们希望这些示例能够激发在组织研究领域生成细粒度意义测量的新可能性。&lt;/p&gt;
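&lt;p&gt;文中提到的“概念广度”可以有多种操作化方式。下面给出一种最简单的草图：把团队概念广度定义为成员向量两两余弦距离的均值（成员向量为假设数据，真实研究中可用每位成员文本的嵌入均值）：&lt;/p&gt;

```python
# 团队“概念广度”的一种操作化：成员向量两两余弦距离的均值。
# 成员向量均为假设数据。
import numpy as np
from itertools import combinations

def breadth(members):
    """成员向量两两 (1 - 余弦相似度) 的均值，越大表示概念覆盖越分散。"""
    dists = []
    for a, b in combinations(members, 2):
        c = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        dists.append(1.0 - c)
    return sum(dists) / len(dists)

# 概念上同质的团队 vs 概念上多样的团队（玩具向量）
homogeneous = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([1.0, 0.0])]
diverse = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]

b_homo = breadth(homogeneous)
b_div = breadth(diverse)
```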
&lt;br&gt;
&lt;h3 id=&#34;64-组织内部的词嵌入&#34;&gt;6.4 组织内部的词嵌入&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;最后，我们认为词嵌入方法将对我们所研究的组织本身产生影响&lt;/strong&gt;。我们以劳动力市场为背景，说明嵌入方法可能如何塑造组织行为。从招聘到工作设计，从培训到晋升，人力资源管理的一个核心挑战是有效地将个人与组织内的角色、工作、情境和任务相匹配（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B183&#34;&gt;Weller et al. 2019&lt;/a&gt;）。随着匹配质量的提高，各种绩效指标也会提高，包括工作满意度（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B8&#34;&gt;Ashforth 和 Saks 1996&lt;/a&gt;）、个人生产力（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B125&#34;&gt;Paauwe 2009&lt;/a&gt;）和组织绩效（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B38&#34;&gt;Dyer 和 Reeves 1995&lt;/a&gt;）。有效匹配的一个难题在于各个维度在匹配中的重要程度不同：在一家公司中，技能可能最为重要，而在另一些公司中，关键可能是文化契合度、态度、技能和经验的相互作用。由于嵌入模型捕获了所有这些维度，管理者可以为每个相关维度嵌入不同的原型描述，同时嵌入个人资料和其他相关通信（例如电子邮件、Slack 消息等），以衡量每个人与每个相关维度之间的匹配接近度。这样做可以让管理者更好地识别高维匹配及其对员工、社区和公司绩效的影响。&lt;/p&gt;
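&lt;p&gt;上述“把角色原型描述与个人资料都嵌入、再比相似度”的匹配思路，可以用如下草图表示。角色、员工及其向量均为假设数据，真实场景中应替换为相应文本的嵌入：&lt;/p&gt;

```python
# 人-岗匹配的嵌入式示意：把“角色原型描述”与“员工资料”都表示
# 为向量，再用余弦相似度为每位员工找出最匹配的角色。
# 所有名称与向量均为假设数据。
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

roles = {"data_analyst": np.array([0.9, 0.1, 0.2]),
         "sales":        np.array([0.1, 0.9, 0.3])}
employees = {"tyler": np.array([0.8, 0.2, 0.1]),
             "terry": np.array([0.2, 1.0, 0.4])}

def best_role(name):
    """返回与该员工向量余弦相似度最高的角色名。"""
    return max(roles, key=lambda r: cos(employees[name], roles[r]))
```

&lt;p&gt;同一框架下，为“文化契合”“技能”“经验”各嵌入一个原型描述，即可得到逐维度的匹配分数。&lt;/p&gt;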
&lt;p&gt;嵌入模型还能为人力资源的宏观管理提供新的视角（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B183&#34;&gt;Weller et al. 2019&lt;/a&gt;）。大型组织的相关信息分散存储在人力资源经理、一线经理、员工、同事和外部招聘人员那里，却无法集中访问。通过嵌入，组织可以从所有数字痕迹的文本（电子邮件、聊天、职位描述、正式报告、绩效管理记录等）构建概念空间。这样做并使用相似性分析，将使公司能够绘制并了解相关人力资本在整个公司中的分布位置。管理人员可以利用这些系统，准确了解任何员工的概念位置与任何给定的公司需求相距多远。这不仅可以为招聘、雇用和员工流动等流程提供信息，还可以为培训、社会化、工作设计和公司重组提供信息。因此，在劳动力市场和组织适应的背景下，嵌入模型可以产生有用的创新。人们还可以设想许多其他组织实践和结构从这些模型及其测量可能性中受益，包括产品设计、市场分析和战略生成。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;结论&#34;&gt;结论&lt;/h2&gt;
&lt;p&gt;我们同意 &lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65&#34;&gt;Hannan 等人（2019&lt;/a&gt;，第 2 页）的观察：考虑到概念和分类对几乎所有人类行为和社会互动的中心地位，人们对概念如何运作的关注如此之少，实在令人惊讶。现代组织内部及其周围进行的许多活动，都需要概念信息的激活和传播：当一个人解决新问题、提出新想法或与他人合作时，这种激活与传播就会发生。从饮水机旁的闲聊，到重构全球资本主义秩序或将人类送上火星，概念及其所嵌入的概念空间都发挥着核心而关键的作用。&lt;/p&gt;
&lt;p&gt;正如本文所示，我们现在拥有一系列重要的工具，可以为广泛而深入的理论想象和实证研究打开&lt;strong&gt;概念世界&lt;/strong&gt;和&lt;strong&gt;概念空间&lt;/strong&gt;。我们希望本文能够激发对嵌入可以提供信息的大量问题和理论的学术探索。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>Aceves, Pedro, and James A. Evans. &ldquo;<strong>Mobilizing conceptual spaces: How word embedding models can inform measurement and theory within organization science.</strong>&rdquo; <em>Organization Science</em> (2023).</p>
<br>
<h2 id="摘要">摘要</h2>
<p>词嵌入模型是一种表示多维概念空间的强大方法，在这种空间中，所传达的概念可以相互关联、组合和竞争。此类模型代表了机器学习的最新进展，使学者能够利用大规模文本数据中局部与全局的单词共现，以最小的语义失真高效地编码复杂的意义系统。尽管词嵌入的使用有可能扩大组织科学中的理论可能性，但嵌入对于组织学者来说在很大程度上仍是未知的，尚未发挥出应有的潜力。我们的目标是：通过为用户提供在研究中调动该方法的实用路线图，并为开展此类研究的学者提供理论指导，来展示嵌入模型在组织科学中的前景。我们首先明确定义 <strong>概念</strong> 和 <strong>概念空间</strong>，然后展示如何使用词嵌入模型来表示和测量它们，并指出该方法的优点和缺点。接着，我们提供一组嵌入测量及其理论解释和灵活的扩展。我们的目标是从词嵌入的技术处理中提炼概念，并将其置于可实践的理论框架中，以加速此类研究。</p>
<p><br><br></p>
<h2 id="一介绍">一、介绍</h2>
<p>过去十年，文本作为数据的计算使用在组织科学中显着增长（Hasan 等人，2015 年；Goldberg 等人，2016 年；Srivastava 等人，2018 年；Hannigan 等人，2019 年）。这种增长的主要原因是文本编码的概念信息赋予个人、组织、经济和社会行为以意义（Evans 和 Aceves 2016，Gentzkow 等人 2019），并且在过去十年中，来自组织环境的文本数据急剧增长，大大提高了文本的可用性。然而，文本中编码的 <strong>概念意义</strong> 本质上是高维的，这使得降低概念复杂性成为研究文本的学者的中心任务。<strong>词嵌入模型是由计算机科学家和语言学家开发的一个新兴工具系列，用于文本信息降维，以此提取概念及其数字表示</strong>。词嵌入技术的发展使组织科学家依赖于文本数据进行理论构造， 相比之前，数据中信息的保真度更高，由此文本数据与组织研究交叉场景形成了新的理论研究路线。尽管词嵌入模型在组织科学之外得到广泛使用，但由于组织科学领域的学者缺乏对词嵌入技术的理解， 不知如何将它们纳入理论发展过程的原则框架，词嵌入模型对于理论发展的价值仍然被掩盖。</p>
<p><strong>词嵌入模型建立在高效的神经网络架构之上，并通过将复杂的语义系统有效编码到具有最小失真的稠密几何空间中，彻底改变了语义分析</strong>。这些模型代表了数十到数百个维度的空间中的语义，相对于语言中的单词和概念的数量来说，这个维度较低； 但相对于正式社会和文化理论家之前试图呈现概念信息的两到三个维度来说，这个维度却很高（奥斯古德 1964 年，史密斯-洛文和海斯 1988 年）。出于组织科学的目的，这些嵌入模型创建了社会系统中个体所持有的集体知识的 <strong>数字替身</strong> ， 嵌入可以解决文化上隐含类比（Mikolov et al. 2013b），回答文化偶然问题（Devlin et al. 2019，Radford et al. 2022），并预测未来的知识发现（Tshitoyan等人 2019；Sourati 和 Evans 2021）。组织科学长期以来一直借鉴人工智能（AI）的表征概念， 在这里，我们使用人工智能的表示机制来增强组织理论研究（Csaszar 和 Steinberger 2022）。</p>
<p>然而，由于神经网络复杂且难以理解的黑盒特性，围绕神经嵌入和人工智能方法对理论发展的价值存在争议。尽管预测能力很强，但此类方法往往缺乏可解释性（Knight 2017，Leavitt et al. 2021）。<strong>在组织科学领域中，学者对这项技术仍缺乏以下几方面的理解：</strong></p>
<ul>
<li><strong>对于嵌入何时成为组织科学有用的方法论选择</strong></li>
<li><strong>如何在既定认识论标准内证明使用“复杂”神经嵌入方法的合理性</strong></li>
<li><strong>如何在各种嵌入方法中进行选择</strong>（例如，静态词嵌入与上下文嵌入、预训练嵌入与自定义嵌入）</li>
<li><strong>使用嵌入进行研究的适当步骤以及评估嵌入研究的相关标准</strong></li>
<li>最值得注意的是，研究界，特别是研究组织认知、文化、知识和意义的学者，似乎并不清楚嵌入方法 <strong>如何融入将方法论选择与理论发展联系起来的原则性框架</strong></li>
</ul>
<br>
<p>我们的目的是通过两项贡献来解决这些问题。</p>
<p><strong>首先，我们的目标是提供一个理论指南，为嵌入模型提供一个原则性的概念框架，学者可以使用该框架为他们的模型注入意义，并在理论发展过程中运用这些模型。我们的主要论点是：词嵌入模型中的每个向量代表一个概念，整个嵌入模型代表生成文本数据的社会系统的概念空间</strong>。嵌入模型所表示的概念空间是多维空间，从规范和知识到想法和发明的各种概念在其中相互关联。这个框架使组织学者能够利用嵌入模型的概念空间，与组织科学的许多领域建立联系。例如：知识基础观关注不同公司对该空间的差异化覆盖（Grant 1996）；组织理论家在描述规范和制度时诉诸这一空间（Scott 2003）；类别学者在决定把一个对象归入哪个概念时援引它（Pontikes 和 Barnett 2015）；创新学者在测量发现与发明的新颖性时直接将其理论化（Fleming 和 Sorenson 2001，2004）；团队研究人员则试图了解成员在该空间中的不同位置如何影响创造力、协调和绩效（Srikanth 等人，2016）。由于我们以 <strong>概念</strong> 和 <strong>概念空间</strong> 为中心的理论框架可以推广到组织理论的许多背景，我们希望嵌入模型所支持的研究能够促进这些子领域之间更深入、更持久的对话。</p>
<p><strong>其次，我们的目标是为利用嵌入模型进行理论发展提供实用的路线图</strong>。在此过程中，我们引导读者完成使用专利摘要语料库来实现词嵌入模型的过程，以表示现代技术创新的概念空间。我们解释了研究人员需要设置的模型参数，并逐步完成了他们应该采取的验证步骤，以评估模型是否有效地代表了他们感兴趣的概念空间，并提供了方法附录，其中包含实现所讨论的所有内容所需的代码。在注意到嵌入模型的可供性的同时，我们还讨论了它们不断发展的局限性，并提出了它们何时不适合组织分析的建议。然后，我们展示嵌入模型如何实现依赖于概念和概念空间的构造的理论化和测量。</p>
<br>
<p>我们概述了两大类词嵌入使用方法：</p>
<ul>
<li><strong>分析集合内/集合间的关系度量</strong>，用于追踪相关分析集合内部以及集合之间的概念关系，帮助我们测量概念广度、概念距离和概念相似性</li>
<li><strong>意义及其维度</strong>，我们提出了四种衡量标准，为了解意义及其与组织的关系提供了不同的窗口。为找出这些测量机会的理论可能性，我们重点介绍了一些研究进展。</li>
</ul>
<p><strong>本论文的一个核心主张是：对于不同广度和深度的组织研究情境，词嵌入工具如今使我们能够表示其概念空间，并且比以前更精细地呈现细节</strong>。有鉴于此，我们的目标是展示嵌入模型如何在与组织科学家相关的领域中操作化概念空间，使研究人员能够扩展和完善现有理论。我们希望这份理论指南和实践路线图能够促进组织科学内部的理论扩展，而这一扩展的起点，正是文本数据获取渠道的扩大及相应计算分析工具的发展（Kovács 等人 2013；Goldberg 等人 2016；Hannigan 等人 2019；Guo 等人 2020）。</p>
<p><br><br></p>
<h2 id="二概念和概念空间">二、概念和概念空间</h2>
<p>概念是人类生活的一个基本特征，我们的日常思维很大程度上依赖于它们所代表的信息，使我们能够对周围的人、物体和事件进行分类，并将这些信息传达给其他人（Murphy 2002；Bergen 和 Feldman 2008 年； Cassanto 和 Lupyan，2015 年）。概念是将我们的精神世界粘合在一起的粘合剂（Murphy 2002），赋予精神和物质体验以意义（Hannan et al. 2019）。<strong>在认知科学和心理学的语言中，概念是“事物类别的「心理表征」”（Murphy 2002）。</strong></p>
<p>概念有两大功能：分类和交流（Medin and Rips 2005），这两种功能都需要语言的帮助。实际上，我们通过在语言中分配一个单词或短语来表示一个稳定概念的信息内容。这就是为什么我们通过说出或写出 “<em><strong>manager</strong></em>” 一词来提及经理的概念，从而唤起它所包含的概念信息，例如对他人负责、做出决策，以及相对于组织中的同侪获得更高的薪水。语言中的词汇就这样切分并连接起社区的共享概念空间（Lupyan 和 Bergen 2015）。这样，“一个概念就是一个单词或短语的含义……[包括]像 ‘<em><strong>red</strong></em>’ 和 ‘<em><strong>grasp</strong></em>’ 这样基本的、具身的单词，以及像 ‘<em><strong>goal</strong></em>’ 和 ‘<em><strong>continuity</strong></em>’ 这样抽象的技术性单词”（Bergen 和 Feldman 2008）。</p>
<p>概念并不作为唯一的信息单位存在于真空中。相反，概念之所以有意义，是因为它们彼此相关（Hannan et al. 2019），“通过相似性和上下文的关系紧密地缝合在一起”（Hofstadter and Sander 2013）。在这种多重概念关系中存在着“我们对世界的大部分知识，告诉我们存在什么以及它们具有什么属性”（Murphy 2002，p.1）。例如，概念 <em><strong>resource</strong></em>  与  <em><strong>firm</strong></em>、<em><strong>constraint</strong></em> 和 <em><strong>natural</strong></em> 等概念相关。在文化系统的层面上，概念之间的相互关系引发了表征概念之间宏观层面有意义的维度。 <em><strong>manager</strong></em> 概念在某些方面与 <em><strong>coach</strong></em> 和 <em><strong>president</strong></em> 的概念很接近，而在其他方面则与<em><strong>employee</strong></em> 和 <em><strong>bureaucracy</strong></em> 的概念很接近。将概念理解为存在于复杂几何空间中的点，使我们能够思考和测量概念之间的距离远近（Hannan 等人，2019）。例如，与  <em><strong>playground</strong></em> 或 <em><strong>ice cream</strong></em> 相比， <em><strong>manager</strong></em> 与<em><strong>organization</strong></em> 和 <em><strong>leader</strong></em> 概念的联系更加紧密。<strong>我们将这种概念相关的多维空间称为概念空间</strong>（Hannan et al. 2019)</p>
<p>重要的是我们用复数来指代概念空间。对于许多单词来说，它们会根据使用的上下文表现出不同的概念信息模式。首先，概念可能会根据使用它们的社会背景而有所不同。例如，如果在执行董事会议室、商品交易大厅或附近的储蓄和贷款机构的背景下说出 “<em><strong>Bank</strong></em>”，指的是银行而不是河流。概念也可能根据使用时间的不同而有所不同。例如，“<em><strong>高科技</strong></em>” 一词所引发的概念关系会根据我们研究的是 1960 年代、1990 年代还是今天而有所不同。最后，概念关系因使用它们的社区而异，因此 “<em><strong>债务</strong></em>” 所捕获的概念将根据其是由首席财务官还是低收入个人使用而有所不同。概念所含信息存在多样性， 正如 Hannan等人（2019）指出，“虽然有些概念可能是天生的或生物驱动的，但大多数都是社会构建的。”</p>
<p><br><br></p>
<h2 id="三先前研究中的概念和概念空间">三、先前研究中的概念和概念空间</h2>
<p>概念，以及由其延伸出的概念空间，是人类思维和交流的基础（Sperber 和 Wilson 1986；Murphy 2002；Hofstadter 和 Sander 2013）。正因如此，概念和概念空间对许多组织理论框架来说或明或暗地至关重要。在某些研究（例如类别研究）中，概念具有核心重要性并已被明确地理论化；而在其他情况下（例如公司的知识基础观），概念只是被隐含地假定，即便它们是决定许多理论预期的基本成分。鉴于概念无处不在，全面回顾组织科学所有领域对概念信息的使用超出了本文的范围。我们将简短且非详尽的回顾集中在概念和概念空间发挥作用的三个领域——<strong>类别、知识和文化</strong>：通过嵌入技术处理并追踪存在于个人和社区头脑中的概念信息，研究其对组织行为和结果的影响。<br></p>
<h3 id="31-类别">3.1 类别</h3>
<p>类别是具有共同特征和属性的实体组。如前所述，概念是类别的心理表征。对类别的研究主要集中在跨类别或类别模糊会提高还是降低被分类实体的估值。自 Zuckerman（1999）以来的工作，一直致力于厘清类别跨越和模糊性在哪些条件下导致积极或消极的估值。许多研究表明，由于感知偏差（Durand et al. 2007）、不符合受众期望（Hsu 2006；Hsu et al. 2009；Leung and Sharkey 2014），或降低了分类对比度（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B119">Negro et al. 2010</a>），跨越类别会损害实体估值。其他研究表明，跨越类别也可以带来积极的估值结果：非典型性可以放大良好表现并缓冲不良表现（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B159">Smith 2011</a>），一个类别可以锚定认知，而另一个类别可以有益地修正认知（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B188">Wry et al. 2014</a>）。还有研究表明，效果取决于受众：有些受众喜欢跨类别，而另一些则不喜欢（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B135">Pontikes 2012</a>）。通过这些方式，类别可以通过影响类别成员资格所含概念信息的解释方式，对行为和绩效产生积极或消极的影响。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn5">4</a></p>
<p>尽管类别范式的贡献历来是通过类别成员的集合和模糊集合理论（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B64">Hannan et al. 2007</a>）来实现的，但最近的工作已开始纳入类别的多维性（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65">Hannan et al. 2019</a>）和分级隶属（graded membership）。组织学者感兴趣的许多现象，都以概念及其所代表的类别之间的精确距离为基础。例如，专利通常根据其贡献的技术领域被划分为类别和子类。然而，即使只被归入一个类别，专利中编码的想法也可以传播到创新空间的广泛领域。正如我们稍后讨论的，转向对概念的几何理解，使分析者能够考虑隶属度、重叠和连续距离影响底层实体评估判断的方式（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65">Hannan 等人，2019）</a>。<br></p>
<h3 id="32-知识">3.2 知识</h3>
<p>众所周知，知识很难具体说明，并且在哲学、认知科学和社会科学领域，围绕其概念性质进行了长期而活跃的争论（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B166">Steup 和 Neta 2020</a>）。然而，过去几十年来，组织科学在微观、中观和宏观层面上进行了大量研究，解决有关知识及其在团队、组织和经济活动中的作用的问题。从对团队成员专业知识的研究（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B164">Srikanth et al. 2016</a>）到公司基于知识和注意力的观点（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B87">Kogut and Zander 1992</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B56">Grant 1996</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B123">Ocasio 1997</a>）；从交互记忆系统（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B143">Ren 和 Argote，2011</a>）到创新流程（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B50">Garud 等，2013</a>）；从组织设计（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B45">Foss et al. 2013</a>）到搜索和探索（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B93">Lavie et al. 2010</a>），知识在最近的组织理论化中发挥着核心作用。</p>
<p>无论人们对知识的定义如何选择，命题性知识从根本上都与概念信息相关。<em><strong>命题知识采取“ S [主体]知道p [命题]”</strong></em> 的形式（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B80">Ichikawa and Steup 2018</a>）。在某种程度上，命题是由语言中的单词编码的，并且单词代表概念信息，命题知识依赖于概念以及它们如何在概念空间中交织（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B110">McGrath and Frank 2020</a>）。以命题“泰勒知道氢的主要工业应用是氨的制造”和“特里知道量子算法可以具有较低的时间复杂度”为例。这些知识命题中的每一个都代表了不同的概念意义，前面提到的领域将以不同的方式操作它们。例如，团队学者可能会强调，由泰勒和特里组成的专利团队将拥有多样化的基础知识。采取基于注意力观点的学者会注意到，泰勒和特里可能会以不同的方式关注知识空间，以应对组织变革。研究创新的人可能会注意到如果泰勒和特里共享办公空间，知识重组的潜力。研究搜索的人可能会假设，为了解决问题，泰勒和特里会以不同的方式搜索概念性解决方案。在所有这些情况下，就这些领域通过诉诸语言编码的命题知识来理论化知识动态而言，它们以基本和可测量的方式参与概念和概念空间。<br></p>
<h3 id="33-文化">3.3 文化</h3>
<p>文化被不同地概念化为集体的共同价值观、故事、框架、工具包和类别（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B52">Geertz 1973</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B131">Pettigrew 1979</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B92">Lamont 和 Small 2008</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B158">Small 等人 2010</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B54">Giorgi 等人 2015</a>）。文化建构已成为组织研究的核心，在从个人和团队到组织和国家的各个层面的分析中都得到了运用（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B54">Giorgi et al. 2015</a>）。从理解文化如何塑造职业结构（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B195">Glynn 2000</a>）、组织领域（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B194">Anteby 2010</a>）和创业环境（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B106">Lounsbury and Glynn 2001</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B141">Rao and Giorgi 2006</a>）到它在讲故事（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B106">Lounsbury and Glynn 2001</a>）和身份建设中的作用（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B196">Ravasi 和 Schultz 2006</a>），从其对人际沟通的塑造（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165">Srivastava 等人，2018</a>）到对组织绩效的影响（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26">Corritore 等人，2020</a>），文化深深地受到概念及其互动方式的调节。文化以集体认知过程为基础（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B36">DiMaggio 1997</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B128">Patterson 2014</a>），很大程度上可以通过语言痕迹来获取。语言进入文化的窗口（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B55">Goldberg et al. 2016</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165">Srivastava et al. 
2018</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26">Corritore et al. 2020</a>）很大程度上是通过它所表达的概念来呈现的，使得概念和概念空间成为组织文化研究的重要支柱。</p>
<p>基于它们在形成类别、知识和文化方面的关键作用，概念和概念空间已成为许多组织理论赖以建立的知识支架的重要组成部分。然而，概念和概念空间通常只被当作缺乏精确、可扩展的经验表征的模糊隐喻来使用。这使相关研究局限于粗粒度的代理测量，或局限于小到可以手工编码和解释的数据集。接下来，我们提出将词嵌入模型作为表示概念和概念空间的最先进工具，加入组织学者的工具包。只要组织学者试图将建立在概念与概念信息之上的构造操作化，这类新模型就能为其提供帮助。考虑到这一点，我们接下来介绍嵌入模型的工作原理，以及它们为什么可以作为概念和概念空间的有效表示。</p>
<p><br><br></p>
<h2 id="四使用词嵌入来表示概念和概念空间">四、使用词嵌入来表示概念和概念空间</h2>
<h3 id="41-越来越多地使用文本作为数据">4.1 越来越多地使用文本作为数据</h3>
<p>过去 10 年，通过计算工具和方法进行的文本数据分析出现了爆炸性增长。从社会学（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B40">Evans and Aceves 2016</a>）到经济学（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B53">Gentzkow et al. 2019</a>）和政治学（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B58">Grimmer and Stewart 2013</a>），文本正迅速成为观察组织、经济和社会生活的中心观察站。文本数据在众多领域提供了丰富的思想和行为痕迹：在线知识社区、财报电话会议和公司报告、产品评价、组织电子邮件和讨论版、历史档案、视频转录和电影字幕、医疗记录、电子商务、社交媒体平台、新闻文章、科学学科等等。总而言之，这些文本数据源比以往任何时候都更深入、更广泛地进入组织生活。正如 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B40">Evans 和 Aceves（2016 年</a>）指出的那样，文本数据现在使我们能够访问“有关正在进行的社交游戏及其背后的社交世界的隐藏元素”的深层信息。然而，这些语料库的庞大规模及其广泛范围意味着，提取理论上有意义的信息信号越来越需要借助计算方法：利用信息技术获取大量非结构化文本数据，并将其转换为有意义且相关的度量。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn6">5</a></p>
<p>文本数据与组织学者习惯使用的定量数据之间的一个主要区别是文本是高维的。正如<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B53">Gentzkow 等人（2019 年</a>）指出，“仅使用英语中一千个最常用单词的 30 个单词的 Twitter 消息样本 [&hellip;] 的维度大致与宇宙中的原子一样多。” 因此，使用文本作为数据的学者的中心任务是通过对数据施加限制来降低维度。<strong>过去二十年里，组织科学中用于降低这一维度的一些最常用的计算工具是词典法、语义网络和主题模型。尽管这些方法有其优点，但一个主要缺点是它们无法对文本中存在的细粒度概念关系和关联进行编码</strong> 。接下来，我们将展示嵌入模型如何利用文本中的局部和更广泛的信息来训练概念含义和概念空间的高保真表示。在此过程中，我们展示了词嵌入模型如何克服先前方法来表示文本中编码的含义的一些局限性，从而允许对理论结构进行更细粒度的测量，并实现新的理论可能性。</p>
<br>
<h3 id="42-词嵌入">4.2 词嵌入</h3>
<p>我们之前解释过，概念是事物类别的心理表征，人类通过在词典中分配一个单词或短语来表示稳定的概念；我们还指出，概念只有在稠密的概念空间中跨多个维度与其他概念相互关联、相互提供信息时才有意义。在这里，我们认为，词嵌入模型这类最近从机器学习发展而来并应用于自然语言处理的模型，使我们能够有效且高效地表示概念空间，并将这些空间用于组织科学研究。词嵌入模型是文本语料库中单词的连续表示，可以进行几何解释。<strong>词嵌入的方法论假设是：一个词的含义很大程度上由出现在其直接和更广泛上下文中的词所决定。这一想法受到结构语言学家的启发，他们已经证明含义的差异与局部分布相关（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B68">Harris 1954</a>）。这个想法如今被称为「分布语义学」，Firth 的著名概括是：“观其伴而知其意”（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B42">Firth 1957</a>，you shall know a word by the company it keeps）。一个单词所代表的概念或含义，可以通过它周围单词的分布来推断</strong>。</p>
<p>以这种分布式方式思考概念和概念空间的底层计算架构，可以追溯到 20 世纪 80 年代初计算机科学家 Geoffrey Hinton 的工作（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B71">Hinton 1986</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B72">Hinton et al. 1986</a>），以及认知科学家在同一时期研究的并行分布式处理模型（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B149">Rumelhart 等人，1986a</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B150">b</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B109">McClelland 和 Rumelhart，1989</a>）。分布式架构是当前各类嵌入语言模型的基础（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115">Mikolov et al. 2013b</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130">Pennington et al. 2014</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35">Devlin et al. 2019</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104">Liu et al. 2019</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B17">Brown et al. 2020</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B41">Fedus et al. 2021</a>）。其中，Word2Vec 算法（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115">Mikolov 等 2013b</a>）相对简单易用，能够处理中等规模的语料库。<strong>Word2Vec 以及 GloVe（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130">Pennington 等人，2014 年</a>）和 FastText（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B13">Bojanowski 等人，2017 年</a>）等嵌入算法，构成了 ChatGPT 及相关模型的基础</strong>。</p>
<p>举个例子来帮助理解算法：假设我们要创建过去 50 年创新活动的概念空间表示。首先需要该概念活动领域的文本数据。美国专利局数据提供了创新活动的踪迹，其中包括所有专利的文本、摘要、描述和权利要求。在整篇论文中，我们使用这个专利摘要语料库，引导读者完成训练这个概念空间并构建相关概念测量的过程。数据可从 <a href="https://patentsview.org/">Patentsview.org</a> 免费下载；本文中的词嵌入模型和相关测量指标，使用 1976 年至 2019 年间发布的所有专利构建。</p>
<p>想象一下，专利语料库中的每个独特单词，一开始都像随机贴在一台巨大冰箱上的 <strong>“word magnet”</strong>（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B76">Hovy 2020</a>）。当连续词袋（CBOW）算法滚动浏览语料库时，它使用每个目标词周围的词（滑动窗口内的上下文）来预测目标词（下文将详述）。该算法的最终目标是产生一个语义模型：出现在相似上下文中的单词彼此接近，而来自不同上下文的单词相距很远。由于 2 维概念空间不足以捕获每个单词的全部含义，该算法转而在更高维（100-1,000 维）的空间内捕捉语义。通过这种方式，目标单词的概念信息从它周围的单词中归纳出来，语料库中的每个单词都被绘制为 <em>n</em> 维空间中的一个坐标或向量。正是单词在这个 <em>n</em> 维向量空间中的相对位置，使我们能够把词嵌入模型解读为代表某一人类概念活动领域的概念空间。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn7">6</a></p>
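<p>CBOW 的“由上下文归纳词义”思想，可以用一个不依赖神经网络的简化草图来体会：先统计词-词共现矩阵，再用 SVD 压缩成稠密向量。注意这只是分布语义思想的示意（语料与结果均为玩具数据），并非 Word2Vec 的 CBOW 算法本身：</p>

```python
# 分布语义思想的极简实现：从玩具语料构建词-词共现矩阵，
# 再用截断 SVD 得到稠密向量。这是“观其伴而知其意”的简化示意，
# 并非 CBOW 神经网络本身。
import numpy as np

corpus = ["patent describes new battery technology",
          "patent describes new laser technology",
          "cat sat on the mat",
          "dog sat on the mat"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# 对称窗口（前后各 2 个词）的共现计数
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - 2), min(len(sent), i + 3)
        for j in range(lo, hi):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# 截断 SVD：取前 3 个奇异方向上的坐标作为每个词的稠密向量
U, S, Vt = np.linalg.svd(C)
emb = U[:, :3] * S[:3]

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# battery 与 laser 出现在相同上下文中，向量应当非常接近；
# battery 与 cat 的上下文毫不相干，向量应当近乎正交
sim_tech = cos(emb[idx["battery"]], emb[idx["laser"]])
sim_cross = cos(emb[idx["battery"]], emb[idx["cat"]])
```

<p>真实研究中通常直接使用 gensim 等库训练 Word2Vec，但上面的草图展示了“上下文分布决定向量位置”这一共同原理。</p>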
<p>概念意义的识别假定了嵌入空间的可解释性。接下来，我们提出了对这些概念空间的一系列提示和测量，作为从中产生结构化解释的方法。这很像心理学家使用 <strong>心理测量调查</strong> 将概念印象转化为可解释的观点（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B112">Michael Furr 2021</a>），或者<strong>认知人类学家使用排序和排名等结构化任务（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B163">Spradley 2016</a>），将概念性的世界观转变为可解释的世界观</strong>。我们认为嵌入模型必须接受结构化测量（就像向人类受试者提供的心理测量问卷），才能使其 <strong>概念景观（conceptual landscape）</strong> 变得可解释。接下来，我们将引导读者完成用专利语料库训练创新概念空间表示的过程。之后，我们概述该方法的优点和局限性，并指出这些方法与先前的文本分析方法和组织研究实践的关系。</p>
<br>
<h3 id="43-选择语料库">4.3 选择语料库</h3>
<p>学者可以根据应用使用两种词嵌入模型。一方面，研究人员可以使用自有文本语料库来训练表示， 据此了解文本所涉主体(个人、团体、社会)行为的概念空间是什么样子， 以及概念关系揭示人类活动背景。在我们的示例中，专利创新在专利语料库中得到了很好的体现，因此我们在下面展示了如何从头开始训练概念空间表示, 以及它揭示了哪些概念联系。研究人员可以从头开始训练语料库的其他例子包括在线社区（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18">Burtch et al. 2021</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2">Aceves et al. 2022</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B23">Chambers et al. 2022</a>）、学术学科（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74">Hofstra et al. 2020</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B102">Lin et al. 2022</a>） 、劳动力市场（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B9">Bana 2022</a>）、公共记录（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B6">Arseniev-Koehler et al. 2022</a>）、产品和公司描述（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61">Guzman and Li 2023</a>）以及财报电话会议和公开演讲（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B85">Kirgil and Voyer 2022</a>）。</p>
<p>或者，如果研究人员想要在较小的语料库中追踪概念动态，而该语料库的大小不足以训练独特的、特定于上下文的嵌入，那么研究者可以使用预训练嵌入模型。需要注意，预训练嵌入模型所用的训练文本应与研究者的小语料库在内容、场景上具有相似性。广泛使用的预训练嵌入已经在来自海量语料库的文本上进行了训练，例如新闻（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B4">Google 2013</a>）、维基百科（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35">Devlin et al. 2019</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B57">Grave et al. 2018</a>）。训练这些预训练嵌入模型的文本语料体量很大，内容题材往往包含我们较小文本样本中存在的概念，因此预训练嵌入已对这些概念的信息进行了编码，可用于近似相关距离。政治和历史语义背景下的研究发现，预训练嵌入提供的结果与特定于上下文的嵌入相当（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski et al. 2019</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145">Rodriguez and Spirling 2022</a>）。如果有理由相信研究项目中包含的概念和想法没有在这些大量预训练嵌入中得到很好的体现，研究人员可以使用较小语料库中的文本对其进行 <strong>微调（Fine-Tune）</strong>（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104">Liu et al. 2019</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18">Burtch et al. 2021</a>）。微调会将预训练的概念空间扭曲为与样本一致（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B104">Liu et al. 2019</a>），从而更好地反映概念之间的关系。</p>
<p>最后，使用哪一种嵌入(自己训练的嵌入、 预训练的嵌入、微调的嵌入)将取决于研究人员的目的以及他们寻求追踪的概念动态的类型。接下来，我们将重点描述从头开始训练和验证嵌入模型的过程。在接下来的部分中，我们讨论不同参数设置和策略之间的权衡，并鼓励读者遵循文章文本和在线附录。</p>
<br>
<h3 id="44-清理语料库">4.4 清理语料库</h3>
<p>训练嵌入模型的第一步是使用 Python 等编程语言读入文本语料库：首先获取每个专利摘要中的文本，将文本转为小写、删除标点符号和数字字符串，并将每个摘要切分为称为 token 的单词列表。但这样可能破坏一些词组语义，因此这里使用 <em><strong>bi-gram</strong></em>，将高频共现的词合并为词组：例如当 <em><strong>“electric”</strong></em> 和 <em><strong>“vehicle”</strong></em> 这两个词在某些上下文中一起出现时，它们将被合并为短语和概念 <em><strong>“electric_vehicle”</strong></em>。建立单词或短语列表后，执行词嵌入算法来学习单词或二元组及其语言上下文之间的最佳距离，以保留语言中单词和短语的概念空间。</p>
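<p>上述清洗与 bi-gram 合并步骤可以用如下 Python 草图来示意。这是一个假设性的最小示例：bi-gram 的识别采用简单的共现频次阈值，而实际工作中通常使用 gensim 的 Phrases（其打分公式更复杂）：</p>

```python
import re
from collections import Counter

def tokenize(text):
    """小写化、去掉标点和数字串，返回 token 列表"""
    return re.findall(r"[a-z_]+", text.lower())

def apply_bigrams(docs, min_count=2):
    """把出现次数达到阈值的相邻词对合并为 bi-gram，如 electric_vehicle"""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {p for p, c in pair_counts.items() if c >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

docs = [tokenize("An electric vehicle with a battery 123."),
        tokenize("The electric vehicle charger.")]
docs = apply_bigrams(docs, min_count=2)
```

<p>对真实专利摘要语料，只需把 docs 换成全部摘要的 token 列表即可。</p>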
<br>
<h3 id="45-训练嵌入模型">4.5 训练嵌入模型</h3>
<p>第一步是选择词嵌入算法，可选方案包括：浅层神经网络构建的单词表示（例如 Word2Vec、FastText；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115">Mikolov 等人，2013b</a>）、共现矩阵的低秩近似（GloVe；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B130">Pennington 等人，2014</a>），或来自 Transformer 的深度上下文嵌入（例如 BERT、<em>GPT</em>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35">Devlin 等人 2019</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B139">Radford 等人 2022</a>）。这些不同算法的输出都可以被解释为<em>n</em>维概念空间，其中单词或短语由空间内的向量位置表示。本文只介绍 Word2Vec，它是一种广泛使用的训练概念空间的算法（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B113">Mikolov 等人，2013a</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115">b</a>）。</p>
<p>Word2Vec 的一种流行实现是连续词袋 (CBOW) 算法，可以在 Gensim python 库中轻松访问。该算法使用目标单词的语言上下文来预测被扣掉的目标词（可以简单地理解为让机器做完形填空题），比较适合小规模数据集。Word2Vec 还实现了另一种 Skip-Gram 算法，该算法通过从目标单词预测上下文单词来反转 CBOW 的预测任务，比较适合大规模数据集：skip-gram 将每个上下文-目标对（例如，T：“房子”，C：“宽敞”）视为单独的观察，因此可以更好地捕获精确的语义，但需要更大的语料库才能获得卓越的性能。</p>
<br>
<h3 id="46-维数">4.6 维数</h3>
<p>考虑维数很有必要。朴素的模型可以将不重复总词数作为维度：例如对于包含 100,000 个不重复单词的语料库，任何单词都需要 100,000 维才能准确表示。然而，当单词从上下文中被识别为相似时，可以在一定范围内减少维度数。<strong>维度过多会导致内存需求和冗余增加，并降低可解释性；维度太少会扭曲距离并且无法解释语言的不及物性</strong>。因此，只要维度数量至少足以捕获所讨论的复杂语义关系，就可以获得准确的预测。</p>
<p>在实践中，300 维已经成为一个标准，这很大程度上源于最初的 Word2Vec 论文之后形成的惯例，该论文通过交叉验证确定了最佳维数，以减少预测屏蔽词任务中的错误。大多数后续分析都建立在较小、多样性较低的文本集合上，需要较少的维度，因此 300 通常被用作上限。最近的工作表明，应根据语料库统计数据选择维度：语料库词汇表中成对等距单词的数量提供了维度数量的下限，低于此界限通常会导致单词嵌入质量下降（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B127">Patel 和 Bhattacharya 2017</a>）。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74">Hofstra 等人 (2020)</a> 使用 100、200 和 300 维的模型找到了稳健的结果。</p>
<p>如果分析师寻求实现维度可解释性，他们必须以最小失真来确定表示数据所需的维度数。但这最后一步往往很少执行，因为维度的优化需要大量的时间和计算资源。</p>
<br>
<h3 id="47-窗口尺寸">4.7 窗口尺寸</h3>
<p>回想一下，窗口大小是指算法在处理焦点目标词时所纳入的、位于该词之前和之后的单词数量。该窗口最小可以是 1。对于较小的窗口，算法将倾向于对句法关系进行编码（例如，名词后跟动词）。<strong>随着窗口大小的增加，更多的含义和语义被编码到模型输出中</strong>。考虑 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145">Rodriguez 和 Spirling (2022)</a> 的示例，其中包含两个句子的语料库：(1)“狮子吃肉”和 (2)“牛吃草”。当窗口大小为一时，我们会知道牛和狮子都吃东西，从这个意义上说，牛和狮子在语法上是等价的，因为我们没有足够的信息来区分两者。然而，随着窗口的增加，算法开始对牛与狮子各自的含义进行更多编码。<strong>与维度数量一样，这里的回报也递减，窗口大于五个词后模型性能仅略有改善</strong>（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B145">Rodriguez 和 Spirling 2022</a>）。<strong>BERT 和 GPT 系列等上下文模型具有更大的窗口，这些窗口通过注意力机制加以约束，算法通过该机制识别哪些上下文单词对于解释焦点单词的含义很重要</strong>（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B176">Vaswani 等人，2017</a>）。</p>
<br>
<h3 id="48-验证模型">4.8 验证模型</h3>
<p>最后一步是验证词嵌入模型，这样做是为了确认算法学习的表示与文本数据所承载的真实人类活动的概念空间表示尽可能相近。论文附录第 2 节描述了关于专利嵌入的七个详细验证程序，表明该模型有效地学习了创新空间的表示。这些包括（1）邻近嵌入词的语义相似性；(2)具有嵌入距离的语义梯度；(3)嵌入簇与语义域之间的对应关系；（4）物理世界距离与嵌入之间的相关性；(5) 社会距离与嵌入之间的相关性；(6) 嵌入空间类比推理的准确性；(7)嵌入文档的语义一致性。我们还讨论了第八个“额外”测试，即图灵测试（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B174">Turing 1950</a>）。由 Transformer 支持的现代上下文嵌入的评估标准是它们是否能够与人类毫无区别地参与任何分类、关联、意义生成或集成任务，包括普通对话和专家教程。OpenAI 的 ChatGPT 和许多竞争的聊天机器人已经展示了如此强大的性能，以至于图灵测试正在迅速从上限转变为基线基准。这些验证步骤与论文最后部分的测量相结合，作为嵌入模型的有用提示prompt和测量，使研究人员能够对其编码的概念空间提供结构化解释。</p>
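<p>其中的语义相似性检验与类比推理检验，可以用余弦相似度直接实现。下面用 numpy 和几个假设的玩具向量示意（真实验证应使用训练好的专利嵌入模型）：</p>

```python
import numpy as np

# 玩具向量（假设来自训练好的嵌入模型，数值仅作演示）
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.1]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(target, exclude=()):
    """语义相似性检验：返回与 target 向量余弦相似度最高的词"""
    scores = {w: cos(target, v) for w, v in vecs.items() if w not in exclude}
    return max(scores, key=scores.get)

# 类比推理检验：king - man + woman ≈ queen
analogy = vecs["king"] - vecs["man"] + vecs["woman"]
result = most_similar(analogy, exclude={"king", "man", "woman"})
# result 为 "queen"，说明该玩具空间通过了类比检验
```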
<br>
<h3 id="49-词嵌入方法的优点和缺点">4.9 词嵌入方法的优点和缺点</h3>
<h4 id="491--无需正式指定相关尺寸">4.9.1  无需正式指定相关尺寸</h4>
<p>对概念建模的正式尝试试图通过逻辑演绎方法清楚地枚举概念的相关维度（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B47">Gärdenfors 2004</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B48">Gardenfors 2014</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65">Hannan 等人 2019</a>）。尽管这种方法对于理解限定领域内的概念很有用，但即使如此，它也可能不切实际且难以衡量，因为很难先验地陈述分析师应预期的相关维度<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B73">（Hofstadter 和 Sander 2013 ）</a>。 <strong>词嵌入的优点在于，概念之间的关系以及对任何给定概念重要的相关维度可以从语言的使用方式中推断出来，因此不需要事前指定</strong>。鉴于在分析之前没有必要陈述相关维度，即使是最复杂的组织行为剧场也变得易于分析处理。正如其他人所指出的，“词嵌入为语言中包含的多个维度的含义提供了全面且有意义的见解，这是以前的方法无法捕获的”（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105">Lix 等人，2022 年</a>，第 8434 页）。在某种程度上，这种优势源于这样一个事实：神经网络架构能高效地记录意义的维度。</p>
<br>
<h4 id="492-更大的有效维度">4.9.2 更大的有效维度。</h4>
<p>嵌入通常由 100 到 1,000 个密集编码维度表示。<strong>编码的密度意味着每个词向量在所有建模维度上都有一个非零坐标</strong>。正如附录中所指出的，主题模型可能具有相同数量的主题（例如，100-1,000），但这些主题被稀疏编码以方便人类解释，使得每个主题仅有少量实质非零的单词载荷，并且每个文档仅有少量非零的主题载荷。<strong>因此，主题模型是为了描述而构建的，但代价是迫使其表示的有效维度从数百个减少到几个，从而扭曲了本来可以在主题空间内计算的距离。相比主题模型，词嵌入使用密集编码，单个维度的含义很难理解和描述，但距离具有更大的自由度，可以更精确地编码含义</strong>。通过这种方式，相对于低维理论和测量，嵌入为分析师提供了“大量潜在轴，个人和社会群体可以沿着这些轴竞争、合作、分裂或合并”（Kozlowski et al. 2019，p.27 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">）</a>。</p>
<br>
<h4 id="493-无监督训练">4.9.3 无监督训练。</h4>
<p><strong>词嵌入还有一个特殊优点，即训练模型时以看似无监督或自监督的方式进行，从而避免了手动编码文本语义内容的繁琐，完全交由机器学习</strong>。在我们的创新示例中，向量空间由我们专利语料库中的每个发明人按照他们所写句子的数量和长度的比例进行监督。每个单词的滑动窗口都是为了向专利审查员和未来的发明者传达一种含义而构建的，该算法据此构建向量空间并以概念上适当的方式定位单词。因此，学者们可以利用专利语料库来训练 <strong>技术创新</strong> 的概念空间，利用财报电话会议记录和新闻稿来训练 <strong>上市公司沟通</strong> 的概念空间，利用分析师报告来训练 <strong>投资分析</strong> 的概念空间，或者使用内部通信（例如 Slack 和电子邮件）来训练特定领域的概念空间，以了解公司的知识。这些概念空间可以在最少的监督下进行训练，因此很快成为有价值的观察站，用于追踪组织科学家关注的组织生活的静态和动态（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74">Hofstra et al. 2020</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184">Whalen et al. 2020</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18">Burtch et al. 2021</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B177">Waller 和 Anderson 2021</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2">Aceves 等人 2022</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20">Carlson 2022</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B23">Chambers 等人 2022</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61">Guzman 和 Li 2023</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94">Lawson 等人 2022</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105">Lix 等人 2022</a>）。</p>
<br>
<h4 id="494-共现是不必要的">4.9.4 共现是不必要的。</h4>
<p><strong>这些模型的另一个优点是，两个概念不必在任何文档中同时出现，就可以将它们编码为相似的向量</strong>。所需要的只是它们与相似的概念同时出现。例如，我们可以先验地指出 <em><strong>医生</strong></em> 和 <em><strong>律师</strong></em> 在某些方面非常相似（例如，他们需要高级学位，具有高收入水平等），但他们可能永远不会同时出现在语料库的同一文档中。尽管彼此之间缺乏共现性，但它们很可能都独立地与<em>高收入</em>、<em>高学历</em>、<em>白领</em>等概念同时出现，从而最终拥有编码这些相似性的接近向量。<strong>因此，嵌入模型的底层计算架构可以更好地近似社会和文化含义，而无需求助于严格的共现</strong>。</p>
<br>
<h4 id="495-上下文相关的含义结构">4.9.5 上下文相关的含义结构。</h4>
<p><strong>使用定制训练的嵌入模型的一个优点是它将捕获上下文相关的含义结构</strong>。例如，<em><strong>“甜”</strong></em> 的含义在软件团队的背景下与 <em><strong>烹饪</strong></em> 的背景下会有所不同。正如<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105">Lix 等人。（2022）</a>指出，在软件团队的背景下，与 <em><strong>“甜蜜”</strong></em> 最接近的术语是 <em><strong>“强烈”</strong></em>、 <em><strong>“兴奋”</strong></em> 和 <em><strong>“耶”</strong></em>。此外，就同一个单词编码不同概念（一词多义）而言，单词每种含义的概念信息都位于单词嵌入内的线性叠加（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B5">Arora et al. 2018</a>）。这意味着编码诸如 <em><strong>“Bank”</strong></em> 之类的单词的<em>n</em>维向量包含其代表的所有概念的概念信息，例如 <em><strong>河边</strong></em> 或 <em><strong>金融机构</strong></em>。通过这种方式，即使在多义词的情况下，单词的上下文相关含义也会被编码到模型中。当这些上下文相关的含义不仅不同，而且是排他的或相反的时，来自转换器的上下文相关嵌入可以为上下文中的每个单词呈现不同的单词向量。</p>
<br>
<h4 id="496-几何有助于概念人群体和组织的细粒度表示">4.9.6 几何有助于概念、人、群体和组织的细粒度表示。</h4>
<p><strong>我们认为，词嵌入模型可以在训练的语料库范围内产生人类活动概念空间的细粒度表示</strong>。<strong>这意味着，从概念空间内编码的信息中，我们可以恢复个人、群体和组织本身的细粒度表示</strong>。以我们的创新案例为例，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F1">图 1</a>描述了在说明性二维空间中这是如何实现的。学习到的概念空间将由单词或短语 w 表示的概念作为其最原子的分析级别。我们的简化示例显示了在 2 维空间中排列的九个单词。单词 1-3 由发明人 1 使用，单词 4-6 由发明人 2 使用，单词 7-9 由发明人 3 使用。<strong>通过获取每个人所用单词向量的质心向量，我们可以得出每个发明人在创新概念空间中的向量</strong>。<strong>将这个过程提升到团队和组织级别，我们可以在概念空间内得出发明人团队和组织的独特向量</strong>。因此，词嵌入架构不仅在概念的最原子级别上是细粒度的，而且还可以在更聚合的级别上提供细粒度的表示。相对于团队多样性、组织差异化和注意力等构念的粗粒度代理指标，这形成了显著的测量改进；这些构念在被嵌入特定概念空间时才真正有意义。</p>
<br>
<p><img loading="lazy" src="img/figure-1.jpeg" alt=""  />
<strong>图 1.嵌入作为概念、人员、群体和组织的细粒度表示</strong></p>
<br>
<h4 id="497-细粒度几何减少了上下文信息的丢失">4.9.7 细粒度几何减少了上下文信息的丢失。</h4>
<p><strong>由于粗糙、粗粒度的代理指标无法承载相关信息，在实证分析和相关理论构建中就无法利用这些信息</strong>。嵌入模型的优势在于其独特的信息表征，可以携带更多的信息，信息的粒度更小，保存的信息量更多。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F2">图 2</a>使用团队多样性的示例来说明如何实现这一点。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F2">图 2(a)</a>显示了两个团队，1 和 2，每个团队要么通过熵（一种标准的集合论多样性理论度量；顶行）来表示，要么通过概念广度（基于团队成员所调动的底层概念信息的细粒度度量；底行）来表示。团队 1 和团队 2 都有四名成员，团队 1 由两名生物化学家、一名化学家和一名分析化学家组成，团队 2 由两名生物化学家、一名海洋学家和一名计算机科学家组成。<strong>由于两个团队的团队成员类型比例相同，因此它们都被编码为具有相同的团队多样性熵度量 1.5</strong>。<strong>然而，当考虑团队成员的概念信息时，我们发现它们是本质上不同类型的团队，团队 1 的多样性或概念范围远不如团队 2</strong>。这表明粗粒度的测量可能会留下未开发的有价值的上下文信息（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B187">Wolpert et al. 2014</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B33">DeDeo 2017</a>）。因此，我们应该看到更细粒度的衡量标准与相关的、理论上的绩效结果之间的联系更加紧密和一致。</p>
<br>
<p><img loading="lazy" src="img/figure-2.jpeg" alt=""  />
<strong>图 2.（在线彩色）细粒度表示可防止有价值的信息丢失</strong></p>
<br>
<p>专利数据集使我们能够通过三种构建的措施来说明这一主张。首先，集合论团队多样性度量，使用团队先前专利在专利主要类别中的分布（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B79">Huo 等人，2019</a>）。第二种替代措施使用专利子类，以便它们提供相对于第一种更细粒度的措施。第三个衡量标准依赖于团队成员先前专利在创新概念空间内的<strong>概念广度</strong>。</p>
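<p>图 2 中“熵相同而概念广度不同”的对比可以用几行 Python 验证。以下是一个最小示意，团队成员的向量为假设的玩具数据：</p>

```python
import math
import numpy as np

def entropy(labels):
    """集合论多样性：成员类别分布的香农熵（以 2 为底）"""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def conceptual_breadth(vectors):
    """概念广度：各成员向量到团队质心的平均欧氏距离"""
    vecs = np.asarray(vectors, dtype=float)
    centroid = vecs.mean(axis=0)
    return float(np.linalg.norm(vecs - centroid, axis=1).mean())

# 图 2 的例子：两队的类别比例相同，因此熵同为 1.5
team1 = ["biochem", "biochem", "chem", "analytical_chem"]
team2 = ["biochem", "biochem", "oceanography", "computer_sci"]

# 但在（假设的）概念空间坐标中，团队 2 明显更分散
team1_vecs = [[0.0, 0.0], [0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
team2_vecs = [[0.0, 0.0], [0.0, 0.1], [3.0, 0.0], [0.0, 4.0]]
```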
<br>
<h4 id="498-词嵌入的局限">4.9.8 词嵌入的局限。</h4>
<p>到目前为止，我们的注意力仅限于讨论嵌入模型的结构，描述它们与概念空间的关系，并注意到它们的优点。在这里我们将说明其局限性，讨论它们的严重性、改善方式，以及在哪些情况下不宜使用词嵌入。我们讨论三类限制。第一类源于神经网络模型一般而言复杂的“黑匣子”性质，以及由此带来的具体挑战，涉及输入数据的偏差，以及模型可以正确推理的范围，特别是那些在超出分析师研究背景的数据上预训练的模型。第二类与这些模型的大小以及训练它们所需的数据量有关。第三类涉及词嵌入模型的具体局限性，以及从脱离韵律和表达上下文的文本数据中分析含义的挑战。</p>
<p>许多学者首先担心的是，多级神经网络模型显得复杂且在统计上难以理解，<strong>经常被批评为“黑匣子”方法</strong>，无法“打开”以询问其性能背后的机制（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B86">Knight 2017</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B95">Leavitt et al. 2021</a>）。现代神经网络词嵌入模型通常作为自监督模型实现，该模型启发式搜索单词之间的依赖关系空间以预测屏蔽词的身份。<strong>自从第一个高性能嵌入发布（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B115">Mikolov 等人，2013b</a>）以来，对其黑盒性质的一些担忧已经减弱，因为数学家发现，最流行的“浅层”词嵌入模型（如 Word2Vec 和 FastText）的强大性能来自于对易于理解的矩阵分解方法的近似</strong>，例如因子分析、主成分分析和对应分析（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B100">Levy 和 Goldberg 2014</a>）。</p>
<p>“黑盒”输入输出方法带来的一个相关潜在限制是，<strong>输入的偏差将转化为输出中的偏差</strong>：用于训练嵌入的语料库的偏差将被编码在生成的单词嵌入模型中（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B14">Bolukbasi等人，2016</a>）。当模型用于现实世界的下游应用程序（例如推荐服务）时，这可能是有害的。例如，硬编码到嵌入中的种族和性别刻板印象可能会导致有偏见的建议（例如，评估是否适合招聘职位或预测财务违约的可能性），并导致不公平和不道德的决定（例如，拒绝工作或信贷）。学者们应该根据他们的研究问题和设计，主动考虑这种负外部性是否可能，并在对人类造成伤害的可能性足够高时放弃使用嵌入。<strong>然而，当理解社区和研究背景中概念关联的本质正是研究核心时，研究人员反而需要保留这些偏见以进行分析。如果不包括它们，模型以及研究设计就会错过表征其研究背景的关键社会和文化规律。</strong></p>
<p><strong>如果分析人员对生成语料库的上下文没有清晰的了解，就会出现另一个相关的限制：他们最终可能会做出不适用和不相关的推论</strong>。例如，强调意义随时间变化的研究（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19">Caliskan et al. 2017</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49">Garg et al. 2018</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski et al. 2019</a>）的特点是，词义会因外源冲击而表现出间断式变化，从而重新配置整个空间中概念关联的结构。想一想 2005 年卡特里娜飓风之后“卡特里娜”的含义发生了怎样的变化。2009 年金融危机之后，金融术语的含义发生了重新配置，部分原因是添加了“问题资产救助计划”等许多新术语。忽略外源冲击可能会导致对后文及验证部分中所述指标的错误解释，将其误认为仅由渐进演化产生的结果，从而导致错误的推论。这是一个特别成问题的问题，因为许多最准确的词嵌入模型都是在从网络上提取的大量文本语料库上进行预训练的。此类模型可用于引导非常小的文本数据之间的有意义距离，这是一项常见任务，但<strong>如果预训练数据是异构的，则距离可能无法反映焦点文本的概念世界</strong>。</p>
<p>接下来的两个限制必然是其嵌入优势的另一面。词嵌入模型产生的细粒度信息会带来特定研究可能或可能无法维持的成本。首先是模型尺寸。<strong>每个单词的数百个维度的细粒度信息或上下文嵌入需要比简单的字典计数或潜在狄利克雷分配主题模型更大的存储空间</strong>。这与通常用于将数据维度减少到两个或三个的因子和主成分分析形成鲜明对比。词嵌入模型使用更多维度（通常为 200-500）来更准确地预测数据的屏蔽部分。尽管如此，当前个人计算机的计算能力和存储能力现在允许训练合理大小的嵌入。</p>
<p><strong>与此相关的是，词嵌入模型需要比先前模型更多的文本才能稳健地估计概念空间</strong>。当大型语料库与研究主题相似并且可以用作理论相关文档或微调过程的初始化的代理时，可以通过迁移学习来弥补这一挑战。<strong>然而，有时相关语言在内容、目的或形式上与模型预训练的数据有很大不同，它需要独立建模，但又足够小，无法维持对嵌入模型的稳健估计。在这种情况下，使用字典计数或主题模型可能会更好，因为数据只能维持粗粒度的关联，而这些方法旨在捕获粗粒度的关联。</strong></p>
<p>最后一类涉及词嵌入和文本方法特有的限制。首先，静态词嵌入本身并不处理一词多义，即一个词（例如 <em><strong>“bank”</strong></em>）编码多个概念（例如金融机构、河边、侧向倾斜）的情况。尽管多义词的存在可能会影响后续一些指标的测量，但也存在抵消的力量。一方面，研究发现多义词的含义以相互线性叠加的方式编码在单词向量内（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B5">Arora et al. 2018</a>）。这意味着该算法通过同时考虑单词的所有含义来对单词在概念空间中的位置进行编码，从而克服了原本可能存在的严重缺陷。另一方面，上下文嵌入架构（在线附录中有更详细的描述）通过根据焦点词周围的上下文输出不同的向量来明确解决多义词的问题：每个单词不是单个向量，而是根据用途而变化的向量云。如果分析师怀疑一词多义可能是特定分析的严重问题，他们可以转而使用上下文嵌入来规避这种担忧。</p>
<p>最后一个潜在的限制是文本方法的一般特征。只要文本数据是转录语音话语的产物（例如，欧洲央行或美联储主席演讲、政治演讲、财报电话会议、电视或电影文字记录、对话互动），语音的语调、语气和音色将不会被纳入嵌入表示之中。<strong>鉴于某些语言（例如中文）更严重地依赖语调来传达含义，这可能或多或少存在问题，具体取决于话语发生的社会背景及其表达语言</strong>。因此，在语调和语气在语料库中发挥重要作用的情况下，学者们应该讨论他们的嵌入模型选择和解释决策的后果。</p>
<br>
<h3 id="410-在研究中使用词嵌入模型的路线图">4.10 在研究中使用词嵌入模型的路线图</h3>
<p>现在我们对词嵌入模型是什么、它如何表示概念空间、如何训练以及它的优点和局限性有了框架性的认知，接下来可以将它们整合到研究和理论构建的标准方法中。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#T1">表 1</a>列出了如何将嵌入模型集成到科学流程中的路线图。</p>
<ul>
<li>步骤 1-3 是研究过程中的标准步骤，包括确定一个可行且有趣的研究问题，通过在适当的实证背景下进行评估，为重要的理论问题提供信息（Weick 1989 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B179">）</a>。</li>
<li>步骤 4-9 总结了本文到目前为止对嵌入模型的讨论。</li>
<li>步骤 10 和 11 ，与下一节指标度量有关，通过标准定量和定性方法调动该度量。</li>
</ul>
<br>
<p><strong>表 1.在研究中使用词嵌入模型的路线图</strong></p>
<table>
<thead>
<tr>
<th>步骤</th>
<th>活动</th>
<th>基本原理</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. <strong>确定研究问题</strong></td>
<td>确定文本数据是否有助于回答重要的理论研究问题。</td>
<td>吸引研究人员把注意力聚焦在理论问题、词嵌入构建研究构念回答问题的交叉点。</td>
</tr>
<tr>
<td>2. <strong>理论建立及相关理论构建</strong></td>
<td>确定使用哪种理论框架来解决研究问题以及通过嵌入模型来操作哪种理论构念。</td>
<td>理论构念与其词嵌入指标(构念的衡量）之间的紧密联系能够实现累积的理论发展。</td>
</tr>
<tr>
<td>3. <strong>定义经验背景</strong></td>
<td>选择适当的实证背景，在其中回答研究问题并动员理论框架和构念。</td>
<td>确保研究问题、理论框架和构念以逻辑方式相互加强。</td>
</tr>
<tr>
<td>4.<strong>指定将用于表示经验背景的概念空间的文本数据</strong></td>
<td>描述将用于训练词嵌入模型和测量感兴趣的理论构念的文本数据的范围。 数据是否有效地涵盖了您想要得出理论结论的经验背景下的行为活动范围？</td>
<td>确保用于计算理论构造度量的词嵌入模型在逻辑上映射到并有效地代表所提出的理论框架内的实证研究背景。 文本数据的范围应该在逻辑上映射到所讲述的理论故事的范围。</td>
</tr>
<tr>
<td>5.<strong>确定文本数据的大小和范围</strong></td>
<td>数据是否足够大以学习相关概念空间的准确表示？</td>
<td>文本数据的大小将决定是否应该训练自定义嵌入，或者是否应该使用可用数据对现成的嵌入进行微调。</td>
</tr>
<tr>
<td>6. <strong>给定数据大小，要么训练独特的词嵌入模型，要么微调现有模型</strong></td>
<td>如果文本数据足够大，则训练自定义嵌入来表示感兴趣的经验上下文的概念空间。 如果文本数据不够大，请使用这些数据来微调现有的现成嵌入模型。</td>
<td>确保用于测量理论结构的嵌入模型能够有效地表示经验背景的相关概念空间。</td>
</tr>
<tr>
<td>7. <strong>如果训练独特的模型，请选择一种算法</strong></td>
<td>在连续词袋 (CBOW) 或 Skip-Gram 模型之间进行选择。</td>
<td>CBOW：在较小的数据集上可以有更好的性能。 <br>Skip-gram：可以更好地捕获语义。</td>
</tr>
<tr>
<td>8. <strong>如果训练独特的模型，确定相关参数</strong></td>
<td>选择窗口大小和维数。</td>
<td>窗口大小：标准做法是 5。较小的窗口可以更大程度地捕获语法，较大的窗口可以更大程度地捕获语义，但收益递减并增加计算成本。 维度数：标准做法是 300，超过此点后性能回报递减。</td>
</tr>
<tr>
<td>9. <strong>验证词嵌入模型</strong></td>
<td>请遵循在线附录中的验证程序。</td>
<td>确认嵌入模型准确有效地表示了经验背景的概念空间。</td>
</tr>
<tr>
<td>10. <strong>计算相关度量</strong></td>
<td>通过确定将用于实施感兴趣的理论构念的相关概念集，创建“实际措施和应用”部分中的措施之一。</td>
<td>使学者能够将该测量用于定量或定性分析。</td>
</tr>
<tr>
<td>11. <strong>在标准定性或定量方法中使用计算的度量</strong></td>
<td>对于定量分析，该度量要么成为自变量，要么成为因变量。 对于定性分析，学者可以提供解释性分析，因为它们可能适用于其他类型的档案、民族志或视听数据。</td>
<td>嵌入模型表示对生成数据的社会背景的概念空间的描述。</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="五实际措施与应用">五、实际措施与应用</h2>
<p>现在已经正式定义了 <strong>概念</strong> 和 <strong>概念空间</strong> 的含义，并说明了先前的文献如何处理概念信息,  介绍了嵌入模型表示能力的底层逻辑，并在在线附录中完成了支持这种直觉的几个验证步骤。也评论了嵌入模型给概念信息分析带来的几个优点和相关缺点。</p>
<p>在本章中，我们将介绍一些新研究， 学习他们如何用嵌入生成独特指标。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#T2">表 2</a>总结了这些指标及示例应用。</p>
<br>
<p><strong>表 2.词嵌入测量和示例应用</strong></p>
<table>
<thead>
<tr>
<th>措施</th>
<th>研究</th>
<th>关键构念</th>
<th>研究问题</th>
<th>代表性调查结果</th>
<th>嵌入在这种情况下的优点</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. <strong>概念广度</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105">Lix 等人 (2022)</a></td>
<td>话语多样性——在一组给定的互动中，群体成员所传达的含义彼此分歧的程度。</td>
<td>一个群体的话语多样性如何影响其绩效？</td>
<td>高绩效团队会调整他们的共享认知以匹配任务的要求（例如，构思与协调）。</td>
<td>能够随着时间的推移以细粒度的细节和动态地追踪小组对话的概念广度，使学者们能够追踪话语多样性的新理论构造。</td>
</tr>
<tr>
<td>2.<strong>概念距离和相似度</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74">Hofstra 等人 (2020)</a></td>
<td>语义遥远的科学新颖性：博士论文中新链接概念的语义距离。</td>
<td>代表性不足的群体是否更有可能产生科学创新？</td>
<td>相对于男性，女性引入了更遥远的新奇事物。 然而，这种语义上遥远的新颖性在该学科中很少受到关注。</td>
<td>能够追踪新概念组合的概念距离，使学者不仅可以研究是否做出了新组合，还可以研究这些组合的语义距离最终如何影响其影响。</td>
</tr>
<tr>
<td>3.<strong>概念X性</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94">Lawson 等人 (2022)</a></td>
<td>性别刻板印象：男性（而非女性）与以成就为导向的代理特征（例如自信和果断）相关的程度。</td>
<td>雇用女性首席执行官和董事会成员是否与组织对代理语言的性别使用发生变化有关？</td>
<td>当组织雇用女性首席执行官和董事会成员时，女性的语义与代理的语义变得更加一致。</td>
<td>对 22 家标准普尔 500 强公司的 43,000 多份文件（包含超过 12 亿字）进行分析，深入细致地研究女性的含义如何因聘用女性领导者而发生变化。否则这样的分析是不可能的。</td>
</tr>
<tr>
<td><strong>4.概念意义</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63">Hamilton 等人 (2016)</a></td>
<td>词语的文化意义：词语的含义随时间变化的程度。</td>
<td>语义演化的可能驱动因素是什么？</td>
<td>跨历史时期的语义变化率与词频的逆幂律成正比。 与频率无关，具有更多含义的单词具有更高的语义变化率。</td>
<td>能够探索跨多个知识和文化领域的大型历史时期和大量文本中的语义变化。例如，他们可以详细追踪同性恋这个词的含义如何从<em>快乐</em>和<em>艳丽</em>等概念转向<em>同性恋</em>和<em>女同性恋</em>等概念。</td>
</tr>
<tr>
<td>5. <strong>文化和知识连续体中的概念立场</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski 等人 (2019)</a></td>
<td>社会阶层标记：区分社会阶层维度的概念。</td>
<td>20世纪社会阶级的标志是如何变化的？</td>
<td>尽管社会阶级维度在历史上保持稳定，但阶级文化标记在每个维度中的定位方式却不断发生变化（例如，员工从<em>士兵</em>和<em>肌肉</em>等概念转变为<em>白领</em>和<em>中产阶级</em>等概念）。</td>
<td>能够将文化相关的概念投射到文化相关的兴趣连续体上，从而使研究人员不仅可以在单个历史时期内而且可以在其历史演变过程中了解广泛共享的社会关联。</td>
</tr>
<tr>
<td>6. <strong>概念维度</strong></td>
<td><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski 等人 (2019)</a></td>
<td>阶级的文化维度：理解社会阶级的维度（富裕、教育、修养、地位、就业、道德、性别）</td>
<td>20 世纪文化阶层的规模有多稳定？</td>
<td>20世纪，尽管发生了巨大的经济转型，阶级规模仍然非常稳定。</td>
<td>能够对阶级的多个概念维度进行实证分析，从而理解 20 世纪美国它们之间的相互关系。</td>
</tr>
</tbody>
</table>
<br>
<h3 id="51-概念广度">5.1 概念广度</h3>
<h4 id="511-指标">5.1.1 指标</h4>
<p><strong>可以测量文档中单词之间的距离来计算它们在概念空间中的分布范围</strong>。文档可以是从专利到个人电子邮件通信的任何内容。我们可以测量每个单词与其他单词的平均距离有多远。<strong>获取文档内元素的平均距离（或每个单词与文档质心之间的距离）可以衡量该文档内的「概念宽度」</strong>。例如，我们衡量每项专利的概念广度， 可以从两个简单的文档开始，</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">doc1 =  [&#34;biochemistry&#34;, &#34;chemistry&#34;, &#34;analytical_chemistry&#34;]
doc2 =  [&#34;chemistry&#34;, &#34;oceanography&#34;, &#34;computer&#34;]
</code></pre></div><p>使用我们的专利嵌入模型，我们得到第一组(doc1)的平均宽度为 29，第二组（doc2）平均宽度为 47。这表明第二组在概念上比第一组更广泛。</p>
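<p>这一“平均词间距离”的计算可以用 numpy 写成如下草图。词向量为假设的玩具数据（实际应取自训练好的专利嵌入模型），因此数值与正文的 29 和 47 不同，但大小关系方向一致：</p>

```python
import numpy as np
from itertools import combinations

# 假设的词向量查询表（实际中来自专利嵌入模型，例如 model.wv[word]）
wv = {
    "biochemistry":         np.array([1.0, 0.9, 0.1]),
    "chemistry":            np.array([0.9, 1.0, 0.2]),
    "analytical_chemistry": np.array([1.0, 0.8, 0.2]),
    "oceanography":         np.array([0.1, 0.9, 1.0]),
    "computer":             np.array([0.1, 0.1, 1.0]),
}

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def breadth(doc):
    """文档的概念广度：所有词对的平均余弦距离"""
    pairs = list(combinations(doc, 2))
    return sum(cosine_distance(wv[a], wv[b]) for a, b in pairs) / len(pairs)

doc1 = ["biochemistry", "chemistry", "analytical_chemistry"]
doc2 = ["chemistry", "oceanography", "computer"]
# breadth(doc2) > breadth(doc1)：第二个文档在概念上更广
```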
<p>当我们衡量文档集合而不是单词集合的概念广度时，同样的逻辑也适用。例如，我们想了解发明者团队的广度。在这种情况下，我们可以将团队中的每个发明人视为嵌入概念空间中的“文档”，参考<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#F1">图 1</a>，从下往上依次是词概念空间、发明人概念空间、团队概念空间、组织概念空间。如果一个发明人团队的成员已经在涉及纳米技术、生物技术和软件的概念空间领域发表过先前的专利，那么该团队在概念上将被认为比所有成员只发表过纳米技术专利的团队更广泛。即使所有发明人都将其公开的专利限制在一个类别内，该指标仍然会提供显著的变化。</p>
<p><img loading="lazy" src="img/figure-1.jpeg" alt=""  />
</p>
<br>
<h4 id="512--应用">5.1.2  应用</h4>
<p>这种概念广度的度量已在最近的工作中用于追踪各种理论构念。<strong><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B105">Lix 等人 (2022)</a> 衡量团队成员在参与软件项目不同阶段时的话语广度</strong>。<strong>他们能够追踪每个独特项目阶段概念参与的多样性，发现表现最好的团队有能力改变他们的认知以适应手头不断变化的任务：在提出新想法时表现出更大的话语广度，而在转向依赖协调的任务时表现出较低的广度。这种细粒度的知识参与概念很难用以前的文本分析方法来追踪</strong>。详细内容可阅读大邓近期推文 <a href="https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/">MS2022 | 使用语言差异性测量团队认知差异性</a> 。</p>
<p>另外，研究人员使用概念广度来追踪在线社区成员根据状态变化分配注意力的范围，发现状态和注意力广度之间存在 U 形关系（Aceves et al. 2022 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B2">）</a>。这些研究人员训练了 150 个知识领域的概念空间，从而能够追踪不同知识领域的相似注意力动态，从计算机编程和数学到育儿和园艺。由于他们有能力在数百个社区的文本中大规模部署算法，因此他们能够计算出超过 2000 万成员如何在这些问答社区上发布的 2300 万个问题中分配注意力。</p>
<p>其他工作在整个语言中实施了这种方法，追踪语言在所有知识领域具有更宽或更窄的概念空间的程度（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B1">Aceves 和 Evans 2021</a>）。使用圣经、电影字幕和以多种语言编写的政治文件等文本的并行翻译（包含相同的信息但以不同的语言编码），他们能够追踪概念在不同语言中相互关联的程度存在显着差异。他们发现，尽管一些语言将不同的概念子空间紧密地联系在一起，并将不同的概念领域编织在一起，但其他语言却稀疏且更加支离破碎，更强烈地分隔了不同的意义域。然后，他们观察概念空间的语言密度如何塑造数百种语言的真实对话和维基百科文章的概念广度。</p>
<p>所有三篇论文都为不同文献的研究开辟了新的理论途径，例证了该方法的潜力。如果没有概念空间的概念及其通过嵌入模型的表示，这些新的研究途径将很难实施。</p>
<br>
<h3 id="52-概念距离和相似度">5.2 概念距离和相似度</h3>
<h4 id="521-指标">5.2.1 指标</h4>
<p>当我们的分析重点在于集合内的元素时，前面描述的概念广度构念是相关的。当我们的分析重点是不同集合之间的关系时，可以使用相同的基础度量。在这种情况下，我们将指的是概念距离或相似性，而不是概念广度。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn14">13</a>形式上，如果我们有至少两个集合，每个集合中至少有一个元素，我们可以计算这些集合之间的<strong>概念距离，即每个集合的质心（多维平均值）之间的距离</strong>。最基本的是，我们可以计算两个各含一个单词的集合之间的概念距离，这无非是衡量这两个词之间的概念距离。随着元素数量和集合数量的增加，底层计算保持不变，但理论可能性的范围扩大。还可以通过训练文档嵌入模型来计算这种距离/相似性度量，该模型在嵌入空间中为每个文档分配一个向量，其权重按照单词共现的相同逻辑进行训练（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B96">Le 和 Mikolov 2014</a>），即把文档本身视为与文中所有单词共现的一个额外“单词”。</p>
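<p>集合之间的概念距离（质心之间的余弦距离）可以写成如下最小示意，向量均为假设的玩具数据：</p>

```python
import numpy as np

def centroid(vectors):
    return np.asarray(vectors, dtype=float).mean(axis=0)

def concept_distance(set_a, set_b):
    """两个集合之间的概念距离：质心之间的余弦距离"""
    a, b = centroid(set_a), centroid(set_b)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 每个集合可以是一项专利、一位发明人或一个团队名下的词向量
patent_a = [[1.0, 0.1], [0.9, 0.2]]
patent_b = [[0.1, 1.0], [0.2, 0.9]]
patent_c = [[1.0, 0.2], [0.8, 0.1]]
# patent_a 与 patent_c 的距离应小于 patent_a 与 patent_b 的距离
```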
<p>通过将概念相似性与衡量专利相似性的现有技术进行比较，我们可以一睹该衡量标准的潜力。首先，研究人员可以通过查看专利授予机构使用的官方分类来追踪专利的相似性，同一类别的专利被认为比不同类别的专利更相似（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B157">Singh 和 Marx 2013</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B3">Aharonson 和 Schilling 2016</a>）。这种方法的局限性在于分类度量是粗粒度的，并且不太可能考虑所有相关的技术特征，特别是当类别边界必然滞后于技术演化时（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B172">Thompson 和 Fox-Kean 2005</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B155">Singh 和 Agrawal 2011</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B7">Arts 等人，2018</a>）。其次，研究人员可以获取两项专利并测量它们之间的单词重叠（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B7">Arts et al. 2018</a>）。然而，这种方法是有限的，因为它仅适用于成对的文档，无法确定专利相对于整个知识体系的位置。</p>
<p>概念相似性解决了这些限制（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184">Whalen 等人，2020</a>）。首先，它允许我们追踪专利在相关知识空间中的精确位置，从而访问知识系统中的所有相关的细粒度信息。其次，我们能够精确量化任何专利或专利组相对于任何其他专利或专利组的位置。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn15">14</a>第三，随着新知识进入系统，知识的性质和结构不断演变，随着时间的推移重塑 <strong>概念边界</strong> 和关联。<strong>嵌入使我们能够衡量专利发布时存在的概念空间内的专利相似性，使我们能够摆脱使用滞后的、周期性偏离的类别，并可能对连续的发明概念空间强加类别差异</strong>。概念距离的所有这些优点都适用于其他知识和文化领域，在这些领域中，我们寻求测量思想、个人、群体或组织之间的距离或相似性，从而扩展现有的跨研究领域并开辟新的理论领域。</p>
<h4 id="522-应用">5.2.2 应用</h4>
<p>正如我们上面所做的那样，这种<strong>概念相似性的衡量方法最近被用来描述专利数据中的创新空间</strong>（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B184">Whalen 等人，2020</a>）。研究人员使用  <strong>doc2vec</strong>  框架计算了超过 6 亿个专利对的相似度。在生成这些知识相似性度量时，作者还使用这些分数提出了有趣的辅助度量，包括可操作的度量（a）现有技术接近度——专利引用与其自身相似或不相似的现有技术的程度，（b）现有技术同质性——一项专利引用知识空间领域彼此远离的程度，(c) 影响邻近性——一项专利被与其自身相似或不相似的未来专利引用的程度，以及(d) 影响同质性——一项专利通过其前向引用与一组不同的未来专利相关的程度。</p>
<p>学者们也使用了这一衡量标准，重点关注概念距离。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B18">Burtch 等人 (2021)</a> 使用概念距离的 <strong>doc2vec</strong> 实现来调查同行奖励是否会影响在线社区内贡献的新颖性。这里的<strong>新颖性是根据社区成员获奖前后贡献的距离来衡量的</strong>。作者发现，获奖后，奖项会导致知识空间内的新颖性减少、利用式（exploitation）行为增多。同样，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B74">Hofstra 等人 (2020)</a> 使用 Word2Vec 距离度量来捕获科学论文将新颖性引入科学文献的程度，发现来自代表性不足群体的学生将最多的新颖性引入了系统。</p>
<p>其他人则利用这一措施来测量公司差异化。在发展中国家微型企业的背景下，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20">Carlson（2022）</a>使用 BERT 架构（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35">Devlin et al. 2019</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B142">Reimers and Gurevych 2020</a>）来计算其数据集中所有微型企业的成对余弦距离。通过这些距离，他们能够估计出，在八个发展中国家的 10,000 家微型企业中，差异化与收入和利润的增加相关。同样，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B61">Guzman 和 Li（2023）</a>使用距离的 doc2vec 实现，基于 Crunchbase 数据来衡量初创公司创立时的战略差异化。作者发现了与 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B20">Carlson (2022)</a> 类似的结果，即实现差异化的新公司在早期融资和股权结果方面表现更好。</p>
<br>
<h3 id="53-概念x">5.3 概念X</h3>
<h4 id="531-指标">5.3.1 指标</h4>
<p>文档距离的另一个用途是追踪语料库中的任何文档与捕获感兴趣构念 X 的焦点（原型）之间的相似性，这样的测量将捕获任何观察的 <strong>概念X性</strong>（Conceptual X-ness）。这种测量的第一步是描述与我们寻求尽可能精确测量的构念相关的概念信息。例如，如果我们想要捕获专利与 <strong>时间</strong> 或 <strong>几何</strong> 等概念相关的程度，我们可以构建一个我们认为映射到、定义或与这些概念相关的单词列表。对于每个列表，我们计算其质心向量，然后测量任何给定专利距离 <strong>时间</strong> 和 <strong>几何</strong> 概念有多远。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#fn16">15</a> 对于附录表 A2 中使用的专利，我们可以看到，相对于头颈约束装置专利，前两项专利更接近时间概念，这与光和时间在概念上相互交织的程度相符。概念性的<em>X</em>度度量可用于追踪思想、个人、团体、组织或任何其他相关聚集的组成。</p>
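<p>概念X性的计算只需把“文档质心到概念词表质心的相似度”写出来即可。以下为假设玩具向量上的最小示意：</p>

```python
import numpy as np

def centroid(vecs):
    return np.asarray(vecs, dtype=float).mean(axis=0)

def x_ness(doc_vecs, concept_vecs):
    """概念X性：文档质心与概念词表质心的余弦相似度"""
    d, c = centroid(doc_vecs), centroid(concept_vecs)
    return float(d @ c / (np.linalg.norm(d) * np.linalg.norm(c)))

# 假设的玩具向量：time_words 定义“时间”概念
time_words      = [[1.0, 0.0], [0.9, 0.1]]
patent_optics   = [[0.8, 0.3], [0.9, 0.2]]   # 假想的与时间概念较近的专利
patent_headrest = [[0.1, 0.9], [0.2, 1.0]]   # 假想的头颈约束装置专利
# x_ness(patent_optics, time_words) 应大于 x_ness(patent_headrest, time_words)
```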
<h4 id="532-应用">5.3.2 应用</h4>
<p>最近在一篇论文中使用了这种方法，该论文追踪了雇用女性担任高级领导角色对女性在这些组织中意味着什么的影响（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B94">Lawson 等人，2022</a>）。作者首先使用 SEC 文件和财报电话会议记录训练了 Word2Vec 嵌入。然后，他们创建并验证了一组 100 个单词来捕捉 <strong>代理概念</strong> 的含义（例如，有能力、独立、主导），并观察了内部任命高级女性领导前后 <strong>代理概念</strong> 与 <strong>女性</strong> 概念之间的距离。作者发现，在 <strong>女性</strong> 被任命为高层管理人员之后的一段时间内，女性的含义在概念空间中更加接近代理概念。作者使用不同的嵌入超参数和维度大小复制了他们的结果，说明了嵌入模型的鲁棒性，前提是具有捕获概念空间内语义变化所需的最小维度数。</p>
<p>这里有趣的理论机会包括更深入地参与那些在组织科学以外领域具有影响力的理论传统：由于缺乏以有原则的方式量化其理论构念的可行方法，这些传统迄今仍停留在我们的领域之外，只能依赖文学式的解释。正如我们所提出的，测量 <strong>概念X性</strong> 使得以一致、有原则和可复制的方式，扩大与理想形式（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B134">Plato Bloom 1968</a>）、理想类型（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B178">Weber 2011</a>）、家族相似性（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B186">Wittgenstein 2010</a>）和原型（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B147">Rosch 1973</a>）相关的理论构念的测量成为可能。在这方面，概念性的<em>X</em>性能够将大量认知和社会理论开放出来，以便在组织背景下进行实证检验和扩展。</p>
<br>
<h3 id="54-语义转变和漂移">5.4 语义转变和漂移</h3>
<h4 id="541-指标">5.4.1 指标</h4>
<p>概念空间使我们能够识别术语的含义如何随着时间和空间的变化而变化。探索概念意义的一种方法是为不同的个人、公司、行业、地理位置或时间段创建独特的嵌入模型，以了解它们之间的含义有何不同（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B148">Roy 等人，2019 年</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B181">Welch 等人，2020a</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B182">b</a>）。一旦识别出相关的兴趣分歧，我们就可以采用相关的语料库（例如，专利、财报电话会议、报纸）并为数据中的每个语料库训练概念空间。<strong>在我们的专利示例中，我们可能会训练两种嵌入模型，一种是 1990 年功能性磁共振成像技术发明之前的时期，另一种是 1990 年之后的时期</strong>。然后我们可以探索与大脑和神经科学相关的概念的含义如何随着这一创新而改变。例如，在功能性磁共振成像发明之前和之后与不同大脑区域最相关的术语是什么。接下来，我们可以比较不同公司或国家的含义变化有何不同，以及这种变化的格局如何影响所涉及的公司和行业的组织和市场结果。显式动态词嵌入允许嵌入之间具有更大的可比性，但必然会忽略特殊的词和用途（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63">Hamilton et al. 2016</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B192">Zhang et al. 2016</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B190">Yao et al. 2018</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B103">Liu et al. 2020</a>）。这些算法的输出带有时间戳词向量包含特定时期的语义信息，但在历史上保持可比性。</p>
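<p>比较两个时期的嵌入模型之前，通常需要先把它们对齐到同一坐标系。下面用 numpy 示意 Hamilton 等人所用的正交 Procrustes 对齐思路（玩具数据：两个“时期”的矩阵只差一个人为施加的旋转，因此对齐后漂移应接近 0）：</p>

```python
import numpy as np

def align(base, other):
    """正交 Procrustes：求正交矩阵 R 使 other @ R 最接近 base（行为同一批词）"""
    u, _, vt = np.linalg.svd(other.T @ base)
    return u @ vt

rng = np.random.default_rng(0)
early = rng.normal(size=(5, 3))                      # “早期”嵌入矩阵
rotation = np.linalg.qr(rng.normal(size=(3, 3)))[0]  # 人为施加的坐标变换
late = early @ rotation                              # “晚期”：含义不变，仅坐标系不同

aligned = late @ align(early, late)
drift = np.linalg.norm(aligned - early, axis=1)      # 对齐后每个词的“语义漂移”
# 此处含义未变，漂移应接近 0；真实数据中漂移大的词即含义变化大的词
```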
<br>
<h4 id="54-2-应用">5.4. 2 应用</h4>
<p>第一批在社会科学背景下使用词嵌入方法的主要论文，就是使用这种方法来研究意义（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63">Hamilton et al. 2016</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19">Caliskan et al. 2017</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49">Garg et al. 2018</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski et al. 2019</a>）。在第一篇论文中，研究人员使用四种语言的六个历史语料库，通过观察概念空间中最近的单词在过去几十年中如何变化来追踪单词含义随时间的变化（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B63">Hamilton et al. 2016</a>）。使用 Word2Vec 嵌入，他们追踪了 <strong>同性恋</strong> 概念的含义如何从 1900 年代围绕 <strong>“愚蠢”</strong>、<strong>“甜蜜”</strong> 和 <strong>“开朗”</strong> 等术语的含义，转变为围绕 1950 年代 <strong>“嬉闹”</strong>、<strong>“机智”</strong> 和 <strong>“聪明”</strong> 等术语的含义，并最终以 20 世纪 90 年代女同性恋、双性恋和同性恋等术语的含义结束。在另一篇论文中，研究人员研究了词嵌入中的刻板关联及其与当代社会经验数据的关系（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B19">Caliskan et al. 2017</a>）。例如，他们追踪了职业的性别刻板印象，发现职业具有女性意义的程度与女性参与相应劳动力市场的程度相关。在另一项研究中，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B49">Garg 等人 (2018)</a> 使用预先训练的 Google News Word2Vec 模型（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B4">Google 2013</a>）量化了美国 100 多年历史中的性别和种族刻板印象，阐明了不同的形容词和职业如何随着时间的推移或多或少地与不同人群（例如，男性与女性、白人与亚裔与西班牙裔）密切关联。</p>
<p>最近通过词嵌入追踪含义的工作已经使用这种方法更深入地研究了特定的上下文。一项研究使用 19 世纪第一人称叙述的语料库来追踪黑人和白人男性和女性的交叉身份如何映射到五个社会机构，包括政治、经济、文化、家庭领域和权威关系（Nelson 2021 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B120">）</a>。<strong>举论文中的一个例子，作者测量了与“精致”概念的距离，发现它与白人女性的联系最密切，而与黑人男性的联系最少</strong>。</p>
<p>在其他工作中，研究人员利用这种方法来衡量政治领导人的 <strong>集体意向性</strong>（人们参与集体推理和行动的能力），并比较共和党和民主党领导人如何以不同的方式动员集体意向性（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B85">Kirgil and Voyer 2022</a>）。他们通过创建复数代词（我们，我们的）、复数常量（国家名称）和复数名词（人）的复合列表来测量集体意向性。然后，使用词嵌入模型，他们找到了与各州集体意向向量最接近的术语，使他们能够比较不同领导人如何不同地动员集体意向。总的来说，这些意义研究表明，就语言为我们提供了解文化的窗口而言（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B55">Goldberg et al. 2016</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B165">Srivastava et al. 2018</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B26">Corritore et al. 2020</a>），嵌入模型为我们提供了透过那扇窗户所见景象的一幅独特图像。</p>
<br>
<h3 id="55-文化和知识连续性中的概念地位">5.5 文化和知识连续性中的概念地位</h3>
<h4 id="551-指标">5.5.1 指标</h4>
<p>另一种新颖的测量方法，可以通过追踪概念相对于某个感兴趣的概念维度的位置来构建。如前所述，嵌入模型可用于解决类比推理任务，例如 <strong>“国王”-“男人”+“女人”=“女王”</strong>。该架构可用于在概念空间内定义任何感兴趣的维度。<strong>在国王-王后的例子中，性别维度由“男人”-“女人”和“国王”-“女王”这样的向量差来操作化。</strong><a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski 等人（2019）</a>详细介绍了如何在概念空间内构建此类维度。首先，研究人员需要确定感兴趣的维度。在这里的例子中，我们将把不同的概念投射到男性-女性这一性别维度上。<strong>为此，我们首先确定定义性别维度的相关词语</strong>。这里我们使用集合 [&lsquo;man&rsquo;, &lsquo;him&rsquo;, &lsquo;he&rsquo;, &lsquo;male&rsquo;, &lsquo;men&rsquo;] 和 [&lsquo;woman&rsquo;, &lsquo;her&rsquo;, &lsquo;she&rsquo;, &lsquo;female&rsquo;, &lsquo;women&rsquo;]。<strong>然后我们计算不同概念在这条男性-女性概念轴(维度)上的正交投影。</strong> 在线附录中的图 A4 将每个概念投射到 <strong>男性-女性概念轴</strong> 上。投影值越负，表示与女性气质的关联越强；越正，则表示与男性气质的关联越强。如图 A4 所示，这些投影与人们对这些概念性别属性的一般直觉一致，使我们能够明确说明每个概念相对于其他概念在该维度上的位置。正如预期的那样，<strong>军事</strong> 和 <strong>农业</strong> 与 <strong>男性气质</strong> 的联系最为密切，而 <strong>卫生棉条</strong> 和 <strong>口红</strong> 则与女性气质的联系最为密切。按照这个程序，学者们现在可以测量任何概念在任何感兴趣的维度上、以及任何文本丰富的时空背景中的位置。此外，不同语言的语料库可以分别训练后对齐，或联合训练，以便进行跨国分析（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B81">Johnson et al. 2017</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116">Milbauer et al. 2021</a>）。</p>
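<p>上述“概念轴投影”可以用 numpy 写成一个最小示意（以下为玩具向量，极点词表也压缩为单个词，仅演示计算流程；实际研究中应使用在大规模语料上训练的嵌入和完整的极点词集合）：</p>

```python
import numpy as np

def build_axis(wv, pole_a, pole_b):
    """概念轴 = 两组极点词质心之差（如 男性词组 - 女性词组），并归一化。"""
    a = np.mean([wv[w] for w in pole_a], axis=0)
    b = np.mean([wv[w] for w in pole_b], axis=0)
    axis = a - b
    return axis / np.linalg.norm(axis)

def project(wv, word, axis):
    """词向量在概念轴上的标量投影：正值偏向 pole_a（男性端），负值偏向 pole_b（女性端）。"""
    v = wv[word]
    return float(np.dot(v, axis) / np.linalg.norm(v))

# 玩具向量：military 偏向男性一端，lipstick 偏向女性一端
wv = {"man": np.array([1.0, 0.0]), "woman": np.array([-1.0, 0.0]),
      "military": np.array([0.8, 0.4]), "lipstick": np.array([-0.9, 0.3])}
axis = build_axis(wv, ["man"], ["woman"])
```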
<p><strong>双极概念维度的投影方法可以进一步扩展：用多种含义来锚定一个低维子空间，单词和概念可以被绘制在其中，并被理解为这些含义的混合</strong>。具体做法是从理论上选择一组“原型”，即具有已知且被广泛共享含义的极值点，并在这些极值点所锚定的子空间中绘制所有相关单词或概念。例如，在对一个新的基于信息技术的创业企业进行分类时，人们可能会问它在由 Uber、亚马逊、谷歌或比特币所刻画的空间中处于什么位置（Breiman 1994，Eugster 2012，Damle 和 Sun 2017）。</p>
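<p>“用原型锚定低维子空间”的思路，可以理解为对原型向量做最小二乘分解，得到某个概念在各原型含义上的混合权重。下面是一个假设性的 numpy 示意（原型向量为虚构玩具数据）：</p>

```python
import numpy as np

def mixture_coords(v, prototypes):
    """用最小二乘求向量 v 在原型向量张成的子空间中的坐标（混合权重）。"""
    # prototypes: 形状 (k, d) 的矩阵，每行是一个原型向量
    coeffs, *_ = np.linalg.lstsq(prototypes.T, v, rcond=None)
    return coeffs

# 两个虚构的“原型”向量张成一个 2 维子空间
prototypes = np.array([[1.0, 0.0, 0.0],   # 原型 A
                       [0.0, 1.0, 0.0]])  # 原型 B
v = np.array([0.7, 0.3, 0.0])  # 某个新概念的向量
```

<p>此例中 v 可解释为 0.7 份“原型 A”与 0.3 份“原型 B”的混合。</p>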
<br>
<h4 id="552-应用">5.5.2 应用</h4>
<p>这一指标的提出和运用，最初是为了研究 20 世纪和 21 世纪社会阶层的演变（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski 等人，2019</a>）。他们研究了基于 20 世纪出版的数百万本书的文本训练出的嵌入，按照上述程序操作化了阶层的各个维度，试图了解社会阶层的底层维度在 20 世纪是如何变化的。为此，他们提出了以下理论上的<strong>概念轴(维度)</strong>：富裕程度（富人与穷人）、教育程度（受过教育与未受教育）、修养（有修养与无修养）、地位（有声望与无声望）、道德（善与恶）、就业（雇主与雇员）和性别（男人与女人），并分别在 20 世纪每个十年的嵌入中加以构建。然后，他们把不同类别的概念（例如音乐风格、体育和职业）投射到这些维度上，以了解这些概念在本世纪的进程中如何演变。研究人员还应用这种方法，研究了健康与道德（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B6">Arseniev-Koehler et al. 2022</a>）、政治意识形态（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B171">Taylor and Stoltz 2021</a>）和地位（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B129">Peng et al. 2021</a>）等背景下的其他类型的文化关联。</p>
<p>研究人员不仅将概念投射到这些概念轴(维度)上，还将整个文档投射到这些维度上，从而拓展了测量的可能性（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B170">Taylor 和 Stoltz 2020</a>）。此外，尽管以前的指标依赖研究人员自行指定感兴趣的连续维度，最近的工作已转向自动识别这些维度（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116">Milbauer et al. 2021</a>）。<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B116">Milbauer 等人（2021）</a>利用 Reddit 社区的内容，创建了一个无监督程序来识别社区中的多个意识形态极点，使他们能够超越静态的左-右意识形态维度，发现现代话语中发挥作用的许多两极分化和意识形态分歧的轴线。人们可以想象在许多组织环境中使用这种方法，来识别团队、小组、单位或部门之间存在的诸多潜在冲突来源。</p>
<br>
<h3 id="56-概念维度">5.6 概念维度</h3>
<p>之前，我们讨论了研究人员如何考察关键术语在相关文化维度上的位置，例如描述不同概念在性别维度上的位置差异。然而，这并不是概念轴(维度)的唯一用途：概念空间还允许我们测量和理解相关维度本身如何相互关联。该指标的一种扩展，是利用空间内已编码的维度并将它们相互比较。例如，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B89">Kozlowski 等人（2019）</a>利用他们已构建的阶层维度，追踪整个 20 世纪中每个维度与其他维度的关联如何变化，例如发现随着世纪的推进，富裕与教育的关联变得更加紧密，而与修养的关联逐渐脱钩。通过这种方式，组织学者可以理解相关维度之间的关系在不同的概念空间中可能有何差异。例如，学者可以研究不同文化维度在组织或行业内部和之间紧密或松散耦合的程度。</p>
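<p>“维度之间的关联”本身可以直接计算：两条概念轴都是向量，它们的余弦相似度即可衡量两种文化维度的耦合程度。一个玩具示意如下（轴向量为虚构数据，仅演示计算）：</p>

```python
import numpy as np

def axis_similarity(axis1, axis2):
    """两条概念轴之间的余弦相似度，衡量两种文化维度的关联（耦合）程度。"""
    return float(np.dot(axis1, axis2) /
                 (np.linalg.norm(axis1) * np.linalg.norm(axis2)))

# 玩具示例：假想 1900 年代与 1990 年代的“富裕轴”和“教育轴”
affluence_1900, education_1900 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
affluence_1990, education_1990 = np.array([1.0, 0.2]), np.array([0.8, 0.6])
```

<p>在此玩具数据中，两条轴在 1990 年代的余弦相似度高于 1900 年代，对应“富裕与教育的关联变得更紧密”这一类结论的计算方式。</p>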
<p><br><br></p>
<h2 id="六讨论">六、讨论</h2>
<p>最后，我们简要讨论一些利用嵌入模型的新兴拓展方法，然后讨论我们认为在理论、方法论和组织实践层面有价值的机会——这些机会源于将这些模型理解为概念空间的细粒度表示。这一讨论必然只是示例性的，但已暗示了这些精细的意义模型在可操作化之后的广阔可能性。</p>
<h3 id="61-词嵌入方法的富有成果的扩展">6.1 词嵌入方法的富有成果的扩展</h3>
<p>词嵌入的底层计算架构最近经历了若干扩展，可以在与前文不同的方向上推动组织研究。我们简要提及其中三个，并在在线附录中提供更详细的描述。首先，概念和语言的层次结构在“平直”的欧几里得几何中很难得到体现，用标准嵌入捕获它需要许多难以解释的维度。然而，层次结构可以用负曲率的双曲嵌入来原生表示（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B90">Krioukov et al. 2010</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B126">Papadopoulos et al. 2012</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B22">Chamberlain et al. 2017</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B121">Nickel and Kiela 2017</a>），为探索复杂现代组织的交叉层次结构提供了新的测量可能性。例如，将公司名称嵌入双曲空间，将能够直接发现商业新闻语料库中典型的“中心公司”，并将其与所有其他公司进行比较；额外的双曲维度将揭示子层次结构，反映商业评论员所持有的概念和比较价值的不同维度。</p>
<p>其次，语言建模的深度学习方法为词嵌入增加了关键的上下文敏感性。考虑像 BERT（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B35">Devlin et al. 2019</a>）和 GPT 系列模型（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B140">Radford et al. 2019</a>）这样的大规模模型：它们使用“注意力”这一神经网络机制来识别影响焦点词含义的上下文词（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B176">Vaswani et al. 2017</a>），并组装成一种称为 Transformer 的架构，可以把问题转换为答案、把文本转换为译文、把请求转换为回应。这类模型产生的表示可以被描述为上下文嵌入：每个单词不再由单个向量表示，而是由一团向量表示，其中每个向量对应该词在某种上下文中的含义。“Google”语境中的“Apple”与“orange”语境中的“apple”具有不同的向量值。这些模型极大地提高了预测能力，并进一步扩展了我们对概念空间进行精确建模的能力，但代价是更高的复杂性和计算量。</p>
<p>最后，嵌入架构可以扩展到按序列或更高维上下文排列的任意符号集合。例如，图像嵌入已被用来衡量抽象艺术图像的新颖性和创造力（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B10">Banerjee and Ingram 2022</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B11">Banerjee and Kaplan 2022</a>），也被用来分析警方拘留照片（大头照），识别出与法官在保释听证会上拒绝保释相关的、此前未被概念化的涌现特征（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B107">Ludwig 和 Mullainathan 2022</a>）。音乐（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B101">Liang et al. 2020</a>）、音频剪辑（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B70">Hershey et al. 2017</a>、<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B189">Xie and Virtanen 2019</a>）和视频（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B191">Zellers et al. 2021</a>）的多维空间已经使用 audio2vec（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B168">Taglisacchi et al. 2020</a>）、signal2vec（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B118">Nalmpantis 和 Vrakas 2019</a>）和 video2vec（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B62">Habibian 等人 2017</a>）等工具构建，为组织学者接触代表组织生活视听体验的新型媒体打开了大门。</p>
<p>最近对双曲线、上下文、图像和音频嵌入的扩展表明，嵌入模型的底层计算框架的持续改进和扩展将继续下去，为组织科学中持续的实证、测量和理论创新奠定了基础。</p>
<br>
<h3 id="62-词嵌入和组织理论">6.2 词嵌入和组织理论</h3>
<p>在理论层面上，将嵌入模型理解为概念空间的有原则的、细粒度的表示，有可能刺激新的理论发展并完善现有理论。例如，意义研究中的经典论述影响了文学理论和文化社会学等其他领域，却未能在组织科学中站稳脚跟。自 20 世纪初德·索绪尔（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B32">de Saussure 1986</a>）的著作带来语言学的结构转向以来，许多人都试图将意义在组织和社会生活中的作用理论化。列维-斯特劳斯汇集了来自全球各地多样而广泛的民族志，从世界文化表面的混乱中提炼出深层的文化秩序，并认为复杂的意义产生于有意义元素的组合（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B99">列维-斯特劳斯 2016</a>）。福柯将话语与权力如何紧密相连、权力与知识如何结成自我强化的联盟加以理论化（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B46">Foucault 2012</a>）。布迪厄将<em>惯习</em>的概念阐述为“持久的、可转换的禀性系统，倾向于作为起结构化作用的被结构化的结构发挥功能，即作为实践之生成和结构化的原则”（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B16">Bourdieu 1977，第 72 页</a>）。尽管这些理论很有吸引力，但迄今为止它们只能得到松散且间接的检验。没有可靠的实证立足点，它们就始终无法在管理和组织理论中取得突出地位。然而，概念空间的实证操作化如今使得与这些文化理论奠基之作的对话和拓展变得可行，其中的许多构念现在可以得到站得住脚的测量。嵌入模型将使这些理论与管理和组织理论重新关联起来。</p>
<p>我们还希望嵌入模型能够推动对现有理论框架更深入的研究和锐化。其中一组能够受益的文献是与知识相关的文献。鉴于组织学者所能接触到的大部分知识都编码在语言这一符号概念系统中，这些知识如今既可以通过日益丰富的文本数据源获取，也可以通过嵌入模型的概念空间加以表示。材料科学领域的最新工作已经使用此类模型有效预测未来的知识发现，比科学家实际提出这些发现早几十年（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B173">Tshitoyan 等人，2019 年</a>；<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B161">Sourati 和 Evans，2021 年</a>）。其他工作表明，这些发现可以推广到生物和物理科学（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B154">Shi 和 Evans 2023</a>）。概念空间的明确表示，使我们能够对整个社会系统中知识的特征和结构进行细致的考察。一方面，这些模型像望远镜一样打开了知识的天空，使其大尺度结构变得可见，以供研究、理论发展和完善。另一方面，这些模型又像显微镜，使我们能够更深入地观察构成更大知识系统的意义的原子结构。测量上的这一进步，将丰富对定义了人类和组织经验的大型多维知识系统中各种机制的检验。它还将使我们能够递归地审视管理与组织学术研究自身的知识，从而激发创新。</p>
<br>
<h3 id="63-词嵌入和实证研究">6.3 词嵌入和实证研究</h3>
<p>在<em>实证层面</em>，词嵌入模型可以提高组织科学不同领域的测量保真度，从而在实证结果与理论主张和框架之间实现更好的映射。我们用团队和群体内部多样性研究的例子来说明这一点。多样化群体所获得的许多好处，据认为源于群体中的个人对问题和解决方案的表征方式各不相同（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B75">Hong 和 Page 2004</a>）。由具有不同方法的个人组成的小组将更好地执行各种任务，因为他们拥有更广泛的知识、观点和可资借鉴的信息资源（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B27">Cox et al. 1991</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B185">Williams and O&rsquo;Reilly 1998</a>）。然而，由于测量困难，对团队多样性的研究很少直接测量问题和解决方案空间中的概念差异。相反，它假设解决问题的团队成员的身份多样性（人口、文化、种族或经验）与其功能多样性（团队成员如何表征和解决问题）之间存在联系（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B122">Nisbett 和 Ross 1980</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B75">Hong 和 Page 2004</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175">van Dijk 等人，2017</a>）。</p>
<p>由于缺乏高保真方法来获取团队成员在问题和解决方案概念空间中的位置，身份多样性和功能多样性之间的联系通常只是被假定。操作化研究所用的身份多样性与理论化所用的功能多样性之间脱节的一个重要后果是：虽然理论积极使用功能多样性的思想和术语（其本质是几何的、高维的），检验却依赖与身份类别成员资格相关的集合论概念。我们推测，这可以解释团队多样性文献（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175">van Dijk et al. 2017</a>）结果中的大部分含混之处，因为研究设计忽视了功能多样性和身份多样性之间并不同构。然而，诸如概念广度之类的度量可以阐明这一理论交叉点上悬而未决的问题。我们现在可以确定（1）团队的基本概念广度，以及（2）这种基本广度在多大程度上驱动了结果。解决这些问题可以为许多分析层面的研究提供信息，从个人和团队的成功（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B164">Srikanth 等人，2016 年</a>，<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B175">van Dijk 等人，2017 年</a>）到公司和行业绩效（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B144">Roberson 等人，2017 年</a>）。我们希望这一示例能够激发在组织研究各领域生成细粒度意义测量的新可能性。</p>
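<p>“团队概念广度”可以给出一种假设性的操作化：把每位成员表示为一个语义向量（例如其发言词向量的平均），再取成员向量两两余弦距离的均值。numpy 玩具示意如下（向量为虚构数据，并非论文给出的正式定义）：</p>

```python
import numpy as np

def conceptual_breadth(member_vectors):
    """概念广度的一种假设性度量：成员语义向量两两余弦距离(1 - 余弦相似度)的均值。"""
    n = len(member_vectors)
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            u, v = member_vectors[i], member_vectors[j]
            sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            dists.append(1.0 - sim)
    return float(np.mean(dists))

# 玩具向量：同质团队的成员向量方向接近，多元团队的成员向量方向分散
team_homogeneous = [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([0.9, 0.0])]
team_diverse = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
```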
<br>
<h3 id="64-组织内部的词嵌入">6.4 组织内部的词嵌入</h3>
<p><strong>最后，我们认为词嵌入方法将对我们所研究的组织本身产生影响</strong>。我们以劳动力市场为背景，说明嵌入模型在塑造组织行为方面的潜力。从招聘到工作设计，从培训到晋升，人力资源管理的一个核心挑战是有效地将个人与组织内的角色、工作、情境和任务相匹配（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B183">Weller et al. 2019</a>）。随着匹配质量的提高，各种绩效指标也会提高，包括工作满意度（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B8">Ashforth 和 Saks 1996</a>）、个人生产力（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B125">Paauwe 2009</a>）和组织绩效（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B38">Dyer 和 Reeves 1995</a>）。有效匹配的一个难题在于不同维度在匹配中的重要程度各不相同。在一家公司中，技能可能最为重要，而在另一些公司中，重要的可能是文化契合度、态度、技能和经验的相互作用。由于嵌入模型能够捕获所有这些维度，管理者可以为每个相关维度嵌入不同的原型描述，同时嵌入个人资料和其他相关通信（例如电子邮件、Slack 消息等），以衡量每个人与每个相关维度之间的匹配接近度。这样做可以让管理者更好地识别高维匹配及其对员工、社区和公司绩效的影响。</p>
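<p>上述“人-角色匹配”的核心计算可以示意如下：把员工画像与各角色原型描述分别嵌入为向量（例如取其文本词向量的平均），再用余弦相似度衡量匹配接近度。以下向量均为虚构玩具数据：</p>

```python
import numpy as np

def match_score(person_vec, role_vec):
    """用余弦相似度衡量员工画像向量与角色原型向量的匹配接近度（假设性示意）。"""
    return float(np.dot(person_vec, role_vec) /
                 (np.linalg.norm(person_vec) * np.linalg.norm(role_vec)))

# 玩具向量：假想由个人资料/沟通文本得到的员工画像，
# 以及由两类角色原型描述得到的角色向量
person = np.array([0.9, 0.1, 0.2])
role_engineer = np.array([1.0, 0.0, 0.1])
role_designer = np.array([0.0, 1.0, 0.3])
```

<p>对每个相关维度分别构造一个角色/原型向量并计算此类得分，即可得到文中所说的“高维匹配”画像。</p>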
<p>嵌入模型有望为人力资源的宏观管理提供新的视角（<a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B183">Weller et al. 2019</a>）。大型组织的相关信息分散存储在人力资源经理、一线经理、员工、同事和外部招聘人员手中，却无法集中访问。通过嵌入，组织可以从所有数字痕迹的文本（电子邮件、聊天、职位描述、正式报告、绩效管理记录等）构建概念空间。这样做并辅以相似性分析，将使公司能够绘制并了解相关人力资本在公司各处的分布。管理人员可以利用这些系统，准确了解任何员工的概念位置与任何给定的公司需求之间的差距有多大。这不仅可以为招聘、雇用、员工流动等流程提供信息，还可以为培训、社会化、工作设计和公司重组提供信息。因此，在劳动力市场和组织适应的背景下，嵌入模型可以催生有用的创新。人们可以想象许多其他组织实践和结构也能从这些模型及其测量可能性中受益，包括产品设计、市场分析和战略生成。</p>
<p><br><br></p>
<h2 id="结论">结论</h2>
<p>我们赞同 <a href="https://pubsonline.informs.org/doi/full/10.1287/orsc.2023.1686#B65">Hannan 等人（2019，第 2 页）</a>的观察：考虑到概念和分类对几乎所有人类行为和社会互动的中心地位，人们对概念如何运作的关注竟如此之少，着实令人惊讶。现代组织内部及其周围进行的许多活动，都需要概念信息的激活和传播。当一个人解决新问题、提出新想法或与他人合作时，情况都是如此。从饮水机旁的善意闲聊，到重构全球资本主义秩序或将人类送上火星，概念及其所嵌入的概念空间都发挥着核心而关键的作用。</p>
<p>正如本文所示，我们现在拥有一系列重要的工具，可以为广泛而深入的理论想象和实证研究打开<strong>概念世界</strong>和<strong>概念空间</strong>。我们希望本文能够激发对嵌入可以提供信息的大量问题和理论的学术探索。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>MS2022 | 使用语言差异性测量团队认知差异性</title>
      <link>https://textdata.cn/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/</link>
      <pubDate>Thu, 02 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-11-02-measure-cognitive-diversity-through-language-discursive-diversity/</guid>
<description>&lt;p&gt;词嵌入在经管研究中的应用很多，但大多数是训练词嵌入模型，再依据词嵌入构建或扩展词典。今天我们将分享一篇用词嵌入测量团队认知多样性的论文。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/paper-cover-discursive-diversity.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h2 id=&#34;一研究&#34;&gt;一、研究&lt;/h2&gt;
&lt;p&gt;Lix, Katharina, Amir Goldberg, Sameer B. Srivastava, and Melissa A. Valentine. &amp;ldquo;&lt;strong&gt;Aligning differences: Discursive diversity and team performance.&lt;/strong&gt;&amp;rdquo; &lt;em&gt;Management Science&lt;/em&gt; 68, no. 11 (2022): 8430-8448.&lt;/p&gt;
&lt;h3 id=&#34;11-摘要&#34;&gt;1.1 摘要&lt;/h3&gt;
&lt;p&gt;团队中的认知多样性如何影响其绩效？先前的研究表明，团队的认知多样性存在绩效权衡：多样性团队在创造力和创新方面表现出色，但在协调行动方面则有困难。基于团队认知不是静态的、而是在动态互动中产生的观点，我们引入了 &lt;strong&gt;话语多样性&lt;/strong&gt; 的概念，这是团队认知多样性的一种表现，反映了在一组互动中团队成员传达的含义在多大程度上相互不同。&lt;strong&gt;我们提出，高绩效团队是那些具有调节共享认知以适应不断变化的任务要求的集体能力的团队：在进行构思任务时，它们表现出更高的话语多样性，在执行协调任务时，表现出较低的话语多样性&lt;/strong&gt;。我们进一步认为，表现出一致调节的团队——即，在成员针对不断变化的任务要求做出个人语义调整时团队层面方差较低的团队——比成员之间调节不一致的团队更有可能取得成功。我们利用 &lt;strong&gt;计算语言学&lt;/strong&gt; 工具来衡量话语多样性，并借助一组新型纵向数据，包括在线平台 &lt;a href=&#34;http://www.gigster.com&#34;&gt;www.gigster.com&lt;/a&gt; 上 117 个远程软件开发团队的团队内部电子通信和绩效结果，得出了对我们理论的支持。我们的研究结果表明，团队认知多样性的绩效权衡并非不可避免：团队可以通过将话语多样性水平与任务要求相匹配，并在进行这些调整时使成员保持一致，来应对这一权衡。&lt;/p&gt;
&lt;h3 id=&#34;12-创新点&#34;&gt;1.2 创新点&lt;/h3&gt;
&lt;p&gt;这篇论文的创新点主要包括以下几个方面：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;研究了团队内部的差异对团队绩效的影响&lt;/strong&gt;：该论文通过分析团队成员之间的差异，探讨了这些差异对团队绩效的影响。这一研究角度对于理解团队内部动态和绩效提升具有重要意义。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;引入了阶段性的话语差异概念&lt;/strong&gt;：论文提出了阶段性的话语差异概念，即团队成员在不同阶段的沟通中所表现出的差异。这一概念有助于更好地理解团队内部沟通的动态过程。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;探讨了团队内部沟通差异的调节作用&lt;/strong&gt;：论文研究了团队内部沟通差异与团队绩效之间的关系，并发现团队内部沟通差异在不同阶段对团队绩效的影响存在差异。这一发现为团队管理和绩效提升提供了重要的启示。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;结合了多个学科领域的理论和方法&lt;/strong&gt;：该论文综合运用了心理学、经济学和组织学等多个学科领域的理论和方法，从多个角度深入研究了团队内部差异和绩效之间的关系，为相关领域的研究提供了新的视角和方法。&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二文献梳理&#34;&gt;二、文献梳理&lt;/h2&gt;
&lt;h3 id=&#34;21-认知多样性&#34;&gt;2.1 认知多样性&lt;/h3&gt;
&lt;p&gt;认知多样性(cognitive diversity)对团队绩效的影响是一个长期存在的问题。以往的研究表明，团队的认知多样性存在绩效权衡：多样性团队在创造力和创新方面表现出色，但在协调行动方面存在困难。然而，&lt;strong&gt;最近的研究提出了一种新的观点，即团队的「认知多样性」可以通过调节团队的「共享认知」来实现绩效的平衡。这意味着团队可以根据任务要求调整其认知多样性的水平，以在创造性任务和协调任务之间找到平衡点&lt;/strong&gt;。高绩效团队具备调节团队认知的能力，使其能够在创造性任务中展现较高的认知多样性，在协调任务中展现较低的认知多样性。这种能力使团队能够在创新和执行之间找到平衡，从而提高绩效。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-话语多样性&#34;&gt;2.2 话语多样性&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;话语多样性(discursive diversity) 是指团队成员在交流和讨论中表达的观点、意见和想法的多样性程度。它反映了团队成员在思考和表达上的差异程度。话语多样性可以包括词汇选择、句子结构、表达方式等方面的差异&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;话语多样性对团队的协调行动有影响。在协调任务中，团队成员需要相互理解、协调行动，达成共识并共同努力实现共同目标。如果团队成员的话语多样性过高，意味着他们在表达观点和意见时存在较大的差异，这可能导致沟通困难、理解不一致和冲突的产生，从而影响团队的协调行动。&lt;/p&gt;
&lt;p&gt;因此，在协调任务中，团队成员的话语多样性应该相对较低，以便更好地理解和协调彼此的行动。相反，在创意和思考任务中，话语多样性可以促进团队成员的创新和思考，帮助他们从不同的角度和观点来解决问题，从而提高团队的创造力和创新能力。总之，话语多样性在团队中起着重要的作用，它需要根据任务的性质和要求进行调节，以实现团队的协调行动和创新能力。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-两者关系&#34;&gt;2.3 两者关系&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;在这篇论文中，话语多样性被用来衡量认知多样性&lt;/strong&gt;。研究人员使用计算语言学的工具来推导出话语多样性的度量，并将其应用于团队的电子沟通数据中。他们认为，团队的话语多样性可以反映成员之间的认知多样性，即在思维方式、知识和技能等方面的差异程度。通过分析团队的话语多样性，研究人员试图探索团队在不同任务要求下的表现，并研究团队如何调节共享认知以适应任务需求的变化。因此，话语多样性被视为一种衡量团队认知多样性的指标。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三数据及方法&#34;&gt;三、数据及方法&lt;/h2&gt;
&lt;h3 id=&#34;31-数据&#34;&gt;3.1 数据&lt;/h3&gt;
&lt;p&gt;Gigster(&lt;a href=&#34;http://www.gigster.com&#34;&gt;www.gigster.com&lt;/a&gt;),是一个在线平台， 自由软件开发人员可以在该平台上为个人和企业客户制作按需软件。该平台将个人自由职业开发人员组装成由团队领导领导的临时团队，并将他们分配给需要复杂、相互依赖的长期项目。该平台上的自由职业者分布在全球各地，从事从移动到网络应用程序开发的各种项目。这些项目通常是知识密集型的，需要高水平的创造力、技术问题解决能力和人际协调能力。软件项目规模巨大，成本从数万美元到数十万美元不等（极端情况下可达一百万美元以上）。&lt;/p&gt;
&lt;p&gt;我们的数据集由 117 个团队组成，代表 421 个不同的个体（36% 为女性），时间跨度从 2015 年初到 2017 年底。一个典型的团队有 5 名成员，其中包括一名项目经理；至少一名后端、前端或“全栈”工程师；设计师；和用户界面专家。根据项目类型，团队有时还包括作家、自然语言处理工程师和其他类型的专业人士。在我们数据中的团队中，项目平均持续 159 天（中位数：150 天），并分为平均持续两周的里程碑阶段（平均：14 天；中位数：14 天）。要加入该平台，专业人士必须通过旨在验证其专业知识的各种技术面试。平均而言，单个团队的成员代表 3.6 个国家/地区（中位数：3 个）。在我们的样本中，42% 的人将其原籍国列为北美。另外 13% 来自亚洲，其次是 12% 来自欧洲。其余 23% 居住在拉丁美洲、非洲和世界其他地区。&lt;/p&gt;
&lt;p&gt;由于地理位置分散且缺乏实体办公空间，团队成员几乎完全通过名为 Slack 的在线即时通讯工具进行沟通。我们可以访问整个团队的 Slack 档案——超过 800,000 条消息。每条消息都带有时间戳并可归因（通过匿名标识符）其作者。团队在整个生命周期中平均在公共渠道中交换 1,873 条 Slack 消息（中位数：1,220 条）。我们对 Gigster 的高级领导和团队领导进行了非正式采访，他们一致表示团队沟通几乎完全通过 Slack 进行。一位高级领导描述了其中的原因：“几乎所有团队对话都发生在 Slack 上。这是一个有用的工具，因为我们运营全球团队，而且 Slack 允许在一个平台内进行实时和异步通信。它还允许轻松地共享项目文件。” 多位知情人士强调，团队成员始终依赖 Slack，而不是其他工具，因为“一切都在一个地方”对于促进团队协作非常重要。知情人士还表示，团队成员有动力使用 Slack，因为它提供了团队流程和事件的透明档案，可用于对一些罕见的争议案例进行分类。&lt;/p&gt;
&lt;p&gt;除了 Slack 消息之外，我们还可以获得有关团队成员特征（职能角色、性别和原籍国）的数据，以及团队在实现各个项目里程碑方面的整体绩效。这些数据共同构成了团队内部动态和结果的丰富且连续的历史记录。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;32-计算话语多样性&#34;&gt;3.2 计算话语多样性&lt;/h3&gt;
&lt;p&gt;之前的工作表明，词嵌入模型对于捕获单词之间的语义关系特别有用。例如，Garg 等人 (2018) 证明，根据应用于 20 世纪出版的英语书籍的词嵌入模型推断出的不同职业的语义性别关联，与这些职业的历史性别构成相对应。同样，科兹洛夫斯基等人 (2019) 说明了不同的生活方式活动如何与阶级、种族和性别认同相关。因此，词嵌入为语言中包含的众多意义维度提供了全面且有意义的见解，而这是以前的方法无法捕获的。基于此，本论文使用词嵌入模型开发了话语多样性度量。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;我们首先对 Slack 数据进行预处理，并使用 Word2Vec（连续词袋词嵌入模型的流行实现）来训练词嵌入模型。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;按照标准实践，窗口大小设置为10， 维度设置成200来训练word2vec模型（&lt;a href=&#34;https://pubsonline.informs.org/doi/full/10.1287/mnsc.2021.4274#B54&#34;&gt;Mikolov 等人，2013&lt;/a&gt;）。从这个训练过程中，我们获得了语料库中每个单词的一个 200 维坐标向量，表示该单词在语义空间中的位置。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;窗口大小&lt;/strong&gt;: 每个词的上下文范围。 人阅读书籍，一般视野只有十来个词，逐行阅读。 跟人类似， 在计算机中训练词嵌入模型时候，数据不是一次性灌入习得词语的向量，而是像人一样是有上下文范围的，这个范围叫做窗口。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;如前所述，嵌入空间的维度表示训练语料库中语言使用的潜在特征。尽管这些维度本身不具有定性可解释的含义，但这些维度是提供信息的，因为具有更相似含义的单词彼此更接近。&lt;strong&gt;下图(a)是从聊天消息构建词嵌入向量的过程&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-compute-discursive-diverisity-with-embeddings.jpeg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;使用词嵌入模型，我们就可以把词语、每句话、某人某时期的话、某团队某时期的话、所有团队所有的话，通过一定的计算，都表征为 200 维的向量。&lt;/strong&gt;上图 (b) 展示了从聊天消息构建团队话语多样性得分(Discursive Diversity)的过程。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;通过公式1，计算出两个人差异性。通过公式2， 计算出团队话语多样性。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-formular-1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/04-formular-2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;假设团队只有三个人， 低话语多样性 与 高话语多样性， 分别对应下图的左侧和右侧。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-low-and-high-examples-of-discursive-diversity.jpeg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四python伪码&#34;&gt;四、Python伪码&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;discursive_diversity_score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# wv: 词嵌入模型; gensim.models.keyedvectors.KeyedVectors&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;# words: 一个时间窗口内的词语列表&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;# 计算词嵌入向量的平均值&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;embedding_vectors&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;word&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;centroid&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;embedding_vectors&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;# 计算各词向量与质心的余弦距离(1 - 余弦相似度)，距离越大表示语义越分散&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pairwise_distances&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;embedding&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linalg&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;norm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;centroid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linalg&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;norm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;embedding&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;embedding&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;embedding_vectors&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;# 计算语言多样性得分&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;diversity_score&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pairwise_distances&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;diversity_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;函数 &lt;em&gt;&lt;strong&gt;discursive_diversity_score&lt;/strong&gt;&lt;/em&gt; 已内置到 cntext 中。 对cntext 感兴趣，可阅读 &lt;a href=&#34;https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/&#34;&gt;文本分析库cntext使用手册&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;五词嵌入应用文献&#34;&gt;五、词嵌入应用文献&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;# 使用词嵌入技术构建词典
[1]胡楠, 薛付婧 and 王昊楠, 2021. **管理者短视主义影响企业长期投资吗———基于文本分析和机器学习**. *管理世界*, *37*(5), pp.139-156.    
[2]Kai Li, Feng Mai, Rui Shen, Xinyan Yan, **Measuring Corporate Culture Using Machine Learning**, *The Review of Financial Studies*,2020


# 使用词嵌入测量偏见(刻板印象)、认知
[3]Lawson, M. Asher, Ashley E. Martin, Imrul Huda, and Sandra C. Matz. &amp;#34;**Hiring women into senior leadership positions is associated with a reduction in gender stereotypes in organizational language.**&amp;#34; _Proceedings of the National Academy of Sciences_ 119, no. 9 (2022): e2026443119.
[4]Lix, Katharina, Amir Goldberg, Sameer B. Srivastava, and Melissa A. Valentine. &amp;#34;**Aligning differences: Discursive diversity and team performance.**&amp;#34; *Management Science* 68, no. 11 (2022): 8430-8448.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;cntext使用声明&#34;&gt;cntext使用声明&lt;/h2&gt;
&lt;p&gt;如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 &lt;a href=&#34;https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E&#34;&gt;cntext 推荐引用格式&lt;/a&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>词嵌入在经管中的应用很多，但大多数是训练词嵌入模型，依据词嵌入构建或扩展词典。 今天我们将分享一篇用词嵌入测量团队认知多样性。</p>
<p><br><br></p>
<p><img loading="lazy" src="img/paper-cover-discursive-diversity.png" alt=""  />
</p>
<h2 id="一研究">一、研究</h2>
<p>Lix, Katharina, Amir Goldberg, Sameer B. Srivastava, and Melissa A. Valentine. &ldquo;<strong>Aligning differences: Discursive diversity and team performance.</strong>&rdquo; <em>Management Science</em> 68, no. 11 (2022): 8430-8448.</p>
<h3 id="11-摘要">1.1 摘要</h3>
<p>团队中的认知多样性如何影响其绩效？先前的研究表明，团队的认知多样性存在绩效权衡：多样性团队在创造力和创新方面表现出色，但在协调行动方面则有困难。基于团队认知不是静态的、而是在动态互动中产生的观点，我们引入了 <strong>话语多样性</strong> 的概念，这是团队认知多样性的一种表现，反映了在一组互动中团队成员传达的含义在多大程度上相互不同。<strong>我们提出，高绩效团队是那些具有调节共享认知以适应不断变化的任务要求的集体能力的团队：在进行构思任务时，它们表现出更高的话语多样性，在执行协调任务时，表现出较低的话语多样性</strong>。我们进一步认为，表现出一致调节的团队——即，在成员针对不断变化的任务要求做出个人语义调整时团队层面方差较低的团队——比成员之间调节不一致的团队更有可能取得成功。我们利用 <strong>计算语言学</strong> 工具来衡量话语多样性，并借助一组新型纵向数据，包括在线平台 <a href="http://www.gigster.com">www.gigster.com</a> 上 117 个远程软件开发团队的团队内部电子通信和绩效结果，得出了对我们理论的支持。我们的研究结果表明，团队认知多样性的绩效权衡并非不可避免：团队可以通过将话语多样性水平与任务要求相匹配，并在进行这些调整时使成员保持一致，来应对这一权衡。</p>
<h3 id="12-创新点">1.2 创新点</h3>
<p>这篇论文的创新点主要包括以下几个方面：</p>
<ol>
<li>
<p><strong>研究了团队内部的差异对团队绩效的影响</strong>：该论文通过分析团队成员之间的差异，探讨了这些差异对团队绩效的影响。这一研究角度对于理解团队内部动态和绩效提升具有重要意义。</p>
</li>
<li>
<p><strong>引入了阶段性的话语差异概念</strong>：论文提出了阶段性的话语差异概念，即团队成员在不同阶段的沟通中所表现出的差异。这一概念有助于更好地理解团队内部沟通的动态过程。</p>
</li>
<li>
<p><strong>探讨了团队内部沟通差异的调节作用</strong>：论文研究了团队内部沟通差异与团队绩效之间的关系，并发现团队内部沟通差异在不同阶段对团队绩效的影响存在差异。这一发现为团队管理和绩效提升提供了重要的启示。</p>
</li>
<li>
<p><strong>结合了多个学科领域的理论和方法</strong>：该论文综合运用了心理学、经济学和组织学等多个学科领域的理论和方法，从多个角度深入研究了团队内部差异和绩效之间的关系，为相关领域的研究提供了新的视角和方法。</p>
</li>
</ol>
<p><br><br></p>
<h2 id="二文献梳理">二、文献梳理</h2>
<h3 id="21-认知多样性">2.1 认知多样性</h3>
<p>认知多样性(cognitive diversity)对团队绩效的影响是一个长期存在的问题。以往的研究表明，团队的认知多样性存在绩效权衡：多样性团队在创造力和创新方面表现出色，但在协调行动方面存在困难。然而，<strong>最近的研究提出了一种新的观点，即团队的「认知多样性」可以通过调节团队的「共享认知」来实现绩效的平衡。这意味着团队可以根据任务要求调整其认知多样性的水平，以在创造性任务和协调任务之间找到平衡点</strong>。高绩效团队具备调节团队认知的能力，使其能够在创造性任务中展现较高的认知多样性，在协调任务中展现较低的认知多样性。这种能力使团队能够在创新和执行之间找到平衡，从而提高绩效。</p>
<br>
<h3 id="22-话语多样性">2.2 话语多样性</h3>
<p><strong>话语多样性(discursive diversity) 是指团队成员在交流和讨论中表达的观点、意见和想法的多样性程度。它反映了团队成员在思考和表达上的差异程度。话语多样性可以包括词汇选择、句子结构、表达方式等方面的差异</strong>。</p>
<p>话语多样性对团队的协调行动有影响。在协调任务中，团队成员需要相互理解、协调行动，达成共识并共同努力实现共同目标。如果团队成员的话语多样性过高，意味着他们在表达观点和意见时存在较大的差异，这可能导致沟通困难、理解不一致和冲突的产生，从而影响团队的协调行动。</p>
<p>因此，在协调任务中，团队成员的话语多样性应该相对较低，以便更好地理解和协调彼此的行动。相反，在创意和思考任务中，话语多样性可以促进团队成员的创新和思考，帮助他们从不同的角度和观点来解决问题，从而提高团队的创造力和创新能力。总之，话语多样性在团队中起着重要的作用，它需要根据任务的性质和要求进行调节，以实现团队的协调行动和创新能力。</p>
<br>
<h3 id="23-两者关系">2.3 两者关系</h3>
<p><strong>在这篇论文中，话语多样性被用来衡量认知多样性</strong>。研究人员使用计算语言学的工具来推导出话语多样性的度量，并将其应用于团队的电子沟通数据中。他们认为，团队的话语多样性可以反映成员之间的认知多样性，即在思维方式、知识和技能等方面的差异程度。通过分析团队的话语多样性，研究人员试图探索团队在不同任务要求下的表现，并研究团队如何调节共享认知以适应任务需求的变化。因此，话语多样性被视为一种衡量团队认知多样性的指标。</p>
<p><br><br></p>
<h2 id="三数据及方法">三、数据及方法</h2>
<h3 id="31-数据">3.1 数据</h3>
<p>Gigster(<a href="http://www.gigster.com">www.gigster.com</a>),是一个在线平台， 自由软件开发人员可以在该平台上为个人和企业客户制作按需软件。该平台将个人自由职业开发人员组装成由团队领导领导的临时团队，并将他们分配给需要复杂、相互依赖的长期项目。该平台上的自由职业者分布在全球各地，从事从移动到网络应用程序开发的各种项目。这些项目通常是知识密集型的，需要高水平的创造力、技术问题解决能力和人际协调能力。软件项目规模巨大，成本从数万美元到数十万美元不等（极端情况下可达一百万美元以上）。</p>
<p>我们的数据集由 117 个团队组成，代表 421 个不同的个体（36% 为女性），时间跨度从 2015 年初到 2017 年底。一个典型的团队有 5 名成员，其中包括一名项目经理；至少一名后端、前端或“全栈”工程师；设计师；和用户界面专家。根据项目类型，团队有时还包括作家、自然语言处理工程师和其他类型的专业人士。在我们数据中的团队中，项目平均持续 159 天（中位数：150 天），并分为平均持续两周的里程碑阶段（平均：14 天；中位数：14 天）。要加入该平台，专业人士必须通过旨在验证其专业知识的各种技术面试。平均而言，单个团队的成员代表 3.6 个国家/地区（中位数：3 个）。在我们的样本中，42% 的人将其原籍国列为北美。另外 13% 来自亚洲，其次是 12% 来自欧洲。其余 23% 居住在拉丁美洲、非洲和世界其他地区。</p>
<p>由于地理位置分散且缺乏实体办公空间，团队成员几乎完全通过名为 Slack 的在线即时通讯工具进行沟通。我们可以访问整个团队的 Slack 档案——超过 800,000 条消息。每条消息都带有时间戳并可归因（通过匿名标识符）其作者。团队在整个生命周期中平均在公共渠道中交换 1,873 条 Slack 消息（中位数：1,220 条）。我们对 Gigster 的高级领导和团队领导进行了非正式采访，他们一致表示团队沟通几乎完全通过 Slack 进行。一位高级领导描述了其中的原因：“几乎所有团队对话都发生在 Slack 上。这是一个有用的工具，因为我们运营全球团队，而且 Slack 允许在一个平台内进行实时和异步通信。它还允许轻松地共享项目文件。” 多位知情人士强调，团队成员始终依赖 Slack，而不是其他工具，因为“一切都在一个地方”对于促进团队协作非常重要。知情人士还表示，团队成员有动力使用 Slack，因为它提供了团队流程和事件的透明档案，可用于对一些罕见的争议案例进行分类。</p>
<p>除了 Slack 消息之外，我们还可以获得有关团队成员特征（职能角色、性别和原籍国）的数据，以及团队在实现各个项目里程碑方面的整体绩效。这些数据共同构成了团队内部动态和结果的丰富且连续的历史记录。</p>
<br>
<h3 id="32-计算话语多样性">3.2 计算话语多样性</h3>
<p>之前的工作表明，词嵌入模型对于捕获单词之间的语义关系特别有用。例如，Garg 等人 (2018) 证明，根据应用于 20 世纪出版的英语书籍的词嵌入模型推断出的不同职业的语义性别关联，与这些职业的历史性别构成相对应。同样，科兹洛夫斯基等人 (2019) 说明了不同的生活方式活动如何与阶级、种族和性别认同相关。因此，词嵌入为语言中包含的众多意义维度提供了全面且有意义的见解，而这是以前的方法无法捕获的。基于此，本论文使用词嵌入模型开发了话语多样性度量。</p>
<p><strong>我们首先对 Slack 数据进行预处理，并使用 Word2Vec（连续词袋词嵌入模型的流行实现）来训练词嵌入模型。</strong></p>
<p>按照标准实践，窗口大小设置为10， 维度设置成200来训练word2vec模型（<a href="https://pubsonline.informs.org/doi/full/10.1287/mnsc.2021.4274#B54">Mikolov 等人，2013</a>）。从这个训练过程中，我们获得了语料库中每个单词的一个 200 维坐标向量，表示该单词在语义空间中的位置。</p>
<blockquote>
<p><strong>窗口大小</strong>: 每个词的上下文范围。 人阅读书籍，一般视野只有十来个词，逐行阅读。 跟人类似， 在计算机中训练词嵌入模型时候，数据不是一次性灌入习得词语的向量，而是像人一样是有上下文范围的，这个范围叫做窗口。</p>
</blockquote>
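<p>以上训练设置（窗口=10，维度=200，CBOW）用 gensim 风格的伪码大致如下（参数名参照 gensim 4.x 的 Word2Vec 接口，sentences 假定为分词后的消息列表，属示意性写法）：</p>

```python
# 伪码：训练 word2vec（CBOW，窗口 10，维度 200）
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=200, window=10, sg=0)
wv = model.wv   # KeyedVectors：wv['word'] 即该词的 200 维向量
```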
<p>如前所述，嵌入空间的维度表示训练语料库中语言使用的潜在特征。尽管这些维度本身不具有定性可解释的含义，但这些维度是提供信息的，因为具有更相似含义的单词彼此更接近。<strong>下图(a)是从聊天消息构建词嵌入向量的过程</strong>:</p>
<p><img loading="lazy" src="img/01-compute-discursive-diverisity-with-embeddings.jpeg" alt=""  />
</p>
<p><strong>使用词嵌入模型，我们就可以把词语、每句话、某人某时期的话、某团队某时期的话、所有团队所有的话，通过一定的计算，都表征为 200 维的向量。</strong>上图 (b) 展示了从聊天消息构建团队话语多样性得分(Discursive Diversity)的过程。</p>
<p><strong>通过公式1，计算出两个人差异性。通过公式2， 计算出团队话语多样性。</strong></p>
<p><img loading="lazy" src="img/03-formular-1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/04-formular-2.png" alt=""  />
</p>
<br>
<p>假设团队只有三个人， 低话语多样性 与 高话语多样性， 分别对应下图的左侧和右侧。</p>
<p><img loading="lazy" src="img/02-low-and-high-examples-of-discursive-diversity.jpeg" alt=""  />
</p>
<p><br><br></p>
<h2 id="四python伪码">四、Python伪码</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">discursive_diversity_score</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">words</span><span class="p">):</span>
    <span class="c1"># wv: 词嵌入模型; gensim.models.keyedvectors.KeyedVectors</span>
    <span class="c1"># words: 一个时间窗口内的词语列表</span>
    
    <span class="c1"># 计算词嵌入向量的平均值</span>
    <span class="n">embedding_vectors</span> <span class="o">=</span> <span class="p">[</span><span class="n">wv</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
    <span class="n">centroid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">embedding_vectors</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1"># 计算各词向量与质心的余弦距离(1 - 余弦相似度)，距离越大表示语义越分散</span>
    <span class="n">pairwise_distances</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">centroid</span><span class="p">,</span> <span class="n">embedding</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">centroid</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">embedding</span><span class="p">))</span> <span class="k">for</span> <span class="n">embedding</span> <span class="ow">in</span> <span class="n">embedding_vectors</span><span class="p">]</span>
    
    <span class="c1"># 计算语言多样性得分</span>
    <span class="n">diversity_score</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">pairwise_distances</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">diversity_score</span>
</code></pre></div><p>函数 <em><strong>discursive_diversity_score</strong></em> 已内置到 cntext 中。 对cntext 感兴趣，可阅读 <a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/">文本分析库cntext使用手册</a></p>
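<p>下面给出一个可独立运行的最小示例（其中的&ldquo;词向量模型&rdquo;是手工构造的假设数据，仅演示调用方式；实际使用时 wv 应为训练好的 gensim KeyedVectors）。按 Lix et al.(2022) 的思路，多样性可表示为 1 减去各词向量与质心的平均余弦相似度：</p>

```python
import numpy as np

def discursive_diversity_score(wv, words):
    # wv: 任意支持 wv[word] 取向量的对象, 这里用 dict 模拟
    embedding_vectors = [wv[word] for word in words]
    centroid = np.mean(embedding_vectors, axis=0)
    # 每个词向量与质心的余弦相似度
    sims = [np.dot(centroid, e) / (np.linalg.norm(centroid) * np.linalg.norm(e))
            for e in embedding_vectors]
    # 相似度越低, 语言多样性越高
    return 1 - np.mean(sims)

# 手工构造的示例向量(假设数据)
wv = {'增长': np.array([1.0, 0.0]), '创新': np.array([0.0, 1.0])}
print(round(discursive_diversity_score(wv, ['增长', '创新']), 4))  # 0.2929
```

<p>两个完全正交的向量得到约 0.29 的多样性；若所有向量完全相同，得分为 0。</p>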
<br>
<br>
<h2 id="五词嵌入应用文献">五、词嵌入应用文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"># 使用词嵌入技术构建词典
[1] 胡楠、薛付婧、王昊楠, 2021. **管理者短视主义影响企业长期投资吗——基于文本分析和机器学习**. *管理世界*, *37*(5), pp.139-156.    
[2] Kai Li, Feng Mai, Rui Shen, Xinyan Yan, **Measuring Corporate Culture Using Machine Learning**, *The Review of Financial Studies*, 2020.


# 使用词嵌入测量偏见(刻板印象)、认知
[3]Lawson, M. Asher, Ashley E. Martin, Imrul Huda, and Sandra C. Matz. &#34;**Hiring women into senior leadership positions is associated with a reduction in gender stereotypes in organizational language.**&#34; _Proceedings of the National Academy of Sciences_ 119, no. 9 (2022): e2026443119.
[4]Lix, Katharina, Amir Goldberg, Sameer B. Srivastava, and Melissa A. Valentine. &#34;**Aligning differences: Discursive diversity and team performance.**&#34; *Management Science* 68, no. 11 (2022): 8430-8448.
</code></pre></div><p><br><br></p>
<h2 id="cntext使用声明">cntext使用声明</h2>
<p>如在研究或项目中使用 cntext ，请在文中介绍并附引用声明。引用格式可参考 <a href="https://textdata.cn/blog/2025-09-09-transform-text-data-into-structured-data-with-lmstudio-and-cntext/#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E">cntext 推荐引用格式</a></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>案例代码 | 使用正则表达式判别微博用户mbti类型</title>
      <link>https://textdata.cn/blog/2023-10-30-raw-mbti-users/</link>
      <pubDate>Mon, 30 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-30-raw-mbti-users/</guid>
      <description>&lt;p&gt;使用Python爬虫采集「微博搜索」中含mbti信息的推文，再用正则表达式判别用户的mbti类型。相比实验室做实验或发放调查问卷，这种方式收集到的用户类别非常自然、真实。爬虫不是今日的主题，就不展开分享了。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;code.ipynb&#34;&gt;&lt;strong&gt;点击下载code.ipynb&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;mbti_test.csv&#34;&gt;&lt;strong&gt;点击下载mbti_test.csv&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/weibo.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#采集自微博搜索中含mbti类型的推文&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mbti_test.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#剔除content列中的nan数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;subset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;正则练习&#34;&gt;正则练习&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;提取含有mbti的记录&lt;/li&gt;
&lt;li&gt;提取出含mbti类型出现的前后5个字符的文本(前5个字符，后5个字符， 含mbti本身， 窗体最长的长度是14)&lt;/li&gt;
&lt;li&gt;识别出含mbti的记录中对应的mbti类型， 未识别的标记为&amp;quot;未识别&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一-提取含有mbti的记录&#34;&gt;一、 提取含有mbti的记录&lt;/h2&gt;
&lt;p&gt;实现方法有两种&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;pd.Series.str.contains(regex_pattern)&lt;/li&gt;
&lt;li&gt;定义一个正则处理函数regex_func， 使用pd.Series.apply(regex_func)&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;正则表达式含义&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mbtis&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;[ 和 ]&lt;/code&gt;：这是字符类（character class）的起始和结束标记，表示要匹配方括号内的任何字符。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj&lt;/code&gt;：这是一个字符类内的字符集合，用于匹配MBTI类型词汇。每个MBTI类型词汇都以竖线 | 分隔，表示“或”的关系。这意味着正则表达式会匹配其中任何一个MBTI类型词汇。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;+&lt;/code&gt;：这是一个量词，表示匹配前面的字符集合（MBTI类型词汇）一次或多次。它使正则表达式可以匹配包含一个或多个MBTI类型词汇的文本。&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mbtis&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]&amp;#39;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mbtis&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;0       True
1       True
2       True
3       True
4       True
       ...  
495    False
496    False
497    False
498    False
499    False
Name: content, Length: 497, dtype: bool
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;has_mbti&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;mbtis&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]+&amp;#39;&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mbtis&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;
    
    
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;has_mbti&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;0       True
1       True
2       True
3       True
4       True
       ...  
495    False
496    False
497     True
498    False
499     True
Name: content, Length: 497, dtype: bool
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#将结果存储到df中&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;hasMBTI&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;has_mbti&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二mbti前后内容&#34;&gt;二、mbti前后内容&lt;/h2&gt;
&lt;p&gt;提取出含mbti类型出现的前后5个字符的文本(前5个字符，后5个字符， 含mbti本身， 窗体最长的长度是14)。&lt;/p&gt;
&lt;p&gt;这样后续的分析任务，就可以通过查看mbti字眼前后出现的字符，来更新正则表达式。&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;正则表达式含义&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;mbti_win = &amp;#34;(.{0,5}(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj).{0,5})&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;(&lt;/code&gt; 和 &lt;code&gt;)&lt;/code&gt;这些括号用于将整个匹配结果捕获为一个分组&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.{0,5}&lt;/code&gt; ：这是一个量词，表示匹配前面的字符（.表示匹配任意字符）零次到五次。这部分用于匹配前面的文本，确保最多匹配前面的五个字符。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)&lt;/code&gt;：这是一个非捕获分组，用于将多个MBTI类型词汇用 | 连接起来，表示匹配其中任何一个。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.{0,5}&lt;/code&gt; ：这部分同样是一个量词，表示匹配后面的字符，确保最多匹配后面的五个字符。&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;mbti_window&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#识别mbti的正则表达式 &lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;mbti_win&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;(.{0,5}(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj).{0,5})&amp;#34;&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mbti_win&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;未识别&amp;#34;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;MBTI_win&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mbti_window&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三识别mbti类型&#34;&gt;三、识别mbti类型&lt;/h2&gt;
&lt;p&gt;刚刚的代码比较粗糙，只能判断文本中是否有mbti信息，但并不能判断该用户是否为某种mbti类型。&lt;/p&gt;
&lt;p&gt;微博文本中，只有 &lt;code&gt;//@&lt;/code&gt; 前字符内容是微博用户所写内容。为了识别用户的mbti类型，可以先将我们看到的表达方式列举一下&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;我是[mbti]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;自己是[mbti]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;从[mbti]变为[mbti]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;一直是[mbti]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[mbti]我&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;本[mbti]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;hellip;&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;可以基于此设计一个严格的正则表达式，能识别到的记录，肯定能判断该用户的mbti类型。 未识别到的标记为 &amp;ldquo;未识别&amp;rdquo;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;正则表达式含义&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mbti_regex&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;[我|自己|变成|一直|是|本]*(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)[我|俺|本|自己]*&amp;#34;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;[我|自己|变成|一直|是|本]*&lt;/code&gt;：这部分是一个字符集合，用于匹配前面的字符（关键词）。方括号 &lt;code&gt;[...]&lt;/code&gt; 表示字符类，其中的字符是可选的，并且 * 表示匹配零次或多次。这意味着它可以匹配零个或多个出现在方括号中的字符，例如可以匹配&amp;quot;我&amp;quot;、&amp;ldquo;自己&amp;rdquo;、&amp;ldquo;变成&amp;rdquo;、&amp;ldquo;一直&amp;rdquo;、&amp;ldquo;是&amp;rdquo;、&amp;ldquo;本&amp;quot;等这些关键词。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)&lt;/code&gt; ：这是一个分组，其中包含了MBTI类型词汇，用竖线 &lt;code&gt;|&lt;/code&gt; 分隔，表示&amp;quot;或&amp;quot;的关系。这部分用于匹配任意一个MBTI类型词汇。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[我|俺|本|自己]*&lt;/code&gt; ：这部分与第1部分类似，是一个字符集合，用于匹配后面的字符（关键词）。同样，方括号 &lt;code&gt;[...]&lt;/code&gt; 表示字符类，其中的字符是可选的，并且 &lt;code&gt;*&lt;/code&gt; 表示匹配零次或多次。&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;
&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;identify_mbti&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;//@&amp;#39;&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;new_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;//@&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;new_text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;

    &lt;span class=&#34;c1&#34;&gt;#识别mbti的正则表达式 &lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;mbti_regex&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;[我|自己|变成|一直|是|本]*(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)[我|俺|本|自己]*&amp;#34;&lt;/span&gt;

    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mbti_regex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;new_text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;未识别&amp;#34;&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#mbti类型&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;MBTI_Cat&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;identify_mbti&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/4.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#各类型记录数&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;MBTI_Cat&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;MBTI_Cat
未识别     297
infp     35
isfj     20
enfp     18
intp     17
isfp     16
intj     14
entp     12
entj     11
infj     11
enfj      8
estj      8
istp      8
istj      7
esfp      6
estp      5
esfj      4
Name: count, dtype: int64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<p>使用Python爬虫采集「微博搜索」中含mbti信息的推文，再用正则表达式判别用户的mbti类型。相比实验室做实验或发放调查问卷，这种方式收集到的用户类别非常自然、真实。爬虫不是今日的主题，就不展开分享了。</p>
<ul>
<li><a href="code.ipynb"><strong>点击下载code.ipynb</strong></a></li>
<li><a href="mbti_test.csv"><strong>点击下载mbti_test.csv</strong></a></li>
</ul>
<br>
<p><img loading="lazy" src="img/weibo.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#采集自微博搜索中含mbti类型的推文</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;mbti_test.csv&#39;</span><span class="p">)</span>
<span class="c1">#剔除content列中的nan数据</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="正则练习">正则练习</h2>
<ol>
<li>提取含有mbti的记录</li>
<li>提取出mbti类型出现位置前后各5个字符的文本(前5个字符 + mbti本身 + 后5个字符，窗口最长为14个字符)</li>
<li>识别出含mbti的记录中对应的mbti类型， 未识别的标记为&quot;未识别&quot;</li>
</ol>
<p><br><br></p>
<h2 id="一-提取含有mbti的记录">一、 提取含有mbti的记录</h2>
<p>实现方法有两种</p>
<ol>
<li>pd.Series.str.contains(regex_pattern)</li>
<li>定义一个正则处理函数regex_func， 使用pd.Series.apply(regex_func)</li>
</ol>
<br>
<p><strong>正则表达式含义</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mbtis</span> <span class="o">=</span> <span class="s1">&#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]&#39;</span>
</code></pre></div><ul>
<li>
<p><code>[ 和 ]</code>：字符类（character class）的起止标记，匹配方括号内列出的任意<strong>单个</strong>字符。</p>
</li>
<li>
<p>需要注意：在字符类内部，竖线 <code>|</code> 只是一个普通字符，并不表示&ldquo;或&rdquo;。因此这个模式实际匹配的是 i、n、f、j、e、s、t、p 等单个字母（以及 <code>|</code> 本身），而不是完整的MBTI类型词汇。若要精确匹配完整词汇，应使用非捕获分组 <code>(?:infj|entp|...|esfj)</code>。</p>
</li>
<li>
<p><code>+</code>：量词，表示前面的字符类连续出现一次或多次（下文 has_mbti 函数采用了这种写法）。这种宽松写法召回率高但会误匹配，只适合做初筛。</p>
</li>
</ul>
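<p>方括号 <code>[...]</code> 是字符类，其中的 <code>|</code> 只是普通字符；想匹配完整词汇应使用分组 <code>(?:...)</code>。可以用一个小实验验证二者的差别（示例字符串为假设）：</p>

```python
import re

char_class = r'[infj|entp]'     # 字符类: 匹配其中任意单个字符
alternation = r'(?:infj|entp)'  # 非捕获分组: 匹配完整词汇

print(re.findall(char_class, 'I like tea'))   # ['i', 'e', 't', 'e'] 单个字母也命中
print(re.findall(alternation, 'I like tea'))  # [] 没有完整词汇
print(re.findall(alternation, '我是infj'))     # ['infj']
```

<p>因此字符类写法只适合做宽松的初筛，精确判别要靠后文更严格的正则。</p>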
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mbtis</span> <span class="o">=</span> <span class="s1">&#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]&#39;</span>

<span class="n">df</span><span class="o">.</span><span class="n">content</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">mbtis</span><span class="p">)</span>
</code></pre></div><pre><code>0       True
1       True
2       True
3       True
4       True
       ...  
495    False
496    False
497    False
498    False
499    False
Name: content, Length: 497, dtype: bool
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>


<span class="k">def</span> <span class="nf">has_mbti</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">mbtis</span> <span class="o">=</span> <span class="s1">&#39;[infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj]+&#39;</span>

    <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">mbtis</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="kc">True</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">False</span>
    
    
<span class="n">df</span><span class="o">.</span><span class="n">content</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">has_mbti</span><span class="p">)</span>
</code></pre></div><pre><code>0       True
1       True
2       True
3       True
4       True
       ...  
495    False
496    False
497     True
498    False
499     True
Name: content, Length: 497, dtype: bool
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#将结果存储到df中</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;hasMBTI&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">has_mbti</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/2.png" alt=""  />
</p>
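<p>得到布尔列 hasMBTI 后，即可用它筛选出含mbti的记录（下面用两条假设的示例数据演示）：</p>

```python
import pandas as pd

# 两条假设的示例数据
df = pd.DataFrame({'content': ['我是infp，没跑了', '今天天气不错']})
# 这里用分组式写法精确匹配完整的mbti词汇
pattern = r'infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj'
df['hasMBTI'] = df['content'].str.contains(pattern)
mbti_df = df[df['hasMBTI']]
print(len(mbti_df))  # 1
```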
<p><br><br></p>
<h2 id="二mbti前后内容">二、mbti前后内容</h2>
<p>提取出mbti类型出现位置前后各5个字符的文本（前5个字符 + mbti本身 + 后5个字符，窗口最长为14个字符）。</p>
<p>这样后续的分析任务，就可以通过查看mbti字眼前后出现的字符，来更新正则表达式。</p>
<br>
<p><strong>正则表达式含义</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mbti_win = &#34;(.{0,5}(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj).{0,5})&#34;
</code></pre></div><ul>
<li><code>(</code> 和 <code>)</code>这些括号用于将整个匹配结果捕获为一个分组</li>
<li><code>.{0,5}</code>：<code>.</code> 匹配任意单个字符，<code>{0,5}</code> 是量词，表示重复零到五次，用于捕获mbti词前面最多5个字符。</li>
<li><code>(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)</code>：这是一个非捕获分组，用竖线 | 将多个MBTI类型词汇连接起来，表示匹配其中任何一个完整词汇。</li>
<li><code>.{0,5}</code>：同理，捕获mbti词后面最多5个字符。</li>
</ul>
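<p>先用一条假设的文本感受这个窗口正则的效果：</p>

```python
import re

mbti_win = r"(.{0,5}(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj).{0,5})"

text = '测过好几次都是infp，太准了'
print(re.findall(mbti_win, text))  # ['好几次都是infp，太准了']
```

<p>返回的片段前后各保留了最多5个字符，便于人工浏览mbti词汇出现的语境。</p>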
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mbti_window</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1">#识别mbti的正则表达式 </span>
    <span class="n">mbti_win</span> <span class="o">=</span> <span class="s2">&#34;(.{0,5}(?:infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj).{0,5})&#34;</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">mbti_win</span><span class="p">,</span> <span class="n">text</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">return</span> <span class="s2">&#34;未识别&#34;</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;MBTI_win&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">mbti_window</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/3.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三识别mbti类型">三、识别mbti类型</h2>
<p>刚刚的代码比较粗糙，只能判断文本中是否有mbti信息，但并不能判断该用户是否为某种mbti类型。</p>
<p>微博转发文本中，只有 <code>//@</code> 之前的内容才是该用户自己所写。为了识别用户的mbti类型，可以先把常见的表达方式列举一下</p>
<ul>
<li><code>我是[mbti]</code></li>
<li><code>自己是[mbti]</code></li>
<li><code>从[mbti]变为[mbti]</code></li>
<li><code>一直是[mbti]</code></li>
<li><code>[mbti]我</code></li>
<li><code>本[mbti]</code></li>
<li>&hellip;&hellip;</li>
</ul>
<p>可以据此设计一个较严格的正则表达式：能匹配到的记录，基本可以确定该用户的mbti类型；匹配不到的标记为 &ldquo;未识别&rdquo;。</p>
<br>
<p><strong>正则表达式含义</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">mbti_regex</span> <span class="o">=</span> <span class="s2">&#34;[我|自己|变成|一直|是|本]*(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)[我|俺|本|自己]*&#34;</span>
</code></pre></div><ul>
<li><code>[我|自己|变成|一直|是|本]*</code>：字符类加量词 <code>*</code>，匹配方括号内任意单个字符出现零次或多次。注意字符类内的 <code>|</code> 只是普通字符、并不表示&ldquo;或&rdquo;，因此它匹配的其实是&ldquo;我&rdquo;&ldquo;自&rdquo;&ldquo;己&rdquo;&ldquo;变&rdquo;&ldquo;成&rdquo;等单个汉字的任意组合；写成 <code>[我自己变成一直是本]*</code> 效果相同且更清晰。</li>
<li><code>(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)</code> ：这是一个分组，其中包含了MBTI类型词汇，用竖线 <code>|</code> 分隔，表示&quot;或&quot;的关系。这部分用于匹配任意一个MBTI类型词汇。</li>
<li><code>[我|俺|本|自己]*</code> ：与第1部分同理，匹配方括号内任意单个字符零次或多次，用于吸收mbti词后面紧跟的&ldquo;我&rdquo;&ldquo;俺&rdquo;&ldquo;本&rdquo;等字。</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">
<span class="k">def</span> <span class="nf">identify_mbti</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">if</span> <span class="s1">&#39;//@&#39;</span> <span class="ow">in</span> <span class="n">text</span><span class="p">:</span>
        <span class="n">new_text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;//@&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">new_text</span> <span class="o">=</span> <span class="n">text</span>

    <span class="c1">#识别mbti的正则表达式 </span>
    <span class="n">mbti_regex</span> <span class="o">=</span> <span class="s2">&#34;[我|自己|变成|一直|是|本]*(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)[我|俺|本|自己]*&#34;</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">mbti_regex</span><span class="p">,</span> <span class="n">new_text</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">return</span> <span class="s2">&#34;未识别&#34;</span>

<span class="c1">#mbti类型</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;MBTI_Cat&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;content&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">identify_mbti</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/4.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#各类型记录数</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;MBTI_Cat&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div><pre><code>MBTI_Cat
未识别     297
infp     35
isfj     20
enfp     18
intp     17
isfp     16
intj     14
entp     12
entj     11
infj     11
enfj      8
estj      8
istp      8
istj      7
esfp      6
estp      5
esfj      4
Name: count, dtype: int64
</code></pre>
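<p>截断逻辑可以单独验证：<code>//@</code> 之后引用的他人内容不应影响判定。下面是一个自包含的小例子（示例文本为假设，注意匹配的对象是截断后的 new_text）：</p>

```python
import re

def identify_mbti(text):
    # 只保留 //@ 之前用户自己所写的内容(没有 //@ 时 split 返回原文)
    new_text = text.split('//@')[0]
    mbti_regex = "[我|自己|变成|一直|是|本]*(infj|entp|intp|intj|entj|enfj|infp|enfp|isfp|istp|isfj|istj|estp|esfp|estj|esfj)[我|俺|本|自己]*"
    try:
        # findall 返回的是捕获组(即mbti类型本身)的列表
        return re.findall(mbti_regex, new_text)[0]
    except IndexError:
        return "未识别"

print(identify_mbti('本infp路过//@路人甲: 你是entj吧'))  # infp
print(identify_mbti('大家晚安'))                          # 未识别
```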
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>GTE中文通用文本向量表示模型</title>
      <link>https://textdata.cn/blog/2023-10-27-nlp_gte_sentence-embedding_chinese/</link>
      <pubDate>Fri, 27 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-27-nlp_gte_sentence-embedding_chinese/</guid>
      <description>&lt;h2 id=&#34;gte中文通用文本表示模型&#34;&gt;GTE中文通用文本表示模型&lt;/h2&gt;
&lt;p&gt;文本表示是自然语言处理(NLP)领域的核心问题, 其在很多NLP、信息检索的下游任务中发挥着非常重要的作用。近几年, 随着深度学习的发展，尤其是预训练语言模型的出现极大的推动了文本表示技术的效果, 基于预训练语言模型的文本表示模型在学术研究数据、工业实际应用中都明显优于传统的基于统计模型(词袋法、TF-IDF) 或者浅层神经网络的文本表示模型。这里, 我们主要关注基于预训练语言模型的文本表示。GTE项目地址&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://modelscope.cn/models/damo/nlp_gte_sentence-embedding_chinese-small/summary
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;文本表示示例, 输入一个句子, 输出一个固定维度的连续向量:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;输入: &lt;code&gt;吃完海鲜可以喝牛奶吗?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;输出: &lt;code&gt;[0.27162,-0.66159,0.33031,0.24121,0.46122,...]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;文本的向量表示通常可以用于&lt;strong&gt;文本聚类&lt;/strong&gt;、&lt;strong&gt;文本相似度计算&lt;/strong&gt;等下游任务中。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二文本表示模型&#34;&gt;一、文本表示模型&lt;/h2&gt;
&lt;p&gt;基于监督数据训练的文本表示模型通常采用Dual Encoder框架, 如下图所示。在Dual Encoder框架中, Query和Document文本通过预训练语言模型编码后, 通常采用预训练语言模型[CLS]位置的向量作为最终的文本向量表示。基于标注数据的标签, 通过计算query-document之间的cosine距离度量两者之间的相关性。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/repo.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;GTE-zh模型使用&lt;a href=&#34;https://arxiv.org/abs/2205.12035&#34;&gt;retromae&lt;/a&gt;初始化训练模型，之后利用两阶段训练方法训练模型：第一阶段利用大规模弱监督文本对数据训练模型，第二阶段利用高质量精标文本对数据以及挖掘的难负样本数据训练模型。具体训练方法请参考论文&lt;a href=&#34;https://arxiv.org/abs/2308.03281&#34;&gt;Towards General Text Embeddings with Multi-stage Contrastive Learning&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二使用方式和范围&#34;&gt;二、使用方式和范围&lt;/h2&gt;
&lt;p&gt;使用方式:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;直接推理, 对给定文本计算其对应的文本向量表示，向量维度512&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;使用范围:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;本模型可以使用在通用领域的文本向量表示及其下游应用场景, 包括 &lt;strong&gt;两文档间文本相似度计算&lt;/strong&gt;、&lt;strong&gt;query&amp;amp;多doc候选的相似度排序&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;21-如何使用&#34;&gt;2.1 如何使用&lt;/h3&gt;
&lt;p&gt;在ModelScope框架上，提供输入文本(默认最长文本长度为128)，即可以通过简单的Pipeline调用来使用GTE文本向量表示模型。ModelScope封装了统一的接口对外提供&lt;strong&gt;单文档向量表示&lt;/strong&gt;、&lt;strong&gt;双文档文本相似度&lt;/strong&gt;、&lt;strong&gt;多候选相似度计算&lt;/strong&gt;等功能&lt;/p&gt;
&lt;h3 id=&#34;22-安装&#34;&gt;2.2 安装&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install torch
pip3 install transformers
pip3 install modelscope
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三代码示例&#34;&gt;三、代码示例&lt;/h2&gt;
&lt;p&gt;为方便实验，可选择体积较小的模型文件 &lt;strong&gt;damo/nlp_gte_sentence-embedding_chinese-small&lt;/strong&gt;（下方代码给出small与large两个model_id，实际使用时保留其一即可）&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;modelscope.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Model&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;modelscope.pipelines&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;modelscope.utils.constant&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Tasks&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;model_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;damo/nlp_gte_sentence-embedding_chinese-small&amp;#34;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#57M&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;model_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;damo/nlp_gte_sentence-embedding_chinese-large&amp;#34;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;#621M&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pipeline_se&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Tasks&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sentence_embedding&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                       &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;model_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 当输入包含“source_sentence”与“sentences_to_compare”时，会输出source_sentence中首个句子与sentences_to_compare中每个句子的向量表示，以及source_sentence中首个句子与sentences_to_compare中每个句子的相似度。&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;inputs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;source_sentence&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;吃完海鲜可以喝牛奶吗?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;sentences_to_compare&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;不可以，早晨喝牛奶不科学&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃了海鲜后是不能再喝牛奶的，因为牛奶中含得有维生素C，如果海鲜喝牛奶一起服用会对人体造成一定的伤害&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃海鲜是不能同时喝牛奶吃水果，这个至少间隔6小时以上才可以。&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃海鲜是不可以吃柠檬的因为其中的维生素C会和海鲜中的矿物质形成砷&amp;#34;&lt;/span&gt;
        &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline_se&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inputs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;text_embedding&amp;#39;: array([[-0.03317244, -0.0419106 , -0.03626636, ..., -0.0132677 ,
         -0.02028614, -0.01542077],
        [-0.04563809, -0.06220782, -0.03775004, ...,  0.01267119,
         -0.01111769, -0.03390383],
        [-0.02073098, -0.04639562, -0.04818704, ..., -0.00754705,
         -0.00731624, -0.02740852],
        [-0.00037597, -0.05922904, -0.0459275 , ..., -0.00697823,
         -0.02154762, -0.02951157],
        [-0.00491675, -0.02552056, -0.03427778, ..., -0.00760836,
         -0.00404084, -0.0509829 ]], dtype=float32),
 &amp;#39;scores&amp;#39;: [0.8542333245277405,
  0.9613471031188965,
  0.947378396987915,
  0.8620702028274536]}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;默认向量维度512,  两个向量做内积距离计算得到score&lt;/strong&gt;
&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 当输入仅含有source_sentence时，会输出source_sentence中每个句子的向量表示。&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;inputs2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s2&#34;&gt;&amp;#34;source_sentence&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;不可以，早晨喝牛奶不科学&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃了海鲜后是不能再喝牛奶的，因为牛奶中含得有维生素C，如果海鲜喝牛奶一起服用会对人体造成一定的伤害&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃海鲜是不能同时喝牛奶吃水果，这个至少间隔6小时以上才可以。&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s2&#34;&gt;&amp;#34;吃海鲜是不可以吃柠檬的因为其中的维生素C会和海鲜中的矿物质形成砷&amp;#34;&lt;/span&gt;
        &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline_se&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inputs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;{&amp;#39;text_embedding&amp;#39;: array([[-0.04563809, -0.06220782, -0.03775004, ...,  0.01267119,
        -0.01111769, -0.03390383],
       [-0.02073098, -0.04639562, -0.04818704, ..., -0.00754705,
        -0.00731624, -0.02740852],
       [-0.00037597, -0.05922904, -0.0459275 , ..., -0.00697823,
        -0.02154762, -0.02951157],
       [-0.00491675, -0.02552056, -0.03427778, ..., -0.00760836,
        -0.00404084, -0.0509829 ]], dtype=float32), &amp;#39;scores&amp;#39;: []}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="gte中文通用文本表示模型">GTE中文通用文本表示模型</h2>
<p>文本表示是自然语言处理(NLP)领域的核心问题, 其在很多NLP、信息检索的下游任务中发挥着非常重要的作用。近几年, 随着深度学习的发展，尤其是预训练语言模型的出现极大的推动了文本表示技术的效果, 基于预训练语言模型的文本表示模型在学术研究数据、工业实际应用中都明显优于传统的基于统计模型(词袋法、TF-IDF) 或者浅层神经网络的文本表示模型。这里, 我们主要关注基于预训练语言模型的文本表示。GTE项目地址</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://modelscope.cn/models/damo/nlp_gte_sentence-embedding_chinese-small/summary
</code></pre></div><br>
<p>文本表示示例, 输入一个句子, 输出一个固定维度的连续向量:</p>
<ul>
<li>输入: <code>吃完海鲜可以喝牛奶吗?</code></li>
<li>输出: <code>[0.27162,-0.66159,0.33031,0.24121,0.46122,...]</code></li>
</ul>
<p>文本的向量表示通常可以用于<strong>文本聚类</strong>、<strong>文本相似度计算</strong>等下游任务中。</p>
<p><br><br></p>
<h2 id="二文本表示模型">一、文本表示模型</h2>
<p>基于监督数据训练的文本表示模型通常采用Dual Encoder框架, 如下图所示。在Dual Encoder框架中, Query和Document文本通过预训练语言模型编码后, 通常采用预训练语言模型[CLS]位置的向量作为最终的文本向量表示。基于标注数据的标签, 通过计算query-document之间的cosine距离度量两者之间的相关性。</p>
<p><img loading="lazy" src="img/repo.png" alt=""  />
</p>
<p>GTE-zh模型使用<a href="https://arxiv.org/abs/2205.12035">retromae</a>初始化训练模型，之后利用两阶段训练方法训练模型：第一阶段利用大规模弱监督文本对数据训练模型，第二阶段利用高质量精标文本对数据以及挖掘的难负样本数据训练模型。具体训练方法请参考论文<a href="https://arxiv.org/abs/2308.03281">Towards General Text Embeddings with Multi-stage Contrastive Learning</a>。</p>
<p><br><br></p>
<h2 id="二使用方式和范围">二、使用方式和范围</h2>
<p>使用方式:</p>
<ul>
<li>直接推理, 对给定文本计算其对应的文本向量表示，向量维度512</li>
</ul>
<p>使用范围:</p>
<ul>
<li>本模型可以使用在通用领域的文本向量表示及其下游应用场景, 包括 <strong>两文档间文本相似度计算</strong>、<strong>query&amp;多doc候选的相似度排序</strong></li>
</ul>
<h3 id="21-如何使用">2.1 如何使用</h3>
<p>在ModelScope框架上，提供输入文本(默认最长文本长度为128)，即可以通过简单的Pipeline调用来使用GTE文本向量表示模型。ModelScope封装了统一的接口对外提供<strong>单文档向量表示</strong>、<strong>双文档文本相似度</strong>、<strong>多候选相似度计算</strong>等功能</p>
<h3 id="22-安装">2.2 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install torch
pip3 install transformers
pip3 install modelscope
</code></pre></div><p><br><br></p>
<h2 id="三代码示例">三、代码示例</h2>
<p>为方便实验，可选择体积较小的模型文件 <strong>damo/nlp_gte_sentence-embedding_chinese-small</strong>（下方代码给出small与large两个model_id，实际使用时保留其一即可）</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">modelscope.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">modelscope.pipelines</span> <span class="kn">import</span> <span class="n">pipeline</span>
<span class="kn">from</span> <span class="nn">modelscope.utils.constant</span> <span class="kn">import</span> <span class="n">Tasks</span>

<span class="c1">#</span>
<span class="n">model_id</span> <span class="o">=</span> <span class="s2">&#34;damo/nlp_gte_sentence-embedding_chinese-small&#34;</span> <span class="c1">#57M</span>
<span class="n">model_id</span> <span class="o">=</span> <span class="s2">&#34;damo/nlp_gte_sentence-embedding_chinese-large&#34;</span> <span class="c1">#621M</span>
<span class="n">pipeline_se</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="n">Tasks</span><span class="o">.</span><span class="n">sentence_embedding</span><span class="p">,</span>
                       <span class="n">model</span><span class="o">=</span><span class="n">model_id</span><span class="p">)</span>

<span class="c1"># 当输入包含“source_sentence”与“sentences_to_compare”时，会输出source_sentence中首个句子与sentences_to_compare中每个句子的向量表示，以及source_sentence中首个句子与sentences_to_compare中每个句子的相似度。</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s2">&#34;source_sentence&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;吃完海鲜可以喝牛奶吗?&#34;</span><span class="p">],</span>
        <span class="s2">&#34;sentences_to_compare&#34;</span><span class="p">:</span> <span class="p">[</span>
            <span class="s2">&#34;不可以，早晨喝牛奶不科学&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃了海鲜后是不能再喝牛奶的，因为牛奶中含得有维生素C，如果海鲜喝牛奶一起服用会对人体造成一定的伤害&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃海鲜是不能同时喝牛奶吃水果，这个至少间隔6小时以上才可以。&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃海鲜是不可以吃柠檬的因为其中的维生素C会和海鲜中的矿物质形成砷&#34;</span>
        <span class="p">]</span>
    <span class="p">}</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">pipeline_se</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="n">inputs</span><span class="p">)</span>
<span class="nb">print</span> <span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;text_embedding&#39;: array([[-0.03317244, -0.0419106 , -0.03626636, ..., -0.0132677 ,
         -0.02028614, -0.01542077],
        [-0.04563809, -0.06220782, -0.03775004, ...,  0.01267119,
         -0.01111769, -0.03390383],
        [-0.02073098, -0.04639562, -0.04818704, ..., -0.00754705,
         -0.00731624, -0.02740852],
        [-0.00037597, -0.05922904, -0.0459275 , ..., -0.00697823,
         -0.02154762, -0.02951157],
        [-0.00491675, -0.02552056, -0.03427778, ..., -0.00760836,
         -0.00404084, -0.0509829 ]], dtype=float32),
 &#39;scores&#39;: [0.8542333245277405,
  0.9613471031188965,
  0.947378396987915,
  0.8620702028274536]}
</code></pre></div><p><strong>默认向量维度512,  两个向量做内积距离计算得到score</strong>
<br><br></p>
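score 的计算方式可以用 numpy 自行复现：对句向量做 L2 归一化后求内积，即余弦相似度。下面用随机向量做一个形状上的示意（并非模型的真实输出）：

```python
import numpy as np

# 模拟 5 条 512 维句向量：第 1 行当作 source 句，其余 4 行当作候选句
embeddings = np.random.rand(5, 512).astype("float32")
source, candidates = embeddings[0], embeddings[1:]

def normalize(x):
    # 按最后一维做 L2 归一化
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# 归一化后的内积 = 余弦相似度，对应输出中的 scores
scores = normalize(candidates) @ normalize(source)
print(scores.shape)  # (4,)
```

归一化之后内积与余弦相似度等价，因此 scores 的取值范围在 [-1, 1] 之间。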
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># 当输入仅含有source_sentence时，会输出source_sentence中每个句子的向量表示。</span>
<span class="n">inputs2</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s2">&#34;source_sentence&#34;</span><span class="p">:</span> <span class="p">[</span>
            <span class="s2">&#34;不可以，早晨喝牛奶不科学&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃了海鲜后是不能再喝牛奶的，因为牛奶中含得有维生素C，如果海鲜喝牛奶一起服用会对人体造成一定的伤害&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃海鲜是不能同时喝牛奶吃水果，这个至少间隔6小时以上才可以。&#34;</span><span class="p">,</span>
            <span class="s2">&#34;吃海鲜是不可以吃柠檬的因为其中的维生素C会和海鲜中的矿物质形成砷&#34;</span>
        <span class="p">]</span>
<span class="p">}</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">pipeline_se</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="n">inputs2</span><span class="p">)</span>
<span class="nb">print</span> <span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">{&#39;text_embedding&#39;: array([[-0.04563809, -0.06220782, -0.03775004, ...,  0.01267119,
        -0.01111769, -0.03390383],
       [-0.02073098, -0.04639562, -0.04818704, ..., -0.00754705,
        -0.00731624, -0.02740852],
       [-0.00037597, -0.05922904, -0.0459275 , ..., -0.00697823,
        -0.02154762, -0.02951157],
       [-0.00491675, -0.02552056, -0.03427778, ..., -0.00760836,
        -0.00404084, -0.0509829 ]], dtype=float32), &#39;scores&#39;: []}
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>心理科学进展 | 语义距离与创造性思维关系的元分析</title>
      <link>https://textdata.cn/blog/2023-10-18-the-relationship-between-semantic-distance-with-creativity/</link>
      <pubDate>Wed, 18 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-18-the-relationship-between-semantic-distance-with-creativity/</guid>
      <description>自然语言处理的发展为探究语义距离与创造性思维的关系提供了可靠且有效的研究方法。 近些年关于两者之间关系的研究逐渐增多, 但研究结论并不一致。本研究基于创造力联想理论及扩散激活模型, 通过元分析的方法探讨了语义距离与创造性思维的整体关系, 并且分析了以往研究结论不一致的原因。 结果显示：语义距离与创造性思维存在中等程度的正相关，二者的相关强度受到被试年龄和创造性思维不同测量指标的调节。 研究结果表明语义距离与创造性思维关系密切, 同时解释了以往研究结论不一致的原因。 上述结果不仅能为更深入地探讨创造性思维的认知神经机制提供新的研究视角和理论解释, 而且有助于更全面地理解语义距离与创造性思维二者的关系及其边界条件, 为更好地解释、预测和提升创造力提供科学依据和重要启示。</description>
      <content:encoded><![CDATA[<p>博客之前分享过 <a href="https://textdata.cn/blog/2022-11-14-pnas_naming_unrelated_words_predicts_creativity/"><strong>PNAS(含代码) | 使用语义距离测量一个人的创新力(发散思维)得分</strong></a>，通过语义距离测量创新力，该教程含Python代码。今天摘抄一篇《心理科学进展》的论文，帮助大家更深入了解语义距离与创造性思维之间的关系。</p>
<p><br><br></p>
<h2 id="一文献">一、文献</h2>
<p>李亚丹,杜颖,谢聪,刘春宇,杨毅隆,李阳萍,邱江.<strong>语义距离与创造性思维关系的元分析</strong>[J].心理科学进展,2023,31(04):519-534.</p>
<p>摘要:  自然语言处理的发展为探究 <strong>语义距离</strong> 与 <strong>创造性思维</strong> 的关系提供了可靠且有效的研究方法。近些年关于两者之间关系的研究逐渐增多,但研究结论并不一致。本研究基于 <strong>创造力联想理论</strong> 及扩散激活模型, 通过元分析的方法探讨了语义距离与创造性思维的整体关系,并且分析了以往研究结论不一致的原因。本文经过文献检索和筛选后获得14项研究,提取r值作为效应值(共53个效应值,4729个独立样本),并使用随机效应模型进行了元分析。<strong>结果显示：语义距离与创造性思维存在中等程度的正相关(r=0.379, 95%CI [0.300, 0.452]); 二者的相关强度受到被试年龄和创造性思维不同测量指标的调节。</strong>研究结果表明语义距离与创造性思维关系密切, 同时解释了以往研究结论不一致的原因。上述结果不仅能为更深入地探讨创造性思维的认知神经机制提供新的研究视角和理论解释,而且有助于更全面地理解语义距离与创造性思维二者的关系及其边界条件,为更好地解释、预测和提升创造力提供科学依据和重要启示。</p>
<p><br><br></p>
<p><strong>创造性思维</strong> 是一种高层次的思维活动, 对科学进步和社会发展具有深远的影响, 其核心的认知成分之一就是基于语义记忆的 <strong>联想能力</strong>(Acar &amp; Runco, 2014; Marron et al., 2018)。 个体的联想能力及其在进行创造性活动时的联想过程均可以通过语义距离(semantic distance)表现出来(Beaty et al., 2014; Benedek &amp; Neubauer, 2013)。因此, 语义距离是帮助我们理解创造性思维和创造性认知过程的重要手段。</p>
<p>在认知科学领域, 通常利用 <strong>心理词典</strong> (mental lexicon) 所构成的语义网络 (semantic network) 来表征语义记忆结构 (Christensen &amp; Kenett, 2021)。在语义网络中, 概念被表示为通过 “边(edge)”相互连结的“节点(node)”, 语义距离则用来表示概念与概念之间的距离, 即语义相似性 (Paulsen et al., 1996)。</p>
<p>在实证研究中研究者们常用发散思维测验(Divergent Thinking Test)来衡量创造性思维, 但发散思维测验的评分存在着一些不足, 如流畅性和独特性有较高程度的相关致使得分极易混淆、独特性评分依赖于样本等问题(Silvia et al., 2008)。因此, 除了对原有测量技术的优化和改进, 还需要提升创造力测量的客观性和准确性。目前已有学者提出使用语义距离来测量创造性思维, <strong>但是使用语义距离测量创造性思维这一方法的有效性还存在着争议</strong>(Marron et al., 2018; Wang et al., 2018)。</p>
<br>
<br>
<h2 id="二创造性思维及其度量">二、创造性思维及其度量</h2>
<p>创造力(creativity)是指产生新颖(original)且适宜(appropriate)产品的能力(Kaufman &amp; Sternberg, 2010; Runco, 2002)。发散思维(Divergent Thinking)是个体针对给定问题或提示产生多个原创想法的心理能力(Acar &amp; Runco, 2019; Forthmann, Wilken et al., 2019), 长期以来一直是创造性思维研究中的一个重要内容(Hocevar, 1980)。发散思维测验是迄今为止创造力研究中使用最多、应用最为广泛的主流测验形式(Plucker &amp; Makel, 2010; Reiter-Palmon et al., 2019)。</p>
<p>在以往研究中, Guilford (1950)的 <strong>多用途任务</strong> (Alternate Use Task, AUT)和 Torrance (1972) 的 <strong>创造性思维测试</strong> (Torrance Tests of Creative Thinking, TTCT)使用频率较高。<strong>发散思维通常包括 4 个维度, 即流畅性、灵活性、独特性(或独创性)和精致性</strong>。其中, 流畅性(fluency)指给出的想法或解决方案的数量; 灵活性(flexibility)指想法的多样性; 独特性(originality)指想法的不寻常或唯一性; 精致性(elaboration)指给出想法或答案的详细程度(Torrance, 1965, 1988)。在评分时, 发散思维测验也常从这四个维度来计分, 并由此衡量被试答案的创造性水平。但发散思维测验存在一些潜在问题：首先，它缺乏进一步探讨创造性思维过程的手段(Hass, 2017; Marron et al., 2018)；其次，四个主要的评价指标中，除了流畅性能够被客观测量，其余三个指标的传统评分方法存在一定弊端；第三，在发散思维测验计分时，流畅性和独创性的得分容易混淆。以上三点导致发散思维测验的客观性、信度饱受争议(Benedek &amp; Neubauer, 2013)。</p>
<p><br><br></p>
<h2 id="三语义距离与创造性思维的关系">三、语义距离与创造性思维的关系</h2>
<p>语义距离这个概念来源于 Collins 和 Loftus (1975)提出的<strong>扩散激活模型</strong>(Spreading-Activation Model)。在概念与概念之间, 共同的定义性特征越多, 它们之间的关系就越近, 这个关系就称为语义距离(Volle, 2018)。例如, “雪”和“白”经常共同出现在文本中, 所以语义距离较小; 相反, “雪”和“石油”很少同时出现, 因此二者之间具有较大的语义距离。</p>
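在计算层面，语义距离通常被操作化为词向量余弦相似度的补(1 − cosine)。下面用虚构的三维玩具向量演示这一计算（真实研究中的向量来自word2vec、GloVe等词嵌入模型，维度远高于此）：

```python
import numpy as np

# 玩具词向量，数值为虚构，仅用于演示“雪/白”语义近、“雪/石油”语义远
vectors = {
    "雪":   np.array([0.90, 0.80, 0.10]),
    "白":   np.array([0.85, 0.75, 0.20]),
    "石油": np.array([0.10, 0.20, 0.90]),
}

def semantic_distance(w1, w2):
    a, b = vectors[w1], vectors[w2]
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1 - cos  # 语义距离 = 1 - 余弦相似度

print(round(semantic_distance("雪", "白"), 3))   # 语义距离小
print(round(semantic_distance("雪", "石油"), 3)) # 语义距离大
```

经常共现的词向量方向接近，余弦相似度高，因此语义距离接近 0；反之语义距离趋向 1。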
<p>Mednick 在 1962 年提出了 <strong>创造力联想理论</strong> (Associative Theory of Creativity), 该理论解释了创造性思维与语义记忆结构之间的关系(Mednick, 1962)。 该理论认为,  创造性思维涉及将弱相关或远距离概念联接成新颖且有用的概念的认知过程。 如果某些概念在语义层面相距越远, 由它们所产生的新的组合就越有创意, 新颖度越高。</p>
<p>Benedek 等人(2012)在 Mednick (1962)的理论 基础上提出, 解离能力(dissociative ability)和联想整合能力(associative combination ability)是与创造性思维密切相关的基本认知能力。</p>
<ul>
<li><strong>解离能力</strong> 是指生成不相关的概念的能力, 也可以被理解为一 种语义抑制能力, 它有助于人们获得新的语义距离遥远的概念。</li>
<li><strong>联想整合能力</strong> 指的是对看似不相关的概念形成合理联想的能力。</li>
</ul>
<p>据此我们可以推断, 语义距离作为概念与概念之间关系的量化指 标(Volle, 2018), 也即衡量个体联想能力的指标, 可以有效反映个体以联想过程为基础的创造性思维。 扩散激活模型也提到, 有创造力的人拥有更加复杂(Collins &amp; Loftus, 1975; Gruszka &amp; Necka, 2002; Kenett, 2019) 、 更加灵活的语义网络 (Schilling, 2005)。</p>
<p>近年来语义距离也开始作为测量创造性表现的指标。研究者们(Green et al., 2012; Prabhakaran et al., 2014; Weinberger et al., 2016)在探究状态创造力(state creativity)时, 通常将语义距离作为创造力水平高低的测量指标。状态创造力即被试在不同指导语或线索提示下所表现出的不同创造力水平。也有研究利用语义距离来测量创造性思维, 结果显示相比传统测量方法, 基于语义距离的测量方法在创造性思维各指标间有着更好的区分效度和结构信度(Dumas &amp; Dunbar, 2014)。</p>
<p>此外, 通过对语义距离的应用, 研究者能够对创造性思维的质量有更为客观的认识, 从而更好地探讨创造性思维的认知神经机制。人们普遍认为, 创造性思维认知过程需要 <strong>联想过程</strong> (associative processes) 与 <strong>执行过程</strong> (executive processes)的耦合(Silvia et al., 2013)。目前, 大多数创造性思维任务并未区分这两种认知过程(Mednick, 1968; Runco et al., 2016), 而对这两种认知过程的细分有助于我们更深入地理解创造性思维和创造性认知过程(Fox et al., 2015)。而语义距离作为联想能力的衡量指标, 可以更好地反映出个体在进行创造性思维任务时的联想过程(Beaty, Nusbaum et al., 2014; Beaty, Silvia et al., 2014; Marron et al., 2018)。因此, 语义距离也被用来作为认知神经科学研究中创造性思维的测量指标, 它不仅可以用于比较个体在产生不同创造性水平的答案时其大脑激活模式的差异(Beaty et al., 2017; Green et al., 2015; Tempest &amp; Radel, 2019)及个体的创造性表现随时间动态变化(Green, 2016), 还可以用来研究不同个体之间的创造力水平差异(Green, 2016)。</p>
<p><br><br></p>
<h2 id="四年龄可能调节语义距离与创造性思维关系">四、年龄可能调节语义距离与创造性思维关系</h2>
<p>近几年, 国内外开展了一些语义距离与创造性思维关系的研究, 但是研究结果却不尽相同。这可能与研究对象的人口学因素(年龄)和评估创造性思维时所使用的测量指标有关。</p>
<p>根据已有研究, 年龄可能会影响语义距离与创造性思维之间的关系。</p>
<p><strong>首先</strong>, 年龄与语言能力和词汇量有关。老年人的词汇量及语义知识存储与年轻人相比更加丰富(Kavé &amp; Halamish, 2015; Verhaeghen, 2003), 而语言能力较强、词汇量较多的个体在表达想法时更不容易受到表达能力的限制, 因此往往在言语创造性任务中表现得更好。语言能力较强的个体也可能会有更多的认知资源用来产生创造性想法(Wu et al., 2005)。<strong>其次</strong>, 不同年龄被试的语义结构和语义记忆也是不同的, 例如, 老年人语义记忆中的概念更加模块化, 也更分散(Dubossarsky et al., 2017; Wulff et al., 2019; Zortea et al., 2014)。因此, 样本群体的年龄可能会影响语义距离与创造性思维之间的关系。</p>
<p><br><br></p>
<h2 id="五元分析结果">五、元分析结果</h2>
<h3 id="51-语义距离测量创造性思维的有效性">5.1 语义距离测量创造性思维的有效性</h3>
<p><strong>本研究结果显示, 语义距离与创造性思维呈显著正相关(r = 0.379, p &lt; 0.001), 与以往研究结果一致(Hass, 2017; Heinen &amp; Johnson, 2018)。该结果进一步验证了 Mednick (1962)提出的创造力联想理论, 即如果某些概念在语义层面相距越远, 由它们所产生的新的组合就越有创意, 新颖性越高</strong>。</p>
<p>语义距离作为一种连续变量, 可以更精准地反映出创造性思维的定量变化, 而不仅仅是二元对比(例如, 创造性与非创造性条件)(Kenett et al., 2017; Kenett, 2018; Kenett, 2019)。因此, 语义距离具有测量创造性思维的独特优势(Green, 2016)。</p>
<p>然而, 本研究发现, 语义距离与创造性思维关系的效应值为 0.379, 仍处于中等程度的正相关(Cohen, 1988)。这说明尽管使用语义距离测量创造性思维有一定的有效性, 但是语义距离对创造性思维的代表程度有限。</p>
<br>
<h3 id="52-语义距离与创造性思维关系中存在的调节效应">5.2 语义距离与创造性思维关系中存在的调节效应</h3>
<p><strong>被试年龄对语义距离与创造性思维的关系具有显著的调节作用，二者的相关性随着年龄的增加而逐渐降低</strong>。原因可能在于, 随着年龄的增长, 个体的语义记忆结构和知识储备也在逐渐发生改变, 从而影响了语义距离与创造性思维的关系。首先是语义记忆结构的变化：个体的语义记忆结构会随着年龄的增长而逐渐变得稀疏(Dubossarsky et al., 2017; Wulff et al., 2019; Zortea et al., 2014)。其次是个体的知识储备和生活经验的变化：常见的言语类创造性思维任务介于现实问题任务和图形任务之间, 完成这类任务需要一定的知识储备(Wu et al., 2005)。</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>JMR | 测量消费者的语言确定性</title>
      <link>https://textdata.cn/blog/2023-10-16-measurement-of-consumer-certainty-in-language/</link>
      <pubDate>Mon, 16 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-16-measurement-of-consumer-certainty-in-language/</guid>
      <description>情感分析从根本上改变了市场营销者评估消费者意见的能力。的确，通过自然语言测量态度已经影响了市场营销在日常实践中的方式。然而，最近的研究发现，情感分析目前强调测量情感的正负面（即积极或消极）可能会产生不完整、不准确甚至误导性的见解。从概念上讲，这项研究挑战情感分析超越对情感正负面的侧重。作者识别出消费者情感的确定性或信心是一个特别有力的评估方面。从经验上，他们开发了一种新的计算语言中确定性的测量工具——确定性词典（Certainty Lexicon），并验证了其与情感分析的使用。为了构建和验证这种测量，作者使用了来自1160万人的文本，他们生成了数十亿的词汇，数百万的在线评论，以及在线预测市场的数十万条记录。在社交媒体数据集、实验室实验和在线评论中，作者发现与其他工具相比，确定性词典在其测量中更为全面、可推广和准确。作者还展示了对市场营销者来说，测量情感确定性的价值：确定性预测了广告的实际成功，而传统的情感分析则未能做到这一点。</description>
      <content:encoded><![CDATA[<h2 id="一文献">一、文献</h2>
<p>Rocklage Matthew D.,He Sharlene,Rucker Derek D.,Nordgren Loran F..<strong>Beyond Sentiment: The Value and Measurement of Consumer Certainty in Language</strong>[J].<strong>Journal of Marketing Research</strong>,2023,60(5).</p>
<p><strong>摘要(译文)</strong>:  情感分析从根本上改变了市场营销者评估消费者意见的能力。的确，通过自然语言测量态度已经影响了市场营销在日常实践中的方式。<strong>然而，最近的研究发现，情感分析目前强调测量情感的正负面（即积极或消极）可能会产生不完整、不准确甚至误导性的见解</strong>。从概念上讲，这项研究挑战情感分析超越对情感正负面的侧重。作者识别出消费者情感的确定性或信心是一个特别有力的评估方面。从经验上，<strong>他们开发了一种新的计算语言中确定性的测量工具——「确定性词典（Certainty Lexicon）」</strong>，并验证了其与情感分析的使用。为了构建和验证这种测量，作者使用了来自1160万人的文本，他们生成了数十亿的词汇，数百万的在线评论，以及在线预测市场的数十万条记录。在社交媒体数据集、实验室实验和在线评论中，作者发现与其他工具相比，确定性词典在其测量中更为全面、可推广和准确。作者还展示了对市场营销者来说，测量情感确定性的价值：确定性预测了广告的实际成功，而传统的情感分析则未能做到这一点。</p>
<p><img loading="lazy" src="img/paper.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二消费者的确定性">二、消费者的确定性</h2>
<p>为了更好地理解消费者情绪，<strong>我们认为消费者持有该情绪的确定性是至关重要的。确定性是个人对信心或信仰的主观感觉（Petrocelli, Tormala, 和 Rucker 2007）。态度研究的结果表明，消费者对所持有的态度或信仰的确定性越强，该态度或信仰驱动行为的可能性就越大</strong>（参见 Tormala 和 Rucker 2018）。例如，研究表明，当态度持有更大的确定性时，态度和行为意图之间的关联更强（r = .89），而确定性较低时关联较弱（r = .68；Tormala 和 Petty [2002], 实验 4；参见也有 Franc [1999]）。同样，持有更大确定性的想法更能预测人们对这些想法的依赖（Briñol, Petty, 和 Tormala 2004）。在态度文献中，大量的研究表明，持有更大确定性的态度更可能随着时间的推移而持续，并抵御变化（Rucker, Petty, 和 Briñol 2008；Tormala 2016；Tormala 和 Petty 2002）。</p>
<p><strong>确定性还与情感值（sentiment, 即它是积极还是消极）和情感效价（valence, 即情感的价值是多么的正或多么的负）有所不同</strong>（Clarkson, Tormala, 和 Leone 2011；Petty 和 Krosnick 1995）。<strong>虽然更极端的情感值通常与更确定的态度相关联，但这种关联并不强烈</strong>（例如，r ∼ .50；Krosnick 等人 1993）。即使文本中情感效价是相同的，但语言中的确定性的差异也可能很普遍（参见 Rucker 和 Petty 2004；Tormala 和 Petty 2002）。此外，极端态度可能持有的确定性较低（Litt 和 Tormala 2010），并且不太可能随着时间的推移而持续（Rocklage 和 Luttrell 2021）。<strong>因此，确定性 和 情感效价极端性 是不同的</strong>。</p>
<p>举例来说，考虑两位顾客访问同一家餐厅并给予其完美的五星评级。尽管他们对餐厅的态度都是一样的正面，但其中一位可能对其态度更有确定感，因为他们的许多朋友持有类似的态度（Tormala 和 DeSensi 2009）。<strong>尽管态度的情感值完全相同，但确定性更强的顾客更有可能再次光顾餐厅并将其推荐给他人</strong>（例如，Barden 和 Petty 2008）。确定性的差异可以由社交共识的数量或直接的个人经验等因素产生。更普遍地说，确定性可以源于任何影响消费者感觉其态度或信仰背后的信息是准确、完整、相关、合法或重要的因素（Rucker 等人 2014）。</p>
<p><strong>鉴于确定性是态度的一个重要和突出的方面，它是扩展消费者情感评估的理想候选指标</strong>。目前，情感分析主要集中在测量价值上，但忽略了与该价值相关的确定性。此外，研究表明，因为大多数在线表达的消费者情感都是积极的，所以市场营销人员经常面临一个“<strong>positivity problem</strong>”（Rocklage, Rucker, 和 Nordgren 2021b）。这种正面信息的过剩导致了价值的受限范围，仅基于价值或价值极端性就很难获得洞察。<strong>在这些情境中，确定性的测量可能特别有用，确定性可能比情感价更准地预测消费者行为</strong>。</p>
<p><br><br></p>
<h2 id="三语言中的确定性测量">三、语言中的确定性测量</h2>
<h3 id="31-已有测量工具">3.1 已有测量工具</h3>
<p>语言中确定性的两个最突出的度量来自Linguistic Inquiry and Word Count (LIWC; Pennebaker等人2015) 和 DICTION (Hart和Carroll 2015) 软件程序。这两个程序都提供了用于评估文本属性（如其情感）的测量方法。尽管它们也包含与确定性相关的测量，但这些测量在其有效性、普遍性以及它们用于量化语言的方法上都存在局限性。</p>
<p><strong>首先，LIWC 和 DICTION 都没有得到足够的实证验证来测量确定性，也没有经过验证以评估情感确定性</strong>。例如，这两个工具都是基于研究者对哪些词会表示个人的确定性的直觉来创建的，而不是一个更正式或基于数据的方法（Hart 1976；Pennebaker 和 Francis 1996）。LIWC包含两个名为“确定性”和“犹豫”的确定性度量。然而，“<strong>certainty</strong>”测量尚未得到直接验证（Petrie, Booth, 和 Pennebaker 1998），而“<strong>tentativeness</strong>”测量仅在一组35名写有关其大学经历的大学生中得到验证（Pennebaker和Francis 1996）。同样，DICTION的确定性测量尚未直接得到验证（Hart 1976, 1984）。尽管它们有可能作为情感确定性的测量工具，但其有效性和普遍性仍然不清晰。</p>
<p><br><br></p>
<h3 id="32已有工具不足">3.2已有工具不足</h3>
<p>首先，它们都依赖于词频计数方法。考虑LIWC如何量化以下两个句子：（1）“<em>I&rsquo;ve often disliked my experience with that brand.</em>”和（2）“<em>I&rsquo;ve sorta disliked my experience with that brand.</em>”。其中“<strong>often</strong>”和“<strong>sorta</strong>”都出现在LIWC的“<strong>tentativeness</strong>”（不确定性）词汇表中。根据LIWC的词频计数方法，这两个句子因此都被给予了12.50%的分数（即，八个词中有一个词表示不确定性）。因此，“<strong>often</strong>”和“<strong>sorta</strong>”被计为表示相同程度的不确定性。同样，DICTION也会给这两个句子相同的确定性分数。通过简单地计算每个句子中的词，词频计数方法将给定词汇表中的所有词都视为表示相同的确定性。</p>
<p>其次，在测量短文本时表现较差。这是因为短文本包含的信息相对较少，因此通常只有一个与确定性相关的关键词（Pennebaker等人2015）。鉴于词频计数方法假设给定词典中的所有词都表示相同程度的确定性，这些测量方法可能导致数据中的大偏斜（观测变异小），从而产生大量噪音，因此得到的结果缺乏信息量甚至具有误导性（Garten等人2018；Rocklage和Rucker 2019；Sterling, Jost, 和 Bonneau 2020）。鉴于市场营销人员依赖社交媒体来了解消费者情感，这一限制对他们尤为重要（Schaefer 2015）。</p>
<p>第三， 只能分析单个词汇，不能处理词组短语。例如，LIWC和DICTION都会将短语“<strong>i&rsquo;m not sure</strong>”视为表示高确定性，因为它包含单词“<strong>sure</strong>”；这些方法无法识别关键短语“<strong>not sure</strong>”。同样，它们会将“<strong>likely</strong>”和“<strong>extremely likely</strong>”视为表示相同程度的确定性。</p>
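词频计数法的第一条局限可以用几行 Python 复现：只要句子长度相同、命中词数相同，确定性程度不同的词就会得到完全一样的分数（这里的词表只是 LIWC 类词典的一个假设性子集，仅作演示）：

```python
# 假设性的 tentativeness 词表子集，仅作演示
TENTATIVE_WORDS = {"often", "sorta"}

def word_count_score(text):
    # 词频计数法：命中词数 / 总词数 × 100
    tokens = text.lower().replace(".", "").split()
    hits = sum(t in TENTATIVE_WORDS for t in tokens)
    return hits / len(tokens) * 100

s1 = "I've often disliked my experience with that brand."
s2 = "I've sorta disliked my experience with that brand."
print(word_count_score(s1), word_count_score(s2))  # 12.5 12.5，两句得分相同
```

两个句子各有 8 个词、各命中 1 个词，于是都得到 12.5%，"often" 与 "sorta" 在确定性程度上的差异被完全抹平。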
<p><img loading="lazy" src="img/1-table1.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四构建确定性词典certainty-lexicon">四、构建确定性词典(Certainty Lexicon)</h2>
<h3 id="41-构建词典步骤">4.1 构建词典步骤</h3>
<p>这篇论文测量消费者确定性的词典叫做Certainty Lexicon， 该词典构建方法及步骤如下</p>
<p><strong>Phase-1</strong> 准备候选词表；根据LIWC和DICTION中的相关词，并生成近义词、ngram词组，以扩充确定性词的候选范围。</p>
<p><strong>Phase-2</strong> 初始剔除；基于真实使用场景，剔除使用频率过低的词语与词组，以及人工阅读后认为不符合“确定性”概念的词。</p>
<p><strong>Phase-3</strong> 量化每个词的确定性；设计0~9的Likert量表（0 = “very uncertain”，9 = “very certain”；见Web Appendix B），通过MTurk在线平台发放调查问卷，共收到515名参与者的问卷，最终保留489份有效问卷。</p>
<p><strong>Phase-4</strong> 验证词典有效性。</p>
<p><img loading="lazy" src="img/2-table3.png" alt=""  />
</p>
<h3 id="42-确定性词典">4.2 确定性词典</h3>
<p>论文团队开发了 <strong>Lexical Suite</strong> 文本分析工具，内置了<strong>Evaluative Lexicon</strong>、<strong>Certainty Lexicon</strong>，可以用来分析文本的情感、确定性。工具开源免费，下载地址 <a href="http://www.lexicalsuite.com/">http://www.lexicalsuite.com/</a>。在Lexical Suite软件安装目录中，经过探索我找到了软件内置的词典txt文件。以本文介绍的<strong>确定性词典</strong>(Certainty Lexicon)为例，对应的文件是 <a href="Certainty.txt"><strong>Certainty.txt</strong></a>。</p>
<p><img loading="lazy" src="img/dict.png" alt=""  />
</p>
<p>打开txt如上图，使用Python读取发现一共有 <strong>3485</strong> 个词语(组)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;Certainty.txt&#39;</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">&#39;,&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四计算文本的确定性">五、计算文本的确定性</h2>
<p>根据确定性词典(Certainty Lexicon)，就可以计算文本的确定性指标。我在 Mac 上安装了<strong>LexiconSuite</strong>并做了测试，导入了一个<a href="certainty_test.csv"><strong>csv文件</strong></a>。</p>
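<p>在用软件计算之前，也可以先用一段简单的 Python 代码模拟这类基于词典的打分思路：文本得分取命中词条分值的均值，并优先匹配更长的词组(下面词典中的词条与分值均为演示假设，并非 Certainty.txt 的真实内容)：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># 演示用的迷你确定性词典：{词条: 确定性分值}
lexicon = {'sure': 8.2, 'not sure': 1.5, 'likely': 6.0, 'extremely likely': 8.5}

def certainty_score(text):
    """文本确定性 = 命中词条分值的均值；无命中返回 None"""
    text = text.lower()
    hits = []
    # 先匹配更长的词组，避免 "not sure" 被拆成 "sure"
    for phrase in sorted(lexicon, key=len, reverse=True):
        if phrase in text:
            hits.append(lexicon[phrase])
            text = text.replace(phrase, ' ')  # 防止重复计分
    return sum(hits) / len(hits) if hits else None

print(certainty_score('i am not sure about it'))  # 1.5
print(certainty_score('extremely likely'))        # 8.5
</code></pre></div><p>实际研究中的计算请以 LexiconSuite 的结果为准。</p>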
<p><img loading="lazy" src="img/certainty_test.png" alt=""  />
</p>
<p><img loading="lazy" src="img/ls-1.png" alt=""  />
<br>
<img loading="lazy" src="img/ls-2.png" alt=""  />
<br>
<img loading="lazy" src="img/ls-3.png" alt=""  />
<br>点击<strong>Run New Analysis</strong>
<img loading="lazy" src="img/ls-4.png" alt=""  />
<br>软件运行结果与论文中的Table-1数值是一样的。(额，准备实验数据时我把单词dislike误拼成了dislikeed。)
<br>
<img loading="lazy" src="img/ls-6.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>网络爬虫 |  采集穷游网某城市旅游景点</title>
      <link>https://textdata.cn/blog/2023-10-13-crawler-for-qyer/</link>
      <pubDate>Fri, 13 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-13-crawler-for-qyer/</guid>
      <description>&lt;h2 id=&#34;一发现网址规律&#34;&gt;一、发现网址规律&lt;/h2&gt;
&lt;h3 id=&#34;11-判断网站类型&#34;&gt;1.1 判断网站类型&lt;/h3&gt;
&lt;p&gt;这里我选择哈尔滨作为目标城市，采集哈尔滨的景点信息。第一页的网址&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://place.qyer.com/haerbin/sight/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/1-url.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2-url.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;点击页面下方翻页到第二页， 页面内容已经发生变化，但是网址栏中的网址没有变化，依然是&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;https://place.qyer.com/haerbin/sight/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;所以可以判断该网站为动态网站类型，对付这类网站，需要打开开发者工具Network面板来构建网址规律。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/3-url.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;12--抓包构建网址规律&#34;&gt;1.2  抓包构建网址规律&lt;/h3&gt;
&lt;h3 id=&#34;121-headers&#34;&gt;1.2.1 Headers&lt;/h3&gt;
&lt;p&gt;我用的是Chrome浏览器，按F12键(Mac快捷键command+option+I)打开开发者工具，如下图。&lt;/p&gt;
&lt;p&gt;打开&lt;strong&gt;开发者工具&lt;/strong&gt;后，点击&lt;strong&gt;Network面板&lt;/strong&gt;。为了让&lt;strong&gt;Network&lt;/strong&gt;监测到数据流，点击截图中标注的 2 处。这样就能发现下方截图中的&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;poi.php?action=list_json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/4-headers.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;可以基于上方截图确认，该网站现在用的是post请求方法， 写代码时可以用requests.post(url, data)方式发起请求。 &lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;122-payload&#34;&gt;1.2.2 Payload&lt;/h3&gt;
&lt;p&gt;翻页规律如何构造呢？通过检查发现，&lt;strong&gt;Payload&lt;/strong&gt;决定着翻页；下面截图中也可以看到，&lt;strong&gt;page: 2 对应着第 2 页&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/5-payload.png&#34; alt=&#34;&#34;  /&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;123-preview&#34;&gt;1.2.3 Preview&lt;/h3&gt;
&lt;p&gt;我们顺便点击Preview，检查预览数据是否与页面数据有对应关系。 截图中「丁香公园」出现在页面和preview中。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/6-preview.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h3 id=&#34;13-构造网址规律&#34;&gt;1.3 构造网址规律&lt;/h3&gt;
&lt;p&gt;构造网址规律， 以第二页为例， 发起请求，查看数据&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/7-first-requests.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二存储数据&#34;&gt;二、存储数据&lt;/h2&gt;
&lt;p&gt;采集的字段包括&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;景点名称&lt;/li&gt;
&lt;li&gt;景点链接&lt;/li&gt;
&lt;li&gt;评论人数&lt;/li&gt;
&lt;li&gt;评级&lt;/li&gt;
&lt;li&gt;图片链接&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;使用csv格式存储数据&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;csv&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;harbin_sight.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;newline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sightName&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sightUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;imgUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DictWriter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;writeheader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;list&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]:&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;sight_info&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sightName&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;cnname&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                      &lt;span class=&#34;s1&#34;&gt;&amp;#39;sightUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;https:&amp;#39;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;url&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                      &lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                      &lt;span class=&#34;s1&#34;&gt;&amp;#39;rate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;grade&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                      &lt;span class=&#34;s1&#34;&gt;&amp;#39;imgUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;photo&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
                     &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
        
        &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;writerow&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sight_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;代码运行后， 尝试检查harbin_sight.csv， 现在该文件内暂时存储了第二页的景点信息&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;harbin_sight.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;景点数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/8-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三批量采集&#34;&gt;三、批量采集&lt;/h2&gt;
&lt;p&gt;以哈尔滨为例， 景点页面一共有92页，批量采集这92页信息。完整代码&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;requests&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;csv&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;time&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;https://place.qyer.com/poi.php?action=list_json&amp;#39;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#控制翻页的字典&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;formdata&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;page&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;city&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;pid&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;11597&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;sort&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;32&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;subsort&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;all&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;isnominate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;haslastm&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;false&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
            &lt;span class=&#34;s1&#34;&gt;&amp;#39;rank&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;User-Agent&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#新建csv，存储数据&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;harbin_sight.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;newline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#设置csv的字段&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sightName&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sightUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;rate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;imgUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DictWriter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;writeheader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;#采集哈尔滨从第1页到92页&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;page&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;93&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#更新formdata网页数信息，相当于翻页&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;formdata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;page&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;page&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#time.sleep(1) 控制访问速度，每秒访问1次&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sleep&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requests&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;post&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;formdata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        
        &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;正在采集哈尔滨第&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;页信息&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;page&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#存储数据&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;list&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;sight_info&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sightName&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;cnname&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                          &lt;span class=&#34;s1&#34;&gt;&amp;#39;sightUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;https:&amp;#39;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;url&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                          &lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;commentCount&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                          &lt;span class=&#34;s1&#34;&gt;&amp;#39;rate&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;grade&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
                          &lt;span class=&#34;s1&#34;&gt;&amp;#39;imgUrl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sight&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;photo&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
                         &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;writerow&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sight_info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;正在采集哈尔滨第1页信息
正在采集哈尔滨第2页信息
......
正在采集哈尔滨第91页信息
正在采集哈尔滨第92页信息

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;检查数据&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;harbin_sight.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;景点数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/9-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/10-check.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四获取代码&#34;&gt;四、获取代码&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;%E7%A9%B7%E6%B8%B8%E4%BB%A3%E7%A0%81.ipynb&#34;&gt;&lt;strong&gt;点击下载本文代码&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一发现网址规律">一、发现网址规律</h2>
<h3 id="11-判断网站类型">1.1 判断网站类型</h3>
<p>这里我选择哈尔滨作为目标城市，采集哈尔滨的景点信息。第一页的网址</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://place.qyer.com/haerbin/sight/
</code></pre></div><p><img loading="lazy" src="img/1-url.png" alt=""  />
</p>
<p><br><img loading="lazy" src="img/2-url.png" alt=""  />
</p>
<br>
<p>点击页面下方翻页到第二页， 页面内容已经发生变化，但是网址栏中的网址没有变化，依然是</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">https://place.qyer.com/haerbin/sight/
</code></pre></div><p>所以可以判断该网站为动态网站类型，对付这类网站，需要打开开发者工具Network面板来构建网址规律。</p>
<p><img loading="lazy" src="img/3-url.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="12--抓包构建网址规律">1.2  抓包构建网址规律</h3>
<h3 id="121-headers">1.2.1 Headers</h3>
<p>我用的是Chrome浏览器，按F12键(Mac快捷键command+option+I)打开开发者工具，如下图。</p>
<p>打开<strong>开发者工具</strong>后，点击<strong>Network面板</strong>。为了让<strong>Network</strong>监测到数据流，点击截图中标注的 2 处。这样就能发现下方截图中的</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">poi.php?action=list_json
</code></pre></div><p><img loading="lazy" src="img/4-headers.png" alt=""  />
</p>
<p>可以基于上方截图确认，该网站现在用的是post请求方法， 写代码时可以用requests.post(url, data)方式发起请求。 <br></p>
<h3 id="122-payload">1.2.2 Payload</h3>
<p>翻页规律如何构造呢？通过检查发现，<strong>Payload</strong>决定着翻页；下面截图中也可以看到，<strong>page: 2 对应着第 2 页</strong>。</p>
<p><img loading="lazy" src="img/5-payload.png" alt=""  />
<br></p>
<h3 id="123-preview">1.2.3 Preview</h3>
<p>我们顺便点击Preview，检查预览数据是否与页面数据有对应关系。 截图中「丁香公园」出现在页面和preview中。</p>
<p><img loading="lazy" src="img/6-preview.png" alt=""  />
</p>
<h3 id="13-构造网址规律">1.3 构造网址规律</h3>
<p>以第二页为例构造网址并发起请求，查看返回的数据</p>
<p><img loading="lazy" src="img/7-first-requests.png" alt=""  />
</p>
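<p>根据抓包得到的 Headers 与 Payload，以第二页为例的请求代码大致如下(formdata 各字段取自抓包结果，pid=11597 对应哈尔滨)：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import requests

URL = 'https://place.qyer.com/poi.php?action=list_json'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'}

def build_formdata(page):
    """构造 Payload 中抓包看到的各字段，page 控制翻页"""
    return {'page': page, 'type': 'city', 'pid': 11597,
            'sort': 32, 'subsort': 'all', 'isnominate': -1,
            'haslastm': 'false', 'rank': 6}

def fetch_page(page):
    """请求某一页，返回该页景点字典组成的列表"""
    resp = requests.post(URL, data=build_formdata(page), headers=HEADERS, timeout=10)
    return resp.json()['data']['list']

# sights = fetch_page(2)  # 第二页的景点数据
</code></pre></div>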
<p><br><br></p>
<h2 id="二存储数据">二、存储数据</h2>
<p>采集的字段包括</p>
<ul>
<li>景点名称</li>
<li>景点链接</li>
<li>评论人数</li>
<li>评级</li>
<li>图片链接</li>
</ul>
<p>使用csv格式存储数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">csv</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;harbin_sight.csv&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvf</span><span class="p">:</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;sightName&#39;</span><span class="p">,</span> <span class="s1">&#39;sightUrl&#39;</span><span class="p">,</span> <span class="s1">&#39;commentCount&#39;</span><span class="p">,</span> <span class="s1">&#39;rate&#39;</span><span class="p">,</span> <span class="s1">&#39;imgUrl&#39;</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">fieldnames</span> <span class="o">=</span> <span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    
    <span class="k">for</span> <span class="n">sight</span> <span class="ow">in</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;list&#39;</span><span class="p">]:</span>
        <span class="n">sight_info</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;sightName&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;cnname&#39;</span><span class="p">],</span>
                      <span class="s1">&#39;sightUrl&#39;</span><span class="p">:</span> <span class="s1">&#39;https:&#39;</span> <span class="o">+</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;url&#39;</span><span class="p">],</span>
                      <span class="s1">&#39;commentCount&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;commentCount&#39;</span><span class="p">],</span>
                      <span class="s1">&#39;rate&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;grade&#39;</span><span class="p">],</span>
                      <span class="s1">&#39;imgUrl&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;photo&#39;</span><span class="p">]</span>
                     <span class="p">}</span>
        
        <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">sight_info</span><span class="p">)</span>
</code></pre></div><p>代码运行后(其中 resp 是上一步对第二页发起的 requests.post 请求的返回对象)，检查harbin_sight.csv，现在该文件内暂时只存储了第二页的景点信息</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;harbin_sight.csv&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;景点数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/8-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三批量采集">三、批量采集</h2>
<p>以哈尔滨为例，景点页面一共有92页，下面批量采集这92页的信息。完整代码如下</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="n">url</span> <span class="o">=</span> <span class="s1">&#39;https://place.qyer.com/poi.php?action=list_json&#39;</span>
<span class="c1">#控制翻页的字典</span>
<span class="n">formdata</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;page&#39;</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
            <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;city&#39;</span><span class="p">,</span>
            <span class="s1">&#39;pid&#39;</span><span class="p">:</span> <span class="mi">11597</span><span class="p">,</span>
            <span class="s1">&#39;sort&#39;</span><span class="p">:</span> <span class="mi">32</span><span class="p">,</span>
            <span class="s1">&#39;subsort&#39;</span><span class="p">:</span> <span class="s1">&#39;all&#39;</span><span class="p">,</span>
            <span class="s1">&#39;isnominate&#39;</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span>
            <span class="s1">&#39;haslastm&#39;</span><span class="p">:</span> <span class="s1">&#39;false&#39;</span><span class="p">,</span>
            <span class="s1">&#39;rank&#39;</span><span class="p">:</span> <span class="mi">6</span><span class="p">}</span>

<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;User-Agent&#39;</span><span class="p">:</span> <span class="s1">&#39;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36&#39;</span><span class="p">}</span>


<span class="c1">#新建csv，存储数据</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;harbin_sight.csv&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvf</span><span class="p">:</span>
    <span class="c1">#设置csv的字段</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;sightName&#39;</span><span class="p">,</span> <span class="s1">&#39;sightUrl&#39;</span><span class="p">,</span> <span class="s1">&#39;commentCount&#39;</span><span class="p">,</span> <span class="s1">&#39;rate&#39;</span><span class="p">,</span> <span class="s1">&#39;imgUrl&#39;</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvf</span><span class="p">,</span> <span class="n">fieldnames</span> <span class="o">=</span> <span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    
    <span class="c1">#采集哈尔滨从第1页到92页</span>
    <span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">93</span><span class="p">):</span>
        <span class="c1">#更新formdata网页数信息，相当于翻页</span>
        <span class="n">formdata</span><span class="p">[</span><span class="s1">&#39;page&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">page</span>
        <span class="c1">#time.sleep(1) 控制访问速度，每秒访问1次</span>
        <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">formdata</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
        
        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;正在采集哈尔滨第</span><span class="si">{}</span><span class="s1">页信息&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>
        <span class="c1">#存储数据</span>
        <span class="k">for</span> <span class="n">sight</span> <span class="ow">in</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;data&#39;</span><span class="p">][</span><span class="s1">&#39;list&#39;</span><span class="p">]:</span>
            <span class="n">sight_info</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;sightName&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;cnname&#39;</span><span class="p">],</span>
                          <span class="s1">&#39;sightUrl&#39;</span><span class="p">:</span> <span class="s1">&#39;https:&#39;</span> <span class="o">+</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;url&#39;</span><span class="p">],</span>
                          <span class="s1">&#39;commentCount&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;commentCount&#39;</span><span class="p">],</span>
                          <span class="s1">&#39;rate&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;grade&#39;</span><span class="p">],</span>
                          <span class="s1">&#39;imgUrl&#39;</span><span class="p">:</span> <span class="n">sight</span><span class="p">[</span><span class="s1">&#39;photo&#39;</span><span class="p">]</span>
                         <span class="p">}</span>
            <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">sight_info</span><span class="p">)</span>
    
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">正在采集哈尔滨第1页信息
正在采集哈尔滨第2页信息
......
正在采集哈尔滨第91页信息
正在采集哈尔滨第92页信息

</code></pre></div><br>
<p>检查数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;harbin_sight.csv&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;景点数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/9-df.png" alt=""  />
</p>
<p><img loading="lazy" src="img/10-check.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四获取代码">四、获取代码</h2>
<p><a href="%E7%A9%B7%E6%B8%B8%E4%BB%A3%E7%A0%81.ipynb"><strong>点击下载本文代码</strong></a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>管理世界 | 机器学习如何赋能管理学研究？——国内外前沿综述和未来展望</title>
      <link>https://textdata.cn/blog/2023-10-11-how-can-machine-learning-empower-management-research/</link>
      <pubDate>Wed, 11 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-11-how-can-machine-learning-empower-management-research/</guid>
      <description>机器学习正在深刻改变管理学的研究范式与方法。如何运用机器学习更好地赋能管理学研究已经成为学术界关注的前沿热点议题。然而，机器学习在中国管理学研究中的应用仍处于初级阶段。**本文基于1999～2021年发表在工商管理和会计财务两大研究领域的国内外顶级期刊的学术文献，识别了学术界借助机器学习开展管理学实证研究的4种核心途径：变量测量、事件预测（包括事件分类）、因果推断和理论构建**；梳理了每个途径的代表性文献的研究主题、研究问题、数据集、机器学习算法和研究结论；提出了使用机器学习赋能管理学研究的主要策略，并讨论了中国学者运用机器学习开展中国特色管理理论研究的未来机会。本文显示：将机器学习与传统计量经济学相结合有助于做出更加精准的因果推断；机器学习能够在模式发现这一理论构建的关键步骤中发挥重要作用；将机器学习与多案例分析相结合有助于富有成效地开展理论构建。本文为如何采用机器学习提升管理学研究质量、推进管理学研究范式变革和构建中国特色管理理论提供了方法论指引和方向性启示。</description>
      <content:encoded><![CDATA[<h2 id="一论文">一、论文</h2>
<p>刘景江,郑畅然,洪永淼.<strong>机器学习如何赋能管理学研究？——国内外前沿综述和未来展望</strong>[J].<strong>管理世界</strong>,2023,39(09):191-216.</p>
<p>摘要: 机器学习正在深刻改变管理学的研究范式与方法。如何运用机器学习更好地赋能管理学研究已经成为学术界关注的前沿热点议题。然而，机器学习在中国管理学研究中的应用仍处于初级阶段。<strong>本文基于1999～2021年发表在工商管理和会计财务两大研究领域的国内外顶级期刊的学术文献，识别了学术界借助机器学习开展管理学实证研究的4种核心途径：变量测量、事件预测（包括事件分类）、因果推断和理论构建</strong>；梳理了每个途径的代表性文献的研究主题、研究问题、数据集、机器学习算法和研究结论；提出了使用机器学习赋能管理学研究的主要策略，并讨论了中国学者运用机器学习开展中国特色管理理论研究的未来机会。本文显示：将机器学习与传统计量经济学相结合有助于做出更加精准的因果推断；机器学习能够在模式发现这一理论构建的关键步骤中发挥重要作用；将机器学习与多案例分析相结合有助于富有成效地开展理论构建。本文为如何采用机器学习提升管理学研究质量、推进管理学研究范式变革和构建中国特色管理理论提供了方法论指引和方向性启示。</p>
<br>
<br>
<h2 id="二文献范围">二、文献范围</h2>
<p>首先，本文选取 <strong>UTD-24</strong> 期刊，以“machine learning”、“decision tree”、“support vector machine”、“random forest”、“artificial neural network”和“deep learning”等为关键词，对目标期刊的所有在库文章进行全篇检索，把正式发表时间限定到 2021 年 12 月末，得到一张包含 1258 篇文献的初步文献清单。其中，会计领域 52 篇，财务领域 72 篇，信息系统领域 322 篇，营销领域 208 篇，管理科学领域 522 篇，工商管理领域 82 篇。<strong>考虑到篇幅有限和用途梳理的全面性，本文只关注工商管理和会计财务两大研究领域</strong>。</p>
<p>接着，类似地，本文选取“2021 中国最具国际影响力学术期刊（人文社会科学）”前 20 名中的管理学期刊，用相似的关键词，检索 2004~2021 年运用机器学习方法进行实证研究的文章，其中符合标准的论文有工商管理 15 篇、会计财务 28 篇。</p>
<p><img loading="lazy" src="img/20231011-%e5%8f%91%e6%96%87%e8%b6%8b%e5%8a%bf-%e5%9b%bd%e9%99%85.png" alt=""  />
</p>
<p><img loading="lazy" src="img/20231011-%e5%8f%91%e6%96%87%e8%b6%8b%e5%8a%bf-%e5%9b%bd%e5%86%85.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三机器学习4大核心用途">三、机器学习4大核心用途</h2>
<p>在确定研究目标后，我们按照以下 3 个步骤对数据进行编码和文献分析。第一步，根据以往理论和实证研究，我们总结出机器学习方法在管理学实证研究中的 4 种核心用途：<strong>变量测量、事件预测、因果推断和理论构建</strong>，如图 3 所示。</p>
<p><img loading="lazy" src="img/20231011-%e6%9c%ba%e5%99%a8%e5%ad%a6%e4%b9%a04%e5%a4%a7%e6%a0%b8%e5%bf%83%e7%94%a8%e9%80%94.png" alt=""  />
</p>
<ul>
<li>
<p><strong>变量测量</strong>是根据一种规则，用数量的方法描述研究对象所具备的某种特征或行为，其目标是对变量之间的关系进行量化推断（陈晓萍等，2008）。</p>
</li>
<li>
<p><strong>事件预测</strong>是使用已掌握的经验或知识，预先推知和判断事物未来发展状况（阿西，2019），其目标是预料来自不同观测总体的样本已经或将要在未来实现的结果（格里默等，2021）。</p>
</li>
<li>
<p><strong>因果推断</strong>是借助理论和对制度细节的深入了解，估计事件和选择对给定结果的影响（坎宁安，2021）， 其目标是比较在同一干预措施下不同反事实（Counterfactual）结果之间的差异（格里默等，2021）。</p>
</li>
<li>
<p><strong>理论构建</strong>是构建概念及其相互关系，以展示一种现象是如何和为什么发生的过程（焦亚、皮特雷，1990；科利、焦亚，2011；克里斯蒂安森、钱丹，2017），其目标是建立稳健且具有可解释性的理论。</p>
</li>
</ul>
<p>变量测量、事件预测、因果推断和理论构建是管理学实证研究的 4 项关键任务。它们既相互区别又紧密关联。理论构建在管理学实证研究中占据着核心地位（班伯格，2018）。管理学顶级期刊格外强调文章的理论贡献（科利、焦亚，2011）。实证研究的核心目标是理论构建。衡量一个“好”的实证研究的首要标准是它能够建立稳健且具有可解释性的理论。因果推断是理论构建的先决条件。事件预测是因果推断的必要前提。变量测量是开展管理学实证研究的根基。<strong>总之，这 4 个途径相辅相成，构成目的与手段的关系，「变量测量」是「事件预测、因果推断和理论构建」的基础。</strong></p>
<p><br><br></p>
<h2 id="四机器学习在管理学研究中的应用">四、机器学习在管理学研究中的应用</h2>
<p>工商管理和会计财务作为管理学的两大核心研究领域，包含大量来自个人、企业和政府的文本、图像、音频、视频等极具信息价值的非结构化数据。传统方法无法对这些非结构化数据进行量化分析，只能进行定性分析。借助机器学习方法，学者们可以从这些非结构化数据中挖掘、提取和构建诸如高管人格特质、管理者自恋、公司文化、媒体文章语调和投资者情绪等有意义的变量（洪永淼、汪寿阳，2021a，2021b），运用灵活的函数形式和降维技术来实现更精准的预测（洪永淼、汪寿阳，2021b，2021c），利用正则化和交叉验证方法提高模型泛化能力以帮助因果推断和理论构建（蒂德尔、艾森哈特，2020；蒂芬，2019；乔杜里等，2021；瓦里安，2014），从而更好地开展这两大领域中关键问题的实证研究。因此，本部分以这两大研究领域为例，以机器学习赋能管理学研究的 4 种核心用途为主线，全面回顾和系统梳理 UTD-24 期刊和国内顶级管理学期刊于 1999~2021 年正式发表的文章。具体来说，本文遵循重点性原则和典型性原则，按照这些领域和用途，总结归纳了代表性文献的研究主题、研究问题、数据集、机器学习算法和研究结论⑧。</p>
<h3 id="41-工商管理">4.1 工商管理</h3>
<p><img loading="lazy" src="img/20231011-table-1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/20231011-table-2.png" alt=""  />
</p>
<h3 id="42-会计学">4.2 会计学</h3>
<p><img loading="lazy" src="img/20231011-table-3.png" alt=""  />
</p>
<p><img loading="lazy" src="img/20231011-table-4.png" alt=""  />
</p>
<p><img loading="lazy" src="img/20231011-table-5.png" alt=""  />
</p>
<p><br><br></p>
<p>&hellip;&hellip;</p>
<h2 id="五结论与讨论">五、结论与讨论</h2>
<p>以弥补管理学研究传统上所存在的短板为目标，本研究采用 1999~2021 年发表在工商管理和会计财务两大研究领域的国内外顶级期刊的学术文献，识别了学术界借助机器学习赋能管理学实证研究的核心途径；从多个角度系统梳理了这些途径的代表性文献；详细阐述了机器学习赋能管理学研究的主要策略，并重点讨论了中国学者运用机器学习开展中国特色管理理论研究的未来机会（主题方向、重要问题、实施策略和主要建议）。本研究得出如表 6 所示的主要结论。可以预见的是，在未来，变量测量、事件预测、因果推断、理论构建等 4 种核心途径的融合将日益紧密。它们的融合为机器学习赋能管理学研究提供了更加具有深度和广度的未来机会。例如，事件预测可以用来揭示数据中难以假设的复杂和未知关系，开发新的理论构念及其测量，或者按照预测的相对精准度比较竞争理论（克鲁帕、米努蒂-梅扎，2022），从而更好地进行理论构建。</p>
<p><img loading="lazy" src="img/20231011-table-6.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>管理科学学报 | 使用LDA算法计算政策扩散速度与扩散程度</title>
      <link>https://textdata.cn/blog/2023-10-10-measure-the-speed-of-policy-diffusion-from-top-to-down/</link>
      <pubDate>Tue, 10 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-10-measure-the-speed-of-policy-diffusion-from-top-to-down/</guid>
      <description>价值不断提升的政府网站内容数据不仅可以描绘政策注意力，也为中央政策向地方层级扩散的测量与评估提供了新的机遇.在我国多层级政府组织治理模式下，地方政府对中央政策的贯彻落地是政策生效的前提条件.对纵向政策扩散的有效测量和评估将有助于理解政策扩散机制，提升政策落地效果.本文基于全国省、市级政府门户网站每日内容更新数据，通过**概率主题建模方法建构主题概率矩阵，刻画政府对不同主题的注意力分配差异，并基于概率主题建模结果构建函数测量地方政府对中央政策的扩散速度与扩散程度。本文讨论了测度建构的原理和细节，并引入机器学习方法进行鲁棒性检验**，通过多政策主题扩散的混合回归分析了影响短周期政策层级扩散的因素.研究以测度建构为突破口打通文本数据挖掘到有价值公共管理知识的“中间层”,对政策信息学在政策扩散及评估监测中的应用前景进行了初步探索.</description>
      <content:encoded><![CDATA[<h2 id="一文献">一、文献</h2>
<p>张楠,黄梅银,罗亚,马宝君.<strong>全国政府网站内容数据中的知识发现：从注意力分配到政策层级扩散</strong>[J].<strong>管理科学学报</strong>,2023,26(05):154-173.</p>
<p>摘要:价值不断提升的政府网站内容数据不仅可以描绘政策注意力，也为中央政策向地方层级扩散的测量与评估提供了新的机遇.在我国多层级政府组织治理模式下，地方政府对中央政策的贯彻落地是政策生效的前提条件.对纵向政策扩散的有效测量和评估将有助于理解政策扩散机制，提升政策落地效果.本文基于全国省、市级政府门户网站每日内容更新数据，通过<strong>概率主题建模方法建构主题概率矩阵，刻画政府对不同主题的注意力分配差异，并基于概率主题建模结果构建函数测量地方政府对中央政策的扩散速度与扩散程度。本文讨论了测度建构的原理和细节，并引入机器学习方法进行鲁棒性检验</strong>，通过多政策主题扩散的混合回归分析了影响短周期政策层级扩散的因素.研究以测度建构为突破口打通文本数据挖掘到有价值公共管理知识的“中间层”,对政策信息学在政策扩散及评估监测中的应用前景进行了初步探索.</p>
<p><br><br></p>
<h2 id="二数据处理">二、数据处理</h2>
<h3 id="21-数据准备">2.1. 数据准备</h3>
<p>基于获取到的 170 万余条政府网站内容数据，本文选择潜在狄利克雷分配模型（LDA）进行数据分析，以获取网络政府的政策议题注意力分布情况，数据处理路径见图1。</p>
<p>数据抓取单位为政府门户网站每一个页面的内容信息，包括页面URL地址、标题、发布时间、文章发布单位或转载来源、关键词、作者、摘要、具体内容等。数据入库前，还通过元素提取（如网页名称、大小、日期、标题、文字内容等）、数据排重和信息过滤（广告过滤、URL过滤等）等前期处理工作。</p>
<p><img loading="lazy" src="img/20231010-%e5%9f%ba%e4%ba%8eLDA%e7%9a%84%e6%94%bf%e5%ba%9c%e6%b3%a8%e6%84%8f%e5%8a%9b%e5%88%86%e9%85%8d%e6%95%b0%e6%8d%ae%e5%a4%84%e7%90%86%e8%b7%af%e5%be%84.png" alt=""  />
</p>
<br>
<h3 id="22-lda建模">2.2. LDA建模</h3>
<p>LDA建模有两个步骤</p>
<ol>
<li>首先最关键的是确定<strong>文档主题数</strong>，通常依据平均困惑度(perplexity)选取。论文最终将主题数定为120。</li>
<li>确定好<strong>文档主题数</strong>即可开展LDA训练，对170万余条文本训练出LDA模型，同时得到<strong>文档-主题概率矩阵</strong>，该矩阵有120列、170多万行。<br>
<img loading="lazy" src="img/20231010-%e7%a1%ae%e5%ae%9aLDA%e5%b9%b3%e5%9d%87%e5%9b%b0%e6%83%91%e5%ba%a6%e5%80%bc.png" alt=""  />
</li>
</ol>
<p>训练完LDA模型，虽然文档主题数设置为120， 但经过甄别，最终确定有112个主题具有可解释性。
<img loading="lazy" src="img/20231010-LDA%e4%b8%bb%e9%a2%98%e5%88%97%e8%a1%a8.png" alt=""  />
</p>
<p>下图是各主题含义及概率占比(面积大小)。
<img loading="lazy" src="img/20231010-%e6%94%bf%e5%ba%9c%e7%bd%91%e7%ab%99%e5%af%b9%e4%b8%8d%e5%90%8c%e4%b8%bb%e9%a2%98%e7%9a%84%e6%b3%a8%e6%84%8f%e5%8a%9b%e5%88%86%e9%85%8d%e6%83%85%e5%86%b5.png" alt=""  />
</p>
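以上「困惑度选主题数 → 训练LDA → 得到文档-主题概率矩阵」的流程，可以用如下示意代码表达。这里用 sklearn 的 LatentDirichletAllocation 代替论文的具体实现，语料为假设的已分词小样本，候选主题数也只取两个，仅演示思路。

```python
# 示意：先用困惑度(perplexity)在候选主题数中择优，再训练LDA得到文档-主题概率矩阵
# 注意：这是假设的小语料演示，并非论文原实现(论文最终取120个主题)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 已分词、空格分隔的示例文本(真实场景为170万余条政府网站内容)
docs = ['政策 扶持 医疗 卫生', '土地 使用权 出让 政策',
        '医疗 卫生 监管 改革', '土地 出让 规划 管理']
dtm = CountVectorizer().fit_transform(docs)

# 对每个候选主题数计算困惑度，取困惑度最低者
perplexities = {}
for k in [2, 3]:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    perplexities[k] = lda.perplexity(dtm)
best_k = min(perplexities, key=perplexities.get)

# 文档-主题概率矩阵：每行一个文档，每列一个主题，每行概率之和为1
doc_topic = LatentDirichletAllocation(n_components=best_k,
                                      random_state=0).fit_transform(dtm)
print(doc_topic.shape)
```

真实任务中把 docs 换成分词后的全量语料、把候选主题数换成更大的搜索范围即可。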
<br>
<h2 id="23-扩散速度与扩散程度函数构建">2.3. 扩散速度与扩散程度函数构建</h2>
<p><img loading="lazy" src="img/20231010-%e6%94%bf%e7%ad%96%e6%bf%80%e5%8a%b1%e5%93%8d%e5%ba%94%e6%97%b6%e9%97%b4%e7%82%b9.png" alt=""  />
</p>
<p><br><img loading="lazy" src="img/20231010-%e6%94%bf%e7%ad%96%e6%89%a9%e6%95%a3%e9%80%9f%e5%ba%a6.png" alt=""  />

<br><img loading="lazy" src="img/20231010-%e5%93%8d%e5%ba%94%e6%94%bf%e7%ad%96%e7%9a%84%e6%8c%81%e7%bb%ad%e6%80%a7%e5%9b%9e%e5%ba%94%e7%a8%8b%e5%ba%a61.png" alt=""  />
</p>
<p><img loading="lazy" src="img/20231010-%e5%93%8d%e5%ba%94%e6%94%bf%e7%ad%96%e7%9a%84%e6%8c%81%e7%bb%ad%e6%80%a7%e5%9b%9e%e5%ba%94%e7%a8%8b%e5%ba%a62.png" alt=""  />
<br></p>
<p>面对中央政府希望通过政府网站和其他网络政府入口监测政策落实、督查政府履职、评估回应能力的一系列需求，本文尝试基于网络政府大数据对中央政策扩散情况展开分析。图4展示了 2018 年地级市对中央 13 项政策的回应扩散速度情况。<strong>曲线越扁平，地级市政府扩散响应时间越短，层级扩散速度越快</strong>。平均扩散速度为 20.04 天，意味着中央出台政策后地级市政府网站上平均 20 天就会对中央政策予以回应。其中，地级市政府回应最快的是医疗卫生监管主题，平均扩散时间为 12.07 天；最慢的是土地使用权主题，达 25.11 天。从 0.5 分位数来看，当不同政策主题的中央政策激励产生后，超过一半的城市在 20 天内快速响应中央政策，不同政策主题扩散速度存在差异。</p>
<p><img loading="lazy" src="img/20231010-%e6%94%bf%e5%ba%9c%e5%93%8d%e5%ba%94%e9%80%9f%e5%ba%a6.png" alt=""  />
</p>
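扩散速度的统计口径可以用下面的示意代码表达：以中央政策出台日为起点，计算各地级市首次回应的间隔天数，均值即平均扩散速度，中位数(0.5分位数)刻画「超过一半城市」的响应时间。其中日期与城市均为假设数据。

```python
import pandas as pd

# 假设数据：某项中央政策出台日期与三个地级市政府网站的首次回应日期
central_date = pd.Timestamp('2018-03-01')
df = pd.DataFrame({
    'city': ['城市A', '城市B', '城市C'],
    'first_response': pd.to_datetime(['2018-03-13', '2018-03-20', '2018-03-28']),
})

# 扩散(响应)间隔天数：间隔越短，层级扩散速度越快
df['lag_days'] = (df['first_response'] - central_date).dt.days

print(df['lag_days'].mean())    # 平均扩散速度(天)
print(df['lag_days'].median())  # 0.5分位数：一半城市在此天数内响应
```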
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>企业ESG行为的文本度量法</title>
      <link>https://textdata.cn/blog/2023-10-07-esg-measurement/</link>
      <pubDate>Sat, 07 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-10-07-esg-measurement/</guid>
      <description>本文一个关键的贡献是使用机器学习方法从文本数据中评估初创公司环境、社会和治理（ESG）属性</description>
      <content:encoded><![CDATA[<h2 id="文献">文献</h2>
<p>Mansouri S, Momtaz P P. Financing sustainable entrepreneurship: ESG measurement, valuation, and performance[J]. <em>Journal of Business Venturing</em>, 2022, 37(6):106258.</p>
<br>
<h2 id="摘要">摘要</h2>
<p>可持续发展导向对初创企业的初始估值有积极影响，但对其融资后财务业绩有负面影响。在其他条件相同的情况下，将可持续发展方向提高一个标准差将使初创公司的融资金额增加 28%，并将投资者每个融资后年度的异常回报减少 16%。结果适用于基于区块链的众筹活动（也称为首次代币发行（ICO）或代币发行）的大量样本。<strong>本文一个关键的贡献是使用机器学习方法从文本数据中评估初创公司环境、社会和治理（ESG）属性</strong>。</p>
<br>
<br>
<h2 id="量化初创企业的esg属性">量化初创企业的ESG属性</h2>
<p>现有研究对如何衡量初创企业的ESG属性还未形成统一框架，且存在以下两个问题：（1）现有的ESG指标主要由几个数据供应商提供，而供应商之间的相关性非常低；（2）现有的ESG评级不适用于初创企业，即存在数据缺失。因此，本文采用一种机器学习的方法，量化初创企业的ESG属性：</p>
<ol>
<li>
<p><strong>文本预处理</strong>：从公司网站等渠道收集ICO白皮书后，使用斯坦福大学开发的CoreNLP管道生成句子的依赖性表示，并识别一些搭配词；</p>
</li>
<li>
<p><strong>建立种子词</strong>：收集《金融时报》中所有带有“ESG投资、道德金钱”标签的文章，采用标准的词袋模型提炼出现频率最高的二元组、三元组词汇，然后对这些词汇进行人工筛查，并在此基础上手动添加一些与代币发行有关的词汇，得到三个维度的种子词数为：70、38、46；</p>
</li>
<li>
<p><strong>选取联想词</strong>：使用Word2vec模型扩充种子词，为ESG的每个维度挑选500个最为相近的术语，经再次筛查后，得到三个维度的词典数量为：508、463、524；</p>
</li>
<li>
<p><strong>计算ESG分数</strong></p>
<p><img loading="lazy" src="formular.png" alt=""  />
</p>
<p>在（1）式中，代表白皮书i中术语的计数，c(n)是相应的单词列表的大小，即用频率来表征企业在某一维度的得分，然后将三个维度的得分加总得到最终的ESG分数；</p>
</li>
</ol>
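上述第(4)步「用词表词频表征各维度得分再加总」的计算，可以用如下示意代码表达。词表与文本均为假设示例，与论文实际使用的 508/463/524 词词典无关。

```python
# 示意：对E/S/G三个维度分别统计词表词在文本中的出现次数，
# 除以该维度词表大小c(n)后加总，得到ESG分数(词表为假设示例)
esg_dicts = {
    'E': ['clean energy', 'carbon neutral'],
    'S': ['community', 'diversity'],
    'G': ['transparency', 'board independence'],
}

def esg_score(text: str) -> float:
    """按维度计算 词表词频/词表大小，三个维度得分相加"""
    text = text.lower()
    return sum(
        sum(text.count(term) for term in terms) / len(terms)
        for terms in esg_dicts.values()
    )

score = esg_score('Our clean energy project supports the community, '
                  'values transparency and diversity.')
print(score)  # E:1/2 + S:2/2 + G:1/2 = 2.0
```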
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>中国管理科学 | 使用业绩说明会文本数据测量上市公司前瞻性信息</title>
      <link>https://textdata.cn/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/</link>
      <pubDate>Fri, 08 Sep 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-09-08-earnings-communication-conference-forward-looking-statements-information/</guid>
      <description>业绩说明会， 是我国上市公司和中小投资 者沟通交流的重要载体。 在年报披露后， 能够 帮助投资者快速、准确地抓取信息披露重点， 全面了解企业发展状况， 增进对企业价值及经 营理念的认同。上市公司的业绩说明会是金融领域中的重要事件，它为投资者、分析师和其他利益相关者提供了一个与公司管理层直接交流的平台。这种数据集的学术价值多方面体现。</description>
      <content:encoded><![CDATA[<p>最近几个月没怎么分享长技术文，正好昨天分享的付费数据集涉及到一篇论文，感觉用到了很多Python的地方，就想着做一期。这篇论文的Python实现，技术要点有两个部分</p>
<ol>
<li><strong>「构建词典」</strong>； 训练word2vec预训练语言模型，并使用该模型扩展出<strong>前瞻性词典集</strong></li>
<li><strong>「算前瞻性指标」</strong>； 根据<strong>前瞻性词典集</strong>，统计每个企业业绩说明会内的前瞻性词在总词数中的比例</li>
</ol>
<p>这两部分，分别对应本文 <strong>「二、实验-构建词典」</strong>、<strong>「三、计算前瞻性」</strong>。</p>
<p><strong>内容较长， 可能对初学小白不友好。 学完大邓课程「<a href="https://textdata.cn/blog/management_python_course/">Python实证指标构建与文本分析</a>」的同学，阅读起来会轻松一些</strong>。</p>
<p><br><br></p>
<p><strong>许帅,邵帅,何贤杰.业绩说明会前瞻性信息对分析师盈余预测准确性的影响——信口雌黄还是言而有征[J].中国管理科学:1-15.</strong></p>
<blockquote>
<p>摘要:本文以2007—2020年上市公司业绩说明会为背景，研究前瞻性信息披露对分析师预测的影响，发现业绩说明会中的前瞻性信息可以显著提升分析师盈余预测准确性。公司的信息不对称程度越高，前瞻性信息对分析师预测准确性提升越多。分析师专长工作经验越丰富，具备更强的信息捕捉能力，可以更好地吸收与理解业绩说明会中的前瞻性信息，做出更准确的预测。进一步，本文对前瞻性信息影响分析师预测的路径进行了讨论，认为前瞻性信息可能通过吸引分析师和机构投资者调研，增进分析师对上市公司经营状况的了解，进而提升盈余预测准确性。此外，本文发现，前瞻性信息中业绩相关类信息因具有更高的可信度，且与盈余因子直接相关，能够显著提升分析师盈余预测准确性。本研究为管理层披露与分析师的互动研究提供了增量证据，研究结果支持了业绩说明会有效性，对未来监管部门制定相关信息披露政策提供依据和建议。</p>
</blockquote>
<p><a href="https://ir.p5w.net/roadshow/"><img loading="lazy" src="img/p5w.png" alt=""  />
</a></p>
<p><br><br></p>
<h2 id="一前瞻性指标衡量">一、前瞻性指标衡量</h2>
<p>本文关注业绩说明会中前瞻性信息披露的比重，借鉴Li [5]、Muslu等 [6] 和马黎珺等 [14] 对前瞻性信息的定义，采用“词袋法”构建前瞻性指标，<strong>运用Python的jieba中文分词技术，统计问答阶段前瞻性词汇词频占业绩说明会文本总词频（去除停用词）的比例</strong>。同时，手工剔除了诸如“请关注后续公告”、“详见以后公告”等不具备实质性前瞻性信息的词频。</p>
<p>在词典的选取上，本文前瞻性词典集借鉴胡楠和薛付婧 [15] 的种子词汇，为了保证词汇的全面性，还将所有种子词导入到开源分析工具word2vec中，并在业绩说明会语料库中寻找与种子词内容接近程度最高的词汇，其中包含（1）管理团队的预测，譬如“计划/预计/预测”等表述；（2）出现未来时点的表述，譬如“未来/以后/明年/下半年”等表述；（3）暗示企业即将发生的动作，譬如“有望/后续”等表述，共计174个前瞻性词汇（详见附录2）。前瞻性指标比重越大，表明公司的前瞻性信息披露越多。</p>
<p><img loading="lazy" src="img/formular.png" alt=""  />
</p>
<p><img loading="lazy" src="img/dict1.png" alt=""  />
</p>
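前瞻性指标的统计口径可以用如下示意代码表达：先去停用词，再数前瞻性词占总词数的比例。论文使用jieba分词与174个词的完整词典，这里的词表、停用词和文本均为假设的节选。

```python
# 示意：前瞻性指标 = 前瞻性词词频 / 去停用词后的总词频
fl_words = {'计划', '预计', '未来', '有望', '下半年'}  # 前瞻性词典(节选，假设)
stopwords = {'的', '了', '我们'}                       # 停用词(假设)

# 真实场景：tokens = jieba.lcut(业绩说明会问答文本)
tokens = ['我们', '预计', '下半年', '的', '营收', '有望', '增长']
tokens = [t for t in tokens if t not in stopwords]

fl_ratio = sum(t in fl_words for t in tokens) / len(tokens)
print(fl_ratio)  # 3个前瞻性词 / 5个总词 = 0.6
```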
<br>
<h2 id="二实验-构建词典">二、实验-构建词典</h2>
<h3 id="21-整理数据">2.1 整理数据</h3>
<p>把 <a href="https://textdata.cn/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/"><strong>数据集 | 84w条业绩说明会问答数据(2005-2023)</strong></a>汇总到一个txt文件内。为了保证问答上下文一致，问答要放在相邻处。运行以下代码前，可能需要安装依赖：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install openpyxl
pip3 install pandas
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;业绩说明会问答05-23.xlsx&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;提问内容&#39;</span><span class="p">]</span><span class="o">+</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;回答内容&#39;</span><span class="p">]</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;业绩说明会05-23.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;a+&#39;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="c1">#为了保证问答上下文一致， 问答要放在相邻处</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">to_list</span><span class="p">())</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</code></pre></div><br>
<h3 id="22-训练word2vec">2.2 训练word2vec</h3>
<p>一般都是使用gensim库对 <strong>「业绩说明会05-23.txt」</strong> 数据集进行训练，训练流程我已经封装到cntext库内。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install cntext==1.8.6
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">os</span>


<span class="c1">#Init W2VModels. Support English and Chinese</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">W2VModels</span><span class="p">(</span><span class="n">cwd</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">(),</span> 
                     <span class="n">lang</span><span class="o">=</span><span class="s1">&#39;chinese&#39;</span><span class="p">)</span>  <span class="c1">#corpus data w2v_corpus.txt</span>


<span class="c1">#训练结束后，「业绩说明会05-23.100.6.bin」会出现在「output/Word2Vec」文件夹内 </span>
<span class="n">model</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">input_txt_file</span><span class="o">=</span><span class="s1">&#39;业绩说明会05-23.txt&#39;</span><span class="p">,</span> 
            <span class="n">model_name</span><span class="o">=</span><span class="s1">&#39;业绩说明会05-23.100.6.bin&#39;</span><span class="p">)</span>
</code></pre></div><br>
<p>需要注意，output/Word2Vec文件夹内会同时含有</p>
<ul>
<li><strong>业绩说明会05-23.100.6.bin</strong></li>
<li><strong>业绩说明会05-23.100.6.bin.vectors.npy</strong></li>
</ul>
<p>两个文件都不要删除， 这些是预训练词向量文件。</p>
<br>
<h3 id="23-扩展词典">2.3 扩展词典</h3>
<p>根据前瞻性研究需要，整理了一些种子词</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">seedwords</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;计划&#39;</span><span class="p">,</span> <span class="s1">&#39;预计&#39;</span><span class="p">,</span> <span class="s1">&#39;未来&#39;</span><span class="p">,</span> <span class="s1">&#39;目标&#39;</span><span class="p">,</span> <span class="s1">&#39;可能&#39;</span><span class="p">,</span> <span class="s1">&#39;如果&#39;</span><span class="p">,</span> <span class="s1">&#39;机遇&#39;</span><span class="p">,</span> <span class="s1">&#39;预期&#39;</span><span class="p">,</span> <span class="s1">&#39;挑战&#39;</span><span class="p">,</span> <span class="s1">&#39;预测&#39;</span><span class="p">,</span> <span class="s1">&#39;今后&#39;</span><span class="p">,</span> <span class="s1">&#39;目的&#39;</span><span class="p">,</span> <span class="s1">&#39;契机&#39;</span><span class="p">,</span> <span class="s1">&#39;前景&#39;</span><span class="p">,</span> <span class="s1">&#39;希望&#39;</span><span class="p">,</span> <span class="s1">&#39;展望&#39;</span><span class="p">,</span> <span class="s1">&#39;相信&#39;</span><span class="p">,</span> <span class="s1">&#39;愿景&#39;</span><span class="p">,</span> <span class="s1">&#39;期待&#39;</span><span class="p">,</span> <span class="s1">&#39;明年&#39;</span><span class="p">,</span> <span class="s1">&#39;期望&#39;</span><span class="p">]</span>
</code></pre></div><ol>
<li>导入word2vec预训练语言模型文件 <strong>业绩说明会05-23.100.6.bin</strong></li>
<li>寻找与种子词语义最相似的n个词。</li>
<li>经过人工检查，剔除n个词中与 <strong>前瞻性</strong> 无关的词语，最终得到 <strong>前瞻性词典</strong>(论文中是174个词)。</li>
</ol>
<p>但是，经过大邓测试发现，用业绩说明会语料训练得到的word2vec模型(业绩说明会05-23.100.6.bin)表现很差。</p>
<p>之前大邓用01-21年管理层讨论与分析训练过一个word2vec(<strong>mda01-21.200.6.bin</strong>)，</p>
<blockquote>
<p><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">预训练模型 | 金融会计类word2vec， 可扩展或构建领域内概念情感词典</a></p>
</blockquote>
<p>在这次前瞻性扩展词任务中，mda01-21.200.6.bin表现要远好于05-23.100.6.bin。</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">KeyedVectors</span>

<span class="k">def</span> <span class="nf">load_w2v</span><span class="p">(</span><span class="n">w2v_path</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    Load word2vec model
</span><span class="s2">
</span><span class="s2">    Args:
</span><span class="s2">        w2v_path (str): path of word2vec model
</span><span class="s2">
</span><span class="s2">    Returns:
</span><span class="s2">        model: word2vec model
</span><span class="s2">    &#34;&#34;&#34;</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Loading word2vec model...&#39;</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">KeyedVectors</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">w2v_path</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">model</span>


<span class="n">wv</span> <span class="o">=</span> <span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;Embeddings/业绩说明会05-23.100.6.bin&#39;</span><span class="p">)</span>
<span class="n">wv2</span> <span class="o">=</span> <span class="n">load_w2v</span><span class="p">(</span><span class="s1">&#39;Embeddings/mda01-21.200.6.bin&#39;</span><span class="p">)</span>
</code></pre></div><pre><code>Loading word2vec model...
Loading word2vec model...
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#词汇量</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">wv</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">wv2</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">))</span>
</code></pre></div><pre><code>198776
789539
</code></pre>
<p>​ <br>
<br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#查询某词的词向量</span>
<span class="n">wv</span><span class="o">.</span><span class="n">get_vector</span><span class="p">(</span><span class="s1">&#39;创新&#39;</span><span class="p">)</span>

<span class="c1">#查询多个词的词向量</span>
<span class="c1">#wv.get_mean_vector([&#39;创新&#39;, &#39;研发&#39;])</span>
</code></pre></div><pre><code>array([ 0.43675017,  0.74739504,  3.3765798 , -0.29287583,  0.40125442,
    0.9364979 ,  0.62465197,  0.06480039,  0.12256158, -2.0735328 ,
   -0.256066  , -1.7680115 , -0.8514873 , -0.756108  ,  1.3441261 ,
   -0.18098126,  2.7290103 , -4.6596766 ,  0.4046495 , -4.0644083 ,
    0.6022293 ,  1.3569978 ,  1.0036035 ,  0.06123297, -2.0733726 ,
    2.2704456 , -1.2935334 , -0.2855776 ,  1.588003  ,  1.5027634 ,
    2.0897112 , -0.8861778 ,  0.4014722 , -0.41474393, -1.5390201 ,
    0.23899865, -0.9823706 , -2.986944  , -2.6887195 , -2.2386284 ,
    0.04810223,  1.3241886 , -0.71262985, -0.8015585 ,  1.5249555 ,
   -3.611584  , -1.4187033 , -1.6014036 ,  0.816903  ,  3.1821172 ,
   -1.7302881 , -0.8280679 , -1.2833163 ,  0.65565586, -0.8857021 ,
    2.098562  ,  1.4773984 ,  1.0931807 , -0.02242889,  1.1279039 ,
   -2.2318523 ,  0.24540211,  0.17126203,  2.5631666 , -1.7135285 ,
    0.60896975, -0.2654438 ,  0.5718087 , -1.4996717 ,  1.0189433 ,
    1.0205768 ,  3.7439635 , -0.3575424 , -3.189775  ,  0.6117708 ,
   -0.60615975,  2.940066  , -0.89338064, -0.626806  , -1.4389508 ,
   -1.1291629 , -2.2354846 , -0.6873424 ,  1.9574465 , -1.2231802 ,
    1.2850708 , -0.7581777 ,  0.8184319 ,  1.542834  , -0.8685869 ,
    1.1841776 , -0.4524089 , -0.8068617 ,  0.01519055, -0.23408687,
   -0.51564324,  0.20584114,  0.14295417,  0.5481142 ,  2.523313  ],
  dtype=float32)
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">expand_dictionary</span><span class="p">(</span><span class="n">wv</span><span class="p">,</span> <span class="n">seedwords</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="s2">&#34;&#34;&#34;
</span><span class="s2">    According to the seed word file, select the top n words with the most similar semantics and save them in the directory save_dir.
</span><span class="s2">    
</span><span class="s2">    Args:
</span><span class="s2">        wv (Word2VecKeyedVectors): the word embedding model
</span><span class="s2">        seedwords (list): 种子词
</span><span class="s2">        topn (int, optional): Set the number of most similar words to retrieve to topn. Defaults to 100.
</span><span class="s2">        save_dir (str, optional): the directory to save the candidate words. Defaults to &#39;Word2Vec&#39;.
</span><span class="s2">    
</span><span class="s2">    Returns:
</span><span class="s2">    &#34;&#34;&#34;</span>
    <span class="n">simidx_scores</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="n">similars_candidate_idxs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1">#the candidate words of seedwords</span>
    <span class="n">dictionary</span> <span class="o">=</span> <span class="n">wv</span><span class="o">.</span><span class="n">key_to_index</span>
    <span class="n">seedidxs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1">#transform word to index</span>
    <span class="k">for</span> <span class="n">seed</span> <span class="ow">in</span> <span class="n">seedwords</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">seed</span> <span class="ow">in</span> <span class="n">dictionary</span><span class="p">:</span>
            <span class="n">seedidx</span> <span class="o">=</span> <span class="n">dictionary</span><span class="p">[</span><span class="n">seed</span><span class="p">]</span>
            <span class="n">seedidxs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">seedidx</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">seedidx</span> <span class="ow">in</span> <span class="n">seedidxs</span><span class="p">:</span>
        <span class="c1"># sims_words such as [(&#39;by&#39;, 0.99984), (&#39;or&#39;, 0.99982), (&#39;an&#39;, 0.99981), (&#39;up&#39;, 0.99980)]</span>
        <span class="n">sims_words</span> <span class="o">=</span> <span class="n">wv</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="n">wv</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">[</span><span class="n">seedidx</span><span class="p">],</span> <span class="n">topn</span><span class="o">=</span><span class="n">topn</span><span class="p">)</span>
        <span class="c1">#Convert words to index and store them</span>
        <span class="n">similars_candidate_idxs</span><span class="o">.</span><span class="n">extend</span><span class="p">([</span><span class="n">dictionary</span><span class="p">[</span><span class="n">sim</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="k">for</span> <span class="n">sim</span> <span class="ow">in</span> <span class="n">sims_words</span><span class="p">])</span>
    <span class="n">similars_candidate_idxs</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">similars_candidate_idxs</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">similars_candidate_idxs</span><span class="p">:</span>
        <span class="c1">#n_similarity接收词语(key)而非索引，需先用index_to_key转换</span>
        <span class="n">score</span> <span class="o">=</span> <span class="n">wv</span><span class="o">.</span><span class="n">n_similarity</span><span class="p">([</span><span class="n">wv</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">[</span><span class="n">idx</span><span class="p">]],</span> <span class="p">[</span><span class="n">wv</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">seedidxs</span><span class="p">])</span>
        <span class="n">simidx_scores</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">idx</span><span class="p">,</span> <span class="n">score</span><span class="p">))</span>
    <span class="n">simidxs</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">simidx_scores</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)]</span>

    <span class="n">simwords</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">wv</span><span class="o">.</span><span class="n">index_to_key</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">simidxs</span><span class="p">][:</span><span class="n">topn</span><span class="p">]</span>

    <span class="n">resultwords</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">resultwords</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">seedwords</span><span class="p">)</span>
    <span class="n">resultwords</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">simwords</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">resultwords</span>


<span class="c1">#论文中经过筛选留下174个词，实际上topn应远大于174</span>
<span class="c1">#为了节省版面，这里设置为50</span>
<span class="n">expand_dictionary</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv</span><span class="p">,</span> 
                  <span class="c1">#前瞻性种子词</span>
                  <span class="n">seedwords</span><span class="o">=</span> <span class="p">[</span><span class="s1">&#39;计划&#39;</span><span class="p">,</span> <span class="s1">&#39;预计&#39;</span><span class="p">,</span> <span class="s1">&#39;未来&#39;</span><span class="p">,</span> <span class="s1">&#39;目标&#39;</span><span class="p">,</span> <span class="s1">&#39;可能&#39;</span><span class="p">,</span> <span class="s1">&#39;如果&#39;</span><span class="p">,</span> <span class="s1">&#39;机遇&#39;</span><span class="p">,</span> <span class="s1">&#39;预期&#39;</span><span class="p">,</span> <span class="s1">&#39;挑战&#39;</span><span class="p">,</span> <span class="s1">&#39;预测&#39;</span><span class="p">,</span> <span class="s1">&#39;今后&#39;</span><span class="p">,</span> <span class="s1">&#39;目的&#39;</span><span class="p">,</span> <span class="s1">&#39;契机&#39;</span><span class="p">,</span> <span class="s1">&#39;前景&#39;</span><span class="p">,</span> <span class="s1">&#39;希望&#39;</span><span class="p">,</span> <span class="s1">&#39;展望&#39;</span><span class="p">,</span> <span class="s1">&#39;相信&#39;</span><span class="p">,</span> <span class="s1">&#39;愿景&#39;</span><span class="p">,</span> <span class="s1">&#39;期待&#39;</span><span class="p">,</span> <span class="s1">&#39;明年&#39;</span><span class="p">,</span> <span class="s1">&#39;期望&#39;</span><span class="p">],</span>
                  <span class="n">topn</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;计划&#39;,
 &#39;预计&#39;,
 ......
 &#39;几年&#39;,
 &#39;积极影响&#39;,
 &#39;有何&#39;,
 &#39;谢谢您提问&#39;,
 &#39;今后&#39;,
 &#39;这块&#39;,
 &#39;近几年&#39;,
 &#39;近两年&#39;,
 &#39;请问李&#39;,
 &#39;裁员&#39;,
 &#39;亮点&#39;,
 &#39;准备采取&#39;,
 &#39;将会&#39;,
 &#39;接下来&#39;,
 &#39;有何规划&#39;,
 &#39;前景&#39;,
 &#39;管理层是否&#39;,
 &#39;未来几年&#39;,
 &#39;有没有新&#39;,
 &#39;发展状况&#39;,
 &#39;一块&#39;,
 &#39;当前&#39;,
 &#39;很大&#39;,
 &#39;这块业务&#39;,
 &#39;LNG船&#39;,
 &#39;具体措施您好&#39;,
 &#39;当下&#39;,
 &#39;是否能够&#39;,
 &#39;明后&#39;,
 &#39;一个台阶&#39;,
 &#39;是否符合&#39;,
 &#39;巨大&#39;,
 &#39;预判&#39;,
 &#39;对此&#39;,
 &#39;未来三年&#39;,
 &#39;资本开支&#39;,
 &#39;不少&#39;,
 &#39;未来是否&#39;,
 &#39;这方面&#39;,
 &#39;看法&#39;,
 &#39;今年以来&#39;,
 &#39;疫情结束&#39;,
 &#39;想知道&#39;,
 &#39;取得不错&#39;,
 &#39;谈谈&#39;,
 &#39;一步&#39;,
 &#39;今年是否&#39;,
 &#39;发展前景&#39;,
 &#39;东宝&#39;,
 &#39;现状&#39;]
</code></pre></div><br>
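<p>可以看到，扩展出的候选词里混入了不少口语化噪音（如「谢谢您提问」「请问李」）。在人工筛选之前，可以先用一个简单的规则过滤器缩小范围。下面是一个最小示意（<code>filter_candidates</code> 函数名、最小长度阈值与屏蔽子串列表均为假设，并非论文方法）：</p>

```python
def filter_candidates(words, min_len=2, banned_substrings=('您好', '请问', '谢谢')):
    """保留长度不小于min_len、且不含任何屏蔽子串的候选词，去重并保持原顺序。"""
    kept = []
    for w in words:
        if len(w) < min_len:
            continue
        if any(b in w for b in banned_substrings):
            continue
        if w not in kept:  # 去重，保留首次出现的顺序
            kept.append(w)
    return kept

print(filter_candidates(['计划', '请问李', '谢谢您提问', '前景', '前景', '对此']))
# → ['计划', '前景', '对此']
```

<p>规则过滤只能去掉明显噪音，语义上不相关的词（如「裁员」「LNG船」）仍需人工剔除。</p>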
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">expand_dictionary</span><span class="p">(</span><span class="n">wv</span><span class="o">=</span><span class="n">wv2</span><span class="p">,</span> 
                  <span class="c1">#前瞻性种子词</span>
                  <span class="n">seedwords</span><span class="o">=</span> <span class="p">[</span><span class="s1">&#39;计划&#39;</span><span class="p">,</span> <span class="s1">&#39;预计&#39;</span><span class="p">,</span> <span class="s1">&#39;未来&#39;</span><span class="p">,</span> <span class="s1">&#39;目标&#39;</span><span class="p">,</span> <span class="s1">&#39;可能&#39;</span><span class="p">,</span> <span class="s1">&#39;如果&#39;</span><span class="p">,</span> <span class="s1">&#39;机遇&#39;</span><span class="p">,</span> <span class="s1">&#39;预期&#39;</span><span class="p">,</span> <span class="s1">&#39;挑战&#39;</span><span class="p">,</span> <span class="s1">&#39;预测&#39;</span><span class="p">,</span> <span class="s1">&#39;今后&#39;</span><span class="p">,</span> <span class="s1">&#39;目的&#39;</span><span class="p">,</span> <span class="s1">&#39;契机&#39;</span><span class="p">,</span> <span class="s1">&#39;前景&#39;</span><span class="p">,</span> <span class="s1">&#39;希望&#39;</span><span class="p">,</span> <span class="s1">&#39;展望&#39;</span><span class="p">,</span> <span class="s1">&#39;相信&#39;</span><span class="p">,</span> <span class="s1">&#39;愿景&#39;</span><span class="p">,</span> <span class="s1">&#39;期待&#39;</span><span class="p">,</span> <span class="s1">&#39;明年&#39;</span><span class="p">,</span> <span class="s1">&#39;期望&#39;</span><span class="p">],</span>
                  <span class="n">topn</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;计划&#39;,
 &#39;预计&#39;,
  ......
 &#39;相信&#39;,
 &#39;将会&#39;,
 &#39;未来&#39;,
 &#39;希望&#39;,
 &#39;预见&#39;,
 &#39;预期&#39;,
 &#39;可能&#39;,
 &#39;必将&#39;,
 &#39;应该&#39;,
 &#39;未来几年&#39;,
 &#39;今后&#39;,
 &#39;有望&#39;,
 &#39;目标&#39;,
 &#39;这一&#39;,
 &#39;当前&#39;,
 &#39;当下&#39;,
 &#39;无疑&#39;,
 &#39;期望&#39;,
 &#39;接下来&#39;,
 &#39;意味着&#39;,
 &#39;背景&#39;,
 &#39;期待&#39;,
 &#39;近期&#39;,
 &#39;下一阶段&#39;,
 &#39;机会&#39;,
 &#39;看到&#39;,
 &#39;预示&#39;,
 &#39;能够&#39;,
 &#39;短期内&#39;,
 &#39;未来一段时间&#39;,
 &#39;将来&#39;,
 &#39;展望未来&#39;,
 &#39;必须&#39;,
 &#39;真正&#39;,
 &#39;眼光&#39;,
 &#39;必然&#39;,
 &#39;还会&#39;,
 &#39;预计&#39;,
 &#39;未来十年&#39;,
 &#39;机遇&#39;,
 &#39;可能性&#39;,
 &#39;后续&#39;,
 &#39;潜在&#39;,
 &#39;决心&#39;,
 &#39;信心&#39;,
 &#39;仍然&#39;,
 &#39;非常&#39;,
 &#39;这为&#39;,
 &#39;未来五年&#39;,
 &#39;短时间&#39;]
</code></pre></div><p><strong>大邓假装经过很多检查，剔除不相关词语，最终跟论文一样，都得到了174个前瞻性词语。需要说明，大邓已经将这174个词内置到了cntext库(1.8.6)中</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">import cntext as ct

#cntext已内置了论文中的174个前瞻性词集
fls_words = ct.load_pkl_dict(&#39;Chinese_FLS.pkl&#39;)[&#39;Chinese_FLS&#39;]
fls_words
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">[&#39;计划&#39;,
 &#39;预计&#39;,
 &#39;未来&#39;,
 &#39;目标&#39;,
 ......
 &#39;企业宗旨&#39;,
 &#39;宗旨&#39;,
 &#39;该愿景&#39;,
 &#39;愿望&#39;,
 &#39;心愿&#39;,
 &#39;盼望&#39;,
 &#39;祝愿&#39;,
 &#39;今年年底&#39;,
 &#39;今年底&#39;,
 &#39;明年初&#39;,
 &#39;第二季度&#39;,
 &#39;上半年&#39;,
 &#39;下半年&#39;,
 &#39;本月底&#39;,
 &#39;下周&#39;,
 &#39;马上&#39;,
 &#39;厚望&#39;,
 &#39;期盼&#39;,
 &#39;鞭策&#39;,
 &#39;梦想&#39;,
 &#39;愿&#39;]
</code></pre></div><p><br><br></p>
<h2 id="三计算前瞻性">三、计算前瞻性</h2>
<ol>
<li>汇总记录：将同一年同一家上市公司的所有问答合并为一条记录，存储于df2中。</li>
<li>设计前瞻性计算函数 <strong>compute_fls</strong></li>
<li>对df2[&lsquo;text&rsquo;]使用前瞻性计算函数<strong>compute_fls</strong>，计算结果保存到字段<strong>Forward</strong></li>
</ol>
<h3 id="31-汇总记录">3.1 汇总记录</h3>
<p>将同一年同一家上市公司的所有问答合并为一条记录，存储于新df的text中。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">])[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">df2</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="n">df2</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
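<p>3.1 的合并逻辑可以用一个玩具 DataFrame 验证（数据为虚构，仅作示意）：同一(股票代码, 会计年度)组内的多条问答按原顺序拼接成一条记录。</p>

```python
import pandas as pd

# 虚构的三条问答记录：前两条属于同一家公司同一年度
df = pd.DataFrame({
    '股票代码': ['000001', '000001', '000002'],
    '会计年度': [2020, 2020, 2020],
    'text': ['问:A 答:B。', '问:C 答:D。', '问:E 答:F。'],
})
# 与正文相同的合并方式：组内文本直接首尾相接
df2 = df.groupby(['股票代码', '会计年度'])['text'].apply(''.join).reset_index()
print(len(df2))            # 2
print(df2.loc[0, 'text'])  # 问:A 答:B。问:C 答:D。
```

<p>groupby 默认按分组键排序、组内保持原行序，因此拼接结果是可复现的。</p>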
<br>
<h3 id="32--设计前瞻性计算函数compute_fls">3.2  设计前瞻性计算函数compute_fls</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">jieba</span>


<span class="n">fls_words</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_pkl_dict</span><span class="p">(</span><span class="s1">&#39;Chinese_FLS.pkl&#39;</span><span class="p">)[</span><span class="s1">&#39;Chinese_FLS&#39;</span><span class="p">]</span>
<span class="n">stopwords</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_pkl_dict</span><span class="p">(</span><span class="s1">&#39;STOPWORDS.pkl&#39;</span><span class="p">)[</span><span class="s1">&#39;STOPWORDS&#39;</span><span class="p">][</span><span class="s1">&#39;chinese&#39;</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">compute_fls</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">num</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">fls_words</span><span class="p">:</span>
            <span class="n">num</span><span class="o">+=</span><span class="mi">1</span>
    <span class="c1">#+1是为了防止分母为0的情况</span>
    <span class="k">return</span> <span class="n">num</span><span class="o">/</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">words</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div><br>
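<p>compute_fls 的核心就是「前瞻词命中数 / (词数+1)」这个比值。下面用一个自包含的小例子演示该比值逻辑（词表与分词结果均为虚构，替代 jieba 与 cntext 的真实输出）：</p>

```python
# 虚构的小词表，代替cntext内置的174个前瞻性词
fls_words = {'计划', '未来', '预计'}

def fls_ratio(words):
    """前瞻词占比；分母+1是为了防止空文本时除零。"""
    hits = sum(1 for w in words if w in fls_words)
    return hits / (len(words) + 1)

# 模拟jieba对一句回答的分词结果：命中"计划""未来"两词
print(round(fls_ratio(['公司', '计划', '未来', '扩大', '产能']), 3))  # 0.333
```

<p>顺带一提：正文的 compute_fls 对每个词做 <code>word in fls_words</code> 判断，若先把 fls_words 转成 set，成员判断从 O(n) 降到 O(1)，处理几十万条记录时提速明显。</p>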
<h3 id="33-批量计算df2text">3.3 批量计算df2[&lsquo;text&rsquo;]</h3>
<p>对df2[&lsquo;text&rsquo;]批量使用前瞻性计算函数<strong>compute_fls</strong>，计算结果保存到字段<strong>Forward</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">df2[&#39;Forward&#39;] = df2[&#39;text&#39;].apply(compute_fls)

df2.head()
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<p>下图是论文中的Forward描述性统计。</p>
<p><img loading="lazy" src="img/stats.png" alt=""  />
</p>
<p>我们试着看看分析结果 <strong>df2[&lsquo;Forward&rsquo;]</strong> 的描述性统计。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Forward最小值: &#39;</span><span class="p">,</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;Forward&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Forward中位数: &#39;</span><span class="p">,</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;Forward&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">median</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Forward均值: &#39;</span><span class="p">,</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;Forward&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Forward最大值: &#39;</span><span class="p">,</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;Forward&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Forward标准差: &#39;</span><span class="p">,</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;Forward&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">())</span>
</code></pre></div><p>可以发现描述性统计信息与论文中的存在较大差异，可能的原因包括但不限于：</p>
<pre><code>  1. 数据集存在差异。论文选取2007-2020年中小板和创业板上市公司作为研究对象，而本实验使用2005-2023年A股所有上市公司。
  2. 论文可能对记录做了筛选，剔除了文本过短的业绩说明会。
  3. 使用的停用词表不同。
  4. 论文可能为jieba导入了自定义词典。
</code></pre>
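<p>以原因2为例，剔除过短文本只需一行筛选。下面是一个假设性示意（阈值100字是随手设的，论文并未给出具体标准）：</p>

```python
import pandas as pd

# 虚构两条合并后的记录：一条120字，一条仅3字
df2 = pd.DataFrame({'text': ['长' * 120, '短文本']})
# 剔除全文不足100字的记录（阈值为假设值）
df2 = df2[df2['text'].str.len() >= 100].reset_index(drop=True)
print(len(df2))  # 1
```
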
<p><br><br></p>
<h2 id="精选内容">精选内容</h2>
<ul>
<li><a href="https://textdata.cn/blog/2025-02-14-using-online-large-model-api-to-transform-text-data-into-structured-data/"><strong>教程 | 使用大模型将文本数据转化为结构化数据</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-how-to-download-large-language-model-with-ollama/"><strong>教程 | 如何使用 Ollama 下载 &amp; 使用本地大语言模型</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-08-07-structured-outputs-with-ollama/"><strong>实验 | 如何使 Ollama 结构化输出 JSON 样式的结果</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-04-27-cntext2x-usage-tutorial/"><strong>推荐 | 文本分析库 cntext 使用手册</strong></a></li>
<li><a href="https://textdata.cn/blog/2024-06-14-using-large-language-model-to-extract-structure-data-from-raw-text/"><strong>实验 | 使用本地大模型从文本中提取结构化信息</strong></a></li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 84w条业绩说明会问答数据(2005-2023)</title>
      <link>https://textdata.cn/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/</link>
      <pubDate>Fri, 08 Sep 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-09-08-china-a-share-market-listed-company-earnings-communication-conference/</guid>
      <description>业绩说明会是我国上市公司和中小投资者沟通交流的重要载体。在年报披露后，能够帮助投资者快速、准确地抓取信息披露重点，全面了解企业发展状况，增进对企业价值及经营理念的认同。上市公司的业绩说明会是金融领域中的重要事件，它为投资者、分析师和其他利益相关者提供了一个与公司管理层直接交流的平台。这类数据集的学术价值体现在多个方面。</description>
      <content:encoded><![CDATA[<p>业绩说明会是我国上市公司和中小投资者沟通交流的重要载体。在年报披露后，能够帮助投资者快速、准确地抓取信息披露重点，全面了解企业发展状况，增进对企业价值及经营理念的认同。上市公司的业绩说明会是金融领域中的重要事件，它为投资者、分析师和其他利益相关者提供了一个与公司管理层直接交流的平台。这类数据集的学术价值体现在多个方面。</p>
<ul>
<li>公司沟通策略的研究：业绩说明会的数据可以帮助研究者深入了解公司如何与公众沟通其财务状况、业务策略和未来展望。这对于传播学、公关和企业战略研究领域都是宝贵的。</li>
<li>情感分析与市场反应：通过对业绩说明会中的语言和情感进行分析，研究者可以探索市场对公司信息披露的反应。这对于金融经济学和计量经济学的研究尤为重要。</li>
<li>公司治理与透明度：业绩说明会的频率、内容和与投资者的互动可以为研究者提供关于公司治理质量和透明度的线索。</li>
<li>预测模型的建立：这种数据集可以用于建立预测模型，预测公司的未来业绩、股价走势或其他相关指标。</li>
<li>行为金融学的研究：业绩说明会中的问题和答案可以为研究者提供关于投资者和分析师行为和心理的深入了解，从而深化我们对市场非理性行为的理解。</li>
<li>宏观经济指标的研究：通过对多家公司的业绩说明会数据进行汇总和分析，研究者可以获得宏观经济趋势和行业动态的宝贵见解。</li>
</ul>
<p>总之，上市公司业绩说明会数据集为学术界提供了一个独特的、多维度的研究视角，有助于深化我们对金融市场、公司策略和投资者行为的理解。</p>
<p><br><br></p>
<h2 id="数据集介绍">数据集介绍</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">上市公司业绩说明会问答数据

【年度】
 2005-2023年

【字段】
 - 股票代码
 - 会计年度
 - 问题序号
 - 提问内容
 - 提问时间
 - 回答人
 - 回答时间
 - 回答内容
 
 【数据量】
  841876
  
  科研用途，仅供展示；如有任何问题， 请加微信372335839， 备注「姓名-学校-专业-业绩说明会」
</code></pre></div><p><br><br></p>
<h2 id="导入数据">导入数据</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;业绩说明会问答05-23.xlsx&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
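<p>一个读取此类数据时的常见坑：Excel 往往把股票代码存成数字，前导零（如「000001」）会丢失。下面是一个假设性示意，演示如何补齐为6位代码（示例数据为虚构）：</p>

```python
import pandas as pd

# 虚构两个被Excel转成数字的股票代码
df = pd.DataFrame({'股票代码': [1, 600519]})
# 转为字符串并左侧补零至6位，恢复标准代码格式
df['股票代码'] = df['股票代码'].astype(str).str.zfill(6)
print(df['股票代码'].tolist())  # ['000001', '600519']
```

<p>也可以在 read_excel 时直接指定 <code>dtype={'股票代码': str}</code>，避免类型转换发生。</p>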
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="s1">&#39;数据集覆盖的年度: </span><span class="si">{start}</span><span class="s1">~</span><span class="si">{end}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">start</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> 
                                 <span class="n">end</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><pre><code>'数据集覆盖的年度: 2005~2023'
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#数据量</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div><pre><code>841876
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#字段包括</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span>
</code></pre></div><pre><code>Index(['股票代码', '会计年度', '问题序号', '提问内容', '提问时间', '回答人', '回答时间', '回答内容'], dtype='object')
</code></pre>
<br>
<p>设置matplotlib</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="c1">#pip3 install scienceplots</span>
<span class="kn">import</span> <span class="nn">scienceplots</span> 
<span class="kn">import</span> <span class="nn">platform</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#年份变化(业绩说明会数量)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
</code></pre></div><pre><code>会计年度
2005     4042
2006    10051
2007    18906
2008    31782
2009    35802
2010    47141
2011    69439
2012    73231
2013    80456
2014    80690
2015    62764
2016    61820
2017    52543
2018    44279
2019    42009
2020    37026
2021    53898
2022    35917
2023       80
Name: count, dtype: int64
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;业绩说明会数量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;业绩说明会数量年份变化&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_5_0.svg" alt="svg"  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;会计年度&#39;</span><span class="p">,</span> <span class="s1">&#39;股票代码&#39;</span><span class="p">])[</span><span class="s1">&#39;问题序号&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>

<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;会计年度&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;年度问答次数&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;业绩说明会平均问答次数年份变化&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_6_0.svg" alt="svg"  />
</p>
<p><br><br></p>
<h2 id="相关文献">相关文献</h2>
<p>许帅,邵帅,何贤杰.业绩说明会前瞻性信息对分析师盈余预测准确性的影响——信口雌黄还是言而有征[J].中国管理科学:1-15.</p>
<blockquote>
<p>摘要:本文以2007—2020年上市公司业绩说明会为背景，研究前瞻性信息披露对分析师预测的影响，发现业绩说明会中的前瞻性信息可以显著提升分析师盈余预测准确性。公司的信息不对称程度越高，前瞻性信息对分析师预测准确性提升越多。分析师专长工作经验越丰富，具备更强的信息捕捉能力，可以更好地吸收与理解业绩说明会中的前瞻性信息，做出更准确的预测。进一步，本文对前瞻性信息影响分析师预测的路径进行了讨论，认为前瞻性信息可能通过吸引分析师和机构投资者调研，增进分析师对上市公司经营状况的了解，进而提升盈余预测准确性。此外，本文发现，前瞻性信息中业绩相关类信息因具有更高的可信度，且与盈余因子直接相关，能够显著提升分析师盈余预测准确性。本研究为管理层披露与分析师的互动研究提供了增量证据，研究结果支持了业绩说明会有效性，对未来监管部门制定相关信息披露政策提供依据和建议。</p>
</blockquote>
<br>
<p>卞世博,管之凡,阎志鹏.答非所问与市场反应:基于业绩说明会的研究[J].管理科学学报,2021,24(04):109-126.</p>
<blockquote>
<p>摘要:对上市公司业绩说明会中投资者与管理层问答互动中管理层答非所问的现象进行了研究.本文以中小板和创业板上市公司召开的业绩说明会作为研究样本,利用文本分析方法对业绩说明会中管理层在回答投资者提问时答非所问的程度进行度量,进而实证分析了管理层的答非所问与市场反应和公司未来业绩表现之间的可能关联.结果发现:在控制其它因素之后,管理层的答非所问与市场反应之间呈现显著的负相关关系,即公司管理层的答非所问程度越高,随后公司股票的市场表现则就会越差,并且对于那些低分析师关注的公司尤为明显;而在公司未来业绩表现方面,管理层答非所问的程度越高,则公司未来的业绩表现则会越差.</p>
</blockquote>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 2021年幸福指数&amp;人口数据可视化最佳实践</title>
      <link>https://textdata.cn/blog/2023-08-31-data_eda_2021_happiness_and_population/</link>
      <pubDate>Thu, 31 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-31-data_eda_2021_happiness_and_population/</guid>
      <description>&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;作者: JOSH
原文: https://www.kaggle.com/code/joshuaswords/awesome-eda-2021-happiness-population/notebook
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h2 id=&#34;幸福指数&#34;&gt;幸福指数&lt;/h2&gt;
&lt;p&gt;本笔记本纯粹是一项探索性数据分析，目的是看看我能否找出使一个国家感到幸福或不幸的因素。为此，我将分析和探索&lt;strong&gt;2021年的世界幸福指数&lt;/strong&gt;，以及&lt;strong&gt;自2005年以来的历史世界幸福指数数据&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;我希望在这个过程中能学到一些东西，也希望读到这篇文章的任何人也能如此。&lt;/p&gt;
&lt;p&gt;另外，我会引入人口数据来研究这是否与幸福水平有明显的联系。&lt;/p&gt;
&lt;p&gt;我还将探索各国是否能够随着时间的推移改善其排名，或者这些排名是否基本保持不变。&lt;/p&gt;
&lt;p&gt;最后，我将使用K均值和肘部方法正式地对我们的数据进行聚类，以查看我们是否可以根据数据集中各种指标的分数将国家分组在一起。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;准备工作&#34;&gt;准备工作&lt;/h2&gt;
&lt;p&gt;安装必要的包&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install pywaffle geopandas pycountry 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;本文代码较多， 只展示部分代码，&lt;a href=&#34;code.zip&#34;&gt;点击完整的代码&amp;amp;数据&lt;/a&gt;，请前往textdata.cn下载&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;&lt;br&gt; 导入数据&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;warnings&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;warnings&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;filterwarnings&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;ignore&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;        
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#get data&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/world-happiness-report-2021.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/world-happiness-report.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pop&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;data/population_by_country_2020.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;safety&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;copy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# rename columns to unify field names across the datasets for an easier merge later&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Country name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Country&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Country name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Country&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pop&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Country (or dependency)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Country&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#might use later &lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;temporal&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;year&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Country&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Life Ladder&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unstack&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;T&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;temporal&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;temporal&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# colours&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;low_c&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;#dd4124&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;high_c&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;#009473&amp;#39;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rcParams&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;font.family&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;monospace&amp;#34;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
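The column renames in the block above exist to give all three tables the same `Country` key, so the later join can look roughly like this. This is a sketch on made-up rows; `Population (2020)` is a hypothetical column name standing in for whatever the real population CSV uses.

```python
import pandas as pd

# tiny stand-ins for the happiness and population tables after renaming
df = pd.DataFrame({'Country': ['Finland', 'Denmark'],
                   'Ladder score': [7.84, 7.62]})
pop = pd.DataFrame({'Country': ['Finland', 'Denmark'],
                    'Population (2020)': [5_540_720, 5_792_202]})

# inner join on the unified 'Country' key produced by the renames
merged = df.merge(pop, on='Country', how='inner')
```

An inner join keeps only countries present in both tables, which quietly drops any country whose name is spelled differently across sources; passing `indicator=True` to `merge` is a quick way to audit that.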
&lt;h2 id=&#34;初始概览&#34;&gt;Initial Overview&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# inspiration ; https://www.kaggle.com/gaetanlopez/how-to-make-clean-visualizations&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# changed the code significantly&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;6&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dpi&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;150&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;gs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;add_gridspec&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;gs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;update&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wspace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;hspace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;add_subplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;background_color&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;#fafafa&amp;#34;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;patch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_facecolor&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;background_color&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# figure background color&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_facecolor&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;background_color&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.167&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.85&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;2021 World Happiness Index&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;#323232&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;28&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.13&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.35&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;stand-out facts&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;lightgray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;28&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Finland&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;high_c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Happiest&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.77&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;9 of top 10&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;high_c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.75&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;in Europe&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;7 of bottom 10&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;low_c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;in Africa&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;2.25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Afghanistan&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;low_c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontweight&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bold&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;2.25&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Unhappiest&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fontsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fontfamily&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;monospace&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ha&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_yticklabels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_xticklabels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tick_params&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;both&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;length&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;s&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;top&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;right&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;ax0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;spines&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_visible&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.lines&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;lines&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;l1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Line2D&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.95&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.67&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.67&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transFigure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linestyle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linewidth&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;alpha&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;extend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;l2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Line2D&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.15&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.95&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.07&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.07&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;transform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;transFigure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linestyle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linewidth&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;alpha&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;fig&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lines&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;extend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_4_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;世界上最幸福的国家是哪些&#34;&gt;世界上最幸福的国家是哪些？&lt;/h2&gt;
&lt;p&gt;对我来说，“&lt;strong&gt;幸福&lt;/strong&gt;”似乎是一个高度个体化的指标，很难一概而论。不过，有些国家在幸福指数排名中的表现始终稳定。&lt;/p&gt;
&lt;p&gt;我们还注意到，前10名中有9个是欧洲国家，而后10名中有7个是非洲国家。&lt;/p&gt;
&lt;p&gt;让我们看看目前位于列表顶端的国家，以及那些位于底部的国家。&lt;/p&gt;
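取前后若干名的做法可以用 pandas 的 nlargest / nsmallest 勾勒如下（示例数据为虚构，“Country”“Ladder score”等列名沿用本文数据集的字段，真实排名请以图表为准）：

```python
import pandas as pd

# 虚构的小样本，仅演示取前/后若干名的写法
df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Ladder score': [7.8, 7.6, 5.1, 3.2, 2.9, 6.4],
})

top = df.nlargest(3, 'Ladder score')      # 得分最高的若干国家
bottom = df.nsmallest(3, 'Ladder score')  # 得分最低的若干国家
print(top['Country'].tolist())     # ['A', 'B', 'F']
print(bottom['Country'].tolist())  # ['E', 'D', 'C']
```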
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_6_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;现在让我们把前10名和后10名并排放置，以便从另一个角度观察。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_8_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;乍一看，我们发现世界上最幸福的许多国家确实位于欧洲。&lt;/p&gt;
&lt;p&gt;另一个额外的观察是，位于前10名的欧洲国家都是北欧国家。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;happiness_mean&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;lower_happy&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;happiness_mean&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
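上面的 apply + lambda 也可以写成向量化比较，结果等价且通常更快（示例数据为虚构）：

```python
import pandas as pd

df = pd.DataFrame({'Ladder score': [7.8, 5.0, 3.1, 6.2]})
happiness_mean = df['Ladder score'].mean()

# 与原文 lambda 等价：低于均值记 0，否则记 1
df['lower_happy'] = (df['Ladder score'] >= happiness_mean).astype(int)
print(df['lower_happy'].tolist())  # [1, 0, 0, 1]
```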
&lt;h2 id=&#34;这种情况经常发生吗&#34;&gt;这种情况经常发生吗？&lt;/h2&gt;
&lt;p&gt;稍后我将更深入地探索时间上的变化，但现在，让我们看一下这些年来排在前20名的国家。&lt;/p&gt;
&lt;p&gt;这个图展示了从2005年至今，前20名国家的所有分数，特别突出了它们的平均分和2021年的分数。&lt;/p&gt;
&lt;p&gt;值得注意的是，尽管有疫情的影响，许多国家在2021年的分数比他们的平均分还要高。&lt;/p&gt;
&lt;p&gt;尽管这些分数确实有所不同，但它们仍然相对较高。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_12_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;为什么会有差异&#34;&gt;为什么会有差异？&lt;/h2&gt;
&lt;p&gt;我们现在了解到，北欧国家一直位居榜首。&lt;/p&gt;
&lt;p&gt;让我们更仔细地探究一下欧洲与世界其他地区之间的这些差异。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_14_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;幸福程度较高的国家往往是那些预期寿命更长、GDP更高的国家。这也基本上包括了西欧。&lt;/p&gt;
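“幸福分数与 GDP、预期寿命正相关”这一观察，可以用 pandas 的 corr 快速验证（下面的数值为虚构的小样本，列名沿用本文数据集）：

```python
import pandas as pd

# 虚构小样本：幸福分数与 GDP、健康预期寿命
df = pd.DataFrame({
    'Ladder score':            [7.8, 7.1, 5.5, 4.0, 3.2],
    'Logged GDP per capita':   [10.9, 10.5, 9.2, 8.1, 7.6],
    'Healthy life expectancy': [72.0, 71.5, 66.0, 58.0, 55.0],
})

# 各指标与幸福分数的皮尔逊相关系数
corr = df.corr()['Ladder score']
print(corr.round(2))
```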
&lt;p&gt;现在让我们明确地关注一下非洲&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_16_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;总体而言，非洲国家有更低的预期寿命、更低的GDP，最终也有更低的幸福指数分数。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;其他因素&#34;&gt;其他因素&lt;/h2&gt;
&lt;p&gt;因此，GDP和预期寿命是影响因素。还有什么其他因素可以考虑呢？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_18_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;正如我在图中指出的，自由与腐败感知呈负相关：腐败感知越高，自由度通常越低。&lt;/p&gt;
&lt;p&gt;不过值得注意的是，有几个欧洲国家的腐败感知水平也相当高。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;大陆视角&#34;&gt;大陆视角&lt;/h2&gt;
&lt;p&gt;让我们将这些国家按照各自所属的大陆分类，看看我们能否了解更多。&lt;/p&gt;
&lt;p&gt;当然，我们预期西欧会排名很高，但是在幸福排名中，还有没有其他表现特别好或特别差的大陆？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_20_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;可以清晰地看到有三个大陆群体。稍后将对此进行更多讨论&amp;hellip;&lt;/p&gt;
&lt;p&gt;撒哈拉以南非洲和南亚的分数最低。而西欧以及北美和澳新（ANZ）则遥遥领先，位于榜单的顶端。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;continent_score&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Regional indicator&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Healthy life expectancy&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Logged GDP per capita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Perceptions of corruption&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Freedom to make life choices&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ascending&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span 
class=&#34;p&#34;&gt;)[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df_bottom&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Country&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Logged GDP per capita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Perceptions of corruption&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Freedom to make life choices&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Social support&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sort_values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;by&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ascending&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[:&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df_bottom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Logged GDP per capita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df_bottom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Logged GDP per capita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_bottom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df_bottom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Ladder score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;categorical&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;var&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;O&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;continuous&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;columns&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;var&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;!=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;O&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#refined&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;continuous&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Logged GDP per capita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
 &lt;span class=&#34;s1&#34;&gt;&amp;#39;Social support&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
 &lt;span class=&#34;s1&#34;&gt;&amp;#39;Healthy life expectancy&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
 &lt;span class=&#34;s1&#34;&gt;&amp;#39;Freedom to make life choices&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
 &lt;span class=&#34;s1&#34;&gt;&amp;#39;Generosity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
 &lt;span class=&#34;s1&#34;&gt;&amp;#39;Perceptions of corruption&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
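上面按 dtype 逐列筛选类别型/数值型字段的写法，也可以用 pandas 自带的 select_dtypes 实现（示例数据为虚构）：

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'B'],
    'Regional indicator': ['X', 'Y'],
    'Ladder score': [7.8, 3.2],
})

# 非数值列视为类别型，数值列视为连续型
categorical = df.select_dtypes(exclude='number').columns.tolist()
continuous = df.select_dtypes(include='number').columns.tolist()
print(categorical)  # ['Country', 'Regional indicator']
print(continuous)   # ['Ladder score']
```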
&lt;h2 id=&#34;高于和低于平均幸福水平的差异&#34;&gt;高于和低于平均幸福水平的差异&lt;/h2&gt;
&lt;p&gt;让我们一次绘制多个特征，按照平均幸福水平进行划分。如往常一样，最幸福的国家以绿色显示。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_24_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;上面的图表确认了我们之前看到的一些内容，并带有一些值得注意的特点，比如社会支持。&lt;/p&gt;
&lt;p&gt;有趣的是，不太幸福的国家反而报告了更高的慷慨度。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;全球视角&#34;&gt;全球视角&lt;/h2&gt;
&lt;p&gt;我们现在已经看到了基于多个因素不同国家之间明显的差异。&lt;/p&gt;
&lt;p&gt;现在让我们从全球角度来看这个问题。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_26_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;这张图确认了我们之前的发现，南亚和非洲处于红色区域。&lt;/p&gt;
&lt;p&gt;但它也突出了我们可以进一步调查的地区。例如，中国和印度都在红色区域，它们的人口都超过了10亿。我们能否研究人口与幸福水平之间的关系？
&lt;br&gt;&lt;br&gt;&lt;/p&gt;
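引入人口数据时，需要按国家名把人口表合并进幸福指数表。下面是一个最小示意（数据与 'Population (2020)' 列名均为假设，实际列名以下载的数据文件为准）：

```python
import pandas as pd

# 虚构示例：按国家名把人口数据左连接进幸福指数表
df = pd.DataFrame({'Country': ['A', 'B', 'C'],
                   'Ladder score': [7.8, 5.0, 3.2]})
pop = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    'Population (2020)': [5000000, 80000000, 1300000000]})

merged = df.merge(pop, on='Country', how='left')
print(merged.shape)  # (3, 3)
```

实际合并时，两份数据里同一国家的写法未必一致，通常还需要先做一轮名称对齐。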
&lt;h2 id=&#34;人口&#34;&gt;人口&lt;/h2&gt;
&lt;p&gt;让我们引入更多的因素——比如人口。&lt;/p&gt;
&lt;p&gt;这是否会影响幸福水平？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_28_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;我们可以清晰地看到，更幸福的国家往往人口年龄结构更老、人口规模更小。&lt;/p&gt;
&lt;p&gt;我加入了欧洲作为参考。&lt;/p&gt;
&lt;p&gt;那么生育率呢？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_30_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;正如我所猜测的，更幸福的国家的生育率通常也更低，这很可能与避孕措施更易获得有关。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_32_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;我很惊讶人口密度并不影响幸福感——尽管这可能是因为个人偏好！&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;随着时间的推移有没有变化&#34;&gt;随着时间的推移，有没有变化？&lt;/h2&gt;
&lt;p&gt;不幸福的国家会变得更幸福吗？&lt;/p&gt;
&lt;p&gt;这仅仅是一个时间点的快照吗？还是这些趋势更加持久？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_34_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;令人担忧的是，不幸福的国家依然不幸福；更糟糕的是，它们似乎变得更加不幸福了。&lt;/p&gt;
&lt;p&gt;这种趋势是持续的吗？或者某些国家的分数会随着时间的推移而提高？&lt;/p&gt;
&lt;p&gt;让我们更多地探讨一下随时间变化的情况。&lt;/p&gt;
&lt;p&gt;在上面，我选取了几个国家作为样本。让我们用一个斜率图来绘制他们从2007年到2020年的变化，看看我们能否从中学到什么。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_36_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;显然，多年来确实有很多变化。&lt;/p&gt;
&lt;p&gt;哪些国家经历了最大的变化？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_38_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_39_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;让我们比较在幸福指数得分方面增长最多和下降最多的两个国家：保加利亚和约旦。&lt;/p&gt;
&lt;p&gt;我们将对比他们多年来的表现。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_41_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;当我探究这个关于时间变化的观点时，我想从大陆的角度来看。&lt;/p&gt;
&lt;p&gt;例如，西欧的所有国家都“幸福”吗？&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_43_1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">作者: JOSH
原文: https://www.kaggle.com/code/joshuaswords/awesome-eda-2021-happiness-population/notebook
</code></pre></div><br>
<h2 id="幸福指数">幸福指数</h2>
<p>本笔记本纯粹是一项探索性数据分析，目的是看看我能否找出使一个国家感到幸福或不幸的因素。为此，我将分析和探索<strong>2021年的世界幸福指数</strong>，以及<strong>自2005年以来的历史世界幸福指数数据</strong>。</p>
<p>我希望在这个过程中能学到一些东西，也希望读到这篇文章的任何人也能如此。</p>
<p>另外，我会引入人口数据来研究这是否与幸福水平有明显的联系。</p>
<p>我还将探索各国是否能够随着时间的推移改善其排名，或者这些排名是否基本保持不变。</p>
<p>最后，我将使用K均值和肘部方法正式地对我们的数据进行聚类，以查看我们是否可以根据数据集中各种指标的分数将国家分组在一起。</p>
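K均值聚类与肘部方法的思路可以用 scikit-learn 简单勾勒如下（纯示意：特征为随机生成的二维数据，真实分析中应替换为各国标准化后的指标矩阵）：

```python
import numpy as np
from sklearn.cluster import KMeans

# 虚构的二维特征：两团明显分开的点
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# 肘部法：记录不同 k 下的惯性(inertia)，寻找下降放缓的拐点
inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

print([round(v, 1) for v in inertias])  # k=2 之后下降明显放缓
```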
<p><br><br></p>
<h2 id="准备工作">准备工作</h2>
<p>安装必要的包</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install pywaffle geopandas pycountry
</code></pre></div><p><strong>本文代码较多，只展示部分代码；<a href="code.zip">点击下载完整的代码&amp;数据</a>，请前往textdata.cn下载</strong>。</p>
<p><br> 导入数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s2">&#34;ignore&#34;</span><span class="p">)</span>        
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>


<span class="c1">#get data</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/world-happiness-report-2021.csv&#39;</span><span class="p">)</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/world-happiness-report.csv&#39;</span><span class="p">)</span>
<span class="n">pop</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/population_by_country_2020.csv&#39;</span><span class="p">)</span>

<span class="n">safety</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>

<span class="c1"># 统一不同数据中的字段名renaming columns for easier merge later</span>
<span class="n">df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Country name&#39;</span><span class="p">:</span> <span class="s1">&#39;Country&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df2</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Country name&#39;</span><span class="p">:</span> <span class="s1">&#39;Country&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">pop</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;Country (or dependency)&#39;</span><span class="p">:</span> <span class="s1">&#39;Country&#39;</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1">#might use later </span>
<span class="n">temporal</span> <span class="o">=</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;year&#39;</span><span class="p">,</span><span class="s1">&#39;Country&#39;</span><span class="p">])[</span><span class="s1">&#39;Life Ladder&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span><span class="o">.</span><span class="n">T</span>
<span class="n">temporal</span> <span class="o">=</span> <span class="n">temporal</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>

<span class="c1"># colours</span>
<span class="n">low_c</span> <span class="o">=</span> <span class="s1">&#39;#dd4124&#39;</span>
<span class="n">high_c</span> <span class="o">=</span> <span class="s1">&#39;#009473&#39;</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s2">&#34;font.family&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&#34;monospace&#34;</span>
</code></pre></div><p><br><br></p>
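上面代码里 groupby(...).unstack().T 得到的是“国家 × 年份”的宽表，同样的变换也可以用 pivot_table 表达（示例数据为虚构）：

```python
import pandas as pd

# 虚构的长表：每行是某国某年的 Life Ladder 分数
df2 = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'Country': ['A', 'B', 'A', 'B'],
    'Life Ladder': [7.0, 5.0, 7.2, 4.8],
})

# 行为国家、列为年份的宽表，等价于 groupby + unstack().T
temporal = df2.pivot_table(index='Country', columns='year',
                           values='Life Ladder')
print(temporal.loc['A', 2020])  # 7.2
```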
<h2 id="初始概览">初始概览</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># inspiration ; https://www.kaggle.com/gaetanlopez/how-to-make-clean-visualizations</span>
<span class="c1"># changed code signif.</span>

<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span><span class="n">dpi</span><span class="o">=</span><span class="mi">150</span><span class="p">)</span>
<span class="n">gs</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_gridspec</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">gs</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">wspace</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">hspace</span><span class="o">=</span><span class="mf">0.4</span><span class="p">)</span>
<span class="n">ax0</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="n">gs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>

<span class="n">background_color</span> <span class="o">=</span> <span class="s2">&#34;#fafafa&#34;</span>
<span class="n">fig</span><span class="o">.</span><span class="n">patch</span><span class="o">.</span><span class="n">set_facecolor</span><span class="p">(</span><span class="n">background_color</span><span class="p">)</span> <span class="c1"># figure background color</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">set_facecolor</span><span class="p">(</span><span class="n">background_color</span><span class="p">)</span> 

<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.167</span><span class="p">,</span><span class="mf">0.85</span><span class="p">,</span><span class="s2">&#34;2021 World Happiness Index&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;#323232&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">28</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;sans-serif&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.13</span><span class="p">,</span><span class="o">-</span><span class="mf">0.35</span><span class="p">,</span><span class="s2">&#34;stand-out facts&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;lightgray&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">28</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>

<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="s2">&#34;Finland&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="n">high_c</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="s2">&#34;Happiest&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>

<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.77</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="s2">&#34;9 of top 10&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="n">high_c</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.75</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="s2">&#34;in Europe&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>

<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.5</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="s2">&#34;7 of bottom 10&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="n">low_c</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.5</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="s2">&#34;in Africa&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>

<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">2.25</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="s2">&#34;Afghanistan&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="n">low_c</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">2.25</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="s2">&#34;Unhappiest&#34;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;monospace&#39;</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">)</span>

<span class="n">ax0</span><span class="o">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="n">ax0</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span><span class="s1">&#39;right&#39;</span><span class="p">,</span><span class="s1">&#39;left&#39;</span><span class="p">,</span><span class="s1">&#39;bottom&#39;</span><span class="p">]:</span>
    <span class="n">ax0</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
    
<span class="kn">import</span> <span class="nn">matplotlib.lines</span> <span class="k">as</span> <span class="nn">lines</span>
<span class="n">l1</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">Line2D</span><span class="p">([</span><span class="mf">0.15</span><span class="p">,</span> <span class="mf">1.95</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.67</span><span class="p">,</span> <span class="mf">0.67</span><span class="p">],</span> <span class="n">transform</span><span class="o">=</span><span class="n">fig</span><span class="o">.</span><span class="n">transFigure</span><span class="p">,</span> <span class="n">figure</span><span class="o">=</span><span class="n">fig</span><span class="p">,</span><span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;gray&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">,</span><span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.5</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">([</span><span class="n">l1</span><span class="p">])</span>
<span class="n">l2</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">Line2D</span><span class="p">([</span><span class="mf">0.15</span><span class="p">,</span> <span class="mf">1.95</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.07</span><span class="p">,</span> <span class="mf">0.07</span><span class="p">],</span> <span class="n">transform</span><span class="o">=</span><span class="n">fig</span><span class="o">.</span><span class="n">transFigure</span><span class="p">,</span> <span class="n">figure</span><span class="o">=</span><span class="n">fig</span><span class="p">,</span><span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;gray&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">,</span><span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.5</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">([</span><span class="n">l2</span><span class="p">])</span>
    
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_4_1.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="世界上最幸福的国家是哪些">世界上最幸福的国家是哪些？</h2>
<p>对我来说，&lsquo;<strong>幸福</strong>&rsquo;似乎是一个个体化的指标，很难进行概括。然而，有些国家在幸福指数排名中的表现始终稳定。</p>
<p>我们还注意到，前10名中有9个是欧洲国家，而后10名中有7个是非洲国家。</p>
<p>让我们看看目前位于列表顶端的国家，以及那些位于底部的国家。</p>
<p><img loading="lazy" src="img/output_6_0.png" alt=""  />
</p>
<br>
<p>现在让我们把前10名和后10名并排放置，以便从另一个角度观察。</p>
<p><img loading="lazy" src="img/output_8_0.png" alt=""  />
</p>
<br>
<p>乍一看，我们发现世界上最幸福的许多国家确实位于欧洲。</p>
<p>另一个额外的观察是，位于前10名的欧洲国家都是北欧国家。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">happiness_mean</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;lower_happy&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">happiness_mean</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
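<p>上面按均值二分的做法可以用一个小例子验证（以下数据与取值均为示意，并非真实榜单数据）：</p>

```python
import pandas as pd

# 示意数据：列名与正文一致，数值为虚构
df = pd.DataFrame({'Ladder score': [7.8, 7.6, 5.5, 3.2, 2.9]})

happiness_mean = df['Ladder score'].mean()  # (7.8+7.6+5.5+3.2+2.9)/5 = 5.4

# 与正文相同的写法：低于均值记 0，否则记 1
df['lower_happy'] = df['Ladder score'].apply(lambda x: 0 if x < happiness_mean else 1)

# 等价的向量化写法
df['lower_happy_vec'] = (df['Ladder score'] >= happiness_mean).astype(int)

print(df['lower_happy'].tolist())  # [1, 1, 1, 0, 0]
```

<p>向量化写法与 apply 的结果完全一致，在大数据集上通常更快。</p>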
<h2 id="这种情况经常发生吗">这种情况经常发生吗？</h2>
<p>稍后我将更深入地探索时间上的变化，但现在，让我们看一下这些年来排在前20名的国家。</p>
<p>这个图展示了从2005年至今，前20名国家的所有分数，特别突出了它们的平均分和2021年的分数。</p>
<p>值得注意的是，尽管有疫情的影响，许多国家在2021年的分数比他们的平均分还要高。</p>
<p>尽管这些分数确实有所不同，但它们仍然相对较高。</p>
<p><img loading="lazy" src="img/output_12_1.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="为什么会有差异">为什么会有差异？</h2>
<p>我们现在了解到，北欧国家一直位居榜首。</p>
<p>让我们更仔细地探究一下欧洲与世界其他地区之间的这些差异。</p>
<p><img loading="lazy" src="img/output_14_1.png" alt=""  />
</p>
<p>幸福程度较高的国家往往是那些预期寿命更长、GDP更高的国家，西欧大体正是如此。</p>
<p>现在让我们明确地关注一下非洲&hellip;</p>
<p><img loading="lazy" src="img/output_16_1.png" alt=""  />
</p>
<p>总体而言，非洲国家有更低的预期寿命、更低的GDP，最终也有更低的幸福指数分数。</p>
<p><br><br></p>
<h2 id="其他因素">其他因素</h2>
<p>因此，GDP和预期寿命是影响因素。还有什么其他因素可以考虑呢？</p>
<p><img loading="lazy" src="img/output_18_1.png" alt=""  />
</p>
<p>正如我在图中指出的，自由和腐败是成反比的关系：更高的腐败通常伴随着更低的自由度。</p>
<p>然而，值得注意的是，一些欧洲国家的腐败感知水平也相当高。</p>
<p><br><br></p>
<h2 id="大陆视角">大陆视角</h2>
<p>让我们将这些国家按照各自所属的大陆分类，看看我们能否了解更多。</p>
<p>当然，我们预期西欧会排名很高，但是在幸福排名中，还有没有其他表现特别好或特别差的大陆？</p>
<p><img loading="lazy" src="img/output_20_1.png" alt=""  />
</p>
<br>
<p>可以清晰地看到有三个大陆群体。稍后将对此进行更多讨论&hellip;</p>
<p>撒哈拉以南非洲和南亚的分数最低。而西欧以及北美和澳新（ANZ）则遥遥领先，位于榜单的顶端。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">continent_score</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;Regional indicator&#39;</span><span class="p">)[[</span><span class="s1">&#39;Healthy life expectancy&#39;</span><span class="p">,</span><span class="s1">&#39;Logged GDP per capita&#39;</span><span class="p">,</span><span class="s1">&#39;Perceptions of corruption&#39;</span><span class="p">,</span><span class="s1">&#39;Freedom to make life choices&#39;</span><span class="p">,</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="mi">10</span><span class="p">]</span>

<span class="n">df_bottom</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;Country&#39;</span><span class="p">)[[</span><span class="s1">&#39;Logged GDP per capita&#39;</span><span class="p">,</span><span class="s1">&#39;Perceptions of corruption&#39;</span><span class="p">,</span><span class="s1">&#39;Freedom to make life choices&#39;</span><span class="p">,</span><span class="s1">&#39;Social support&#39;</span><span class="p">,</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">,</span><span class="n">ascending</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="mi">10</span><span class="p">]</span>

<span class="n">df_bottom</span><span class="p">[</span><span class="s1">&#39;Logged GDP per capita&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_bottom</span><span class="p">[</span><span class="s1">&#39;Logged GDP per capita&#39;</span><span class="p">]</span><span class="o">/</span><span class="mi">10</span>
<span class="n">df_bottom</span><span class="p">[</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_bottom</span><span class="p">[</span><span class="s1">&#39;Ladder score&#39;</span><span class="p">]</span><span class="o">/</span><span class="mi">5</span>

<span class="n">categorical</span> <span class="o">=</span> <span class="p">[</span><span class="n">var</span> <span class="k">for</span> <span class="n">var</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">df</span><span class="p">[</span><span class="n">var</span><span class="p">]</span><span class="o">.</span><span class="n">dtype</span><span class="o">==</span><span class="s1">&#39;O&#39;</span><span class="p">]</span>
<span class="n">continuous</span> <span class="o">=</span> <span class="p">[</span><span class="n">var</span> <span class="k">for</span> <span class="n">var</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">df</span><span class="p">[</span><span class="n">var</span><span class="p">]</span><span class="o">.</span><span class="n">dtype</span><span class="o">!=</span><span class="s1">&#39;O&#39;</span><span class="p">]</span>

<span class="c1">#refined</span>
<span class="n">continuous</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Logged GDP per capita&#39;</span><span class="p">,</span>
 <span class="s1">&#39;Social support&#39;</span><span class="p">,</span>
 <span class="s1">&#39;Healthy life expectancy&#39;</span><span class="p">,</span>
 <span class="s1">&#39;Freedom to make life choices&#39;</span><span class="p">,</span>
 <span class="s1">&#39;Generosity&#39;</span><span class="p">,</span>
 <span class="s1">&#39;Perceptions of corruption&#39;</span><span class="p">]</span>
</code></pre></div><p><br><br></p>
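<p>正文中用列表推导式按 dtype 区分类别型与数值型列，pandas 也内置了等价的 select_dtypes 方法（以下数据为示意）：</p>

```python
import pandas as pd

# 示意数据：一列字符串（dtype 为 object，即 'O'），两列数值
df = pd.DataFrame({
    'Country': ['Finland', 'Denmark'],
    'Ladder score': [7.8, 7.6],
    'Generosity': [0.11, 0.03],
})

# 正文中的列表推导式写法
categorical = [var for var in df.columns if df[var].dtype == 'O']
continuous = [var for var in df.columns if df[var].dtype != 'O']

# 等价的 pandas 内置写法
categorical2 = df.select_dtypes(include='object').columns.tolist()
continuous2 = df.select_dtypes(exclude='object').columns.tolist()

print(categorical, continuous)  # ['Country'] ['Ladder score', 'Generosity']
```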
<h2 id="高于和低于平均幸福水平的差异">高于和低于平均幸福水平的差异</h2>
<p>让我们一次绘制多个特征，按照平均幸福水平进行划分。如往常一样，最幸福的国家以绿色显示。</p>
<p><img loading="lazy" src="img/output_24_1.png" alt=""  />
</p>
<p>上面的图表确认了我们之前看到的一些内容，并带有一些值得注意的特点，比如社会支持。</p>
<p>有趣的是，在不太幸福的国家中，感知到的慷慨程度反而更高。</p>
<p><br><br></p>
<h2 id="全球视角">全球视角</h2>
<p>我们现在已经看到了基于多个因素不同国家之间明显的差异。</p>
<p>现在让我们从全球角度来看这个问题。</p>
<p><img loading="lazy" src="img/output_26_1.png" alt=""  />
</p>
<p>这张图确认了我们之前的发现，南亚和非洲处于红色区域。</p>
<p>但它也突出了我们可以进一步调查的地区。例如，中国和印度都在红色区域，它们的人口都超过了10亿。我们能否研究人口与幸福水平之间的关系？
<br><br></p>
<h2 id="人口">人口</h2>
<p>让我们引入更多的因素——比如人口。</p>
<p>这是否会影响幸福水平？</p>
<p><img loading="lazy" src="img/output_28_1.png" alt=""  />
</p>
<p>我们可以清晰地看到，更幸福的国家往往人口年龄偏大、人口规模更小。</p>
<p>我加入了欧洲作为参考。</p>
<p>那么生育率呢？</p>
<p><img loading="lazy" src="img/output_30_1.png" alt=""  />
</p>
<p>正如我所怀疑的，更幸福的国家通常生育的孩子也更少，这很可能与避孕措施更容易获得有关。</p>
<p><img loading="lazy" src="img/output_32_1.png" alt=""  />
</p>
<p>我很惊讶人口密度并不影响幸福感——尽管这可能是因为个人偏好！</p>
<p><br><br></p>
<h2 id="随着时间的推移有没有变化">随着时间的推移，有没有变化？</h2>
<p>不快乐的国家会变得更快乐吗？</p>
<p>这仅仅是一个时间点的快照吗？还是这些趋势更加持久？</p>
<p><img loading="lazy" src="img/output_34_1.png" alt=""  />
</p>
<br>
<p>令人关注的是，不快乐的国家依然不快乐；更糟糕的是，它们似乎变得更加不快乐了。</p>
<p>这种趋势是持续的吗？或者某些国家的分数会随着时间的推移而提高？</p>
<p>让我们更多地探讨一下随时间变化的情况。</p>
<p>在上面，我选取了几个国家作为样本。让我们用一个斜率图来绘制他们从2007年到2020年的变化，看看我们能否从中学到什么。</p>
<p><img loading="lazy" src="img/output_36_1.png" alt=""  />
</p>
<br>
<p>显然，多年来确实有很多变化。</p>
<p>哪些国家经历了最大的变化？</p>
<p><img loading="lazy" src="img/output_38_0.png" alt=""  />
</p>
<p><img loading="lazy" src="img/output_39_0.png" alt=""  />
</p>
<br>
<p>让我们比较在幸福指数得分方面增长最多和下降最多的两个国家：保加利亚和约旦。</p>
<p>我们将对比他们多年来的表现。</p>
<p><img loading="lazy" src="img/output_41_0.png" alt=""  />
</p>
<br>
<p>在探究随时间变化的问题时，我还想从大陆的角度来观察。</p>
<p>例如，西欧的所有国家都“幸福”吗？</p>
<p><img loading="lazy" src="img/output_43_1.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 使用geopandas可视化地图数据</title>
      <link>https://textdata.cn/blog/2023-08-31-data-visualization-how-to-plot-a-map-with-geopandas/</link>
      <pubDate>Thu, 31 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-31-data-visualization-how-to-plot-a-map-with-geopandas/</guid>
      <description>GeoDataFrame是GeoPandas中的核心数据结构，可以存储几何列并执行空间操作。GeoSeries 数据结构可以包含任何几何类型，例如点、线、多边形等。</description>
      <content:encoded><![CDATA[<h2 id="本文代码codezip"><a href="code.zip">本文代码</a></h2>
<p><br><br></p>
<p>Pandas 可能是最流行的用于数据分析的 Python 库。GeoPandas 扩展了 Pandas 的数据类型，使我们能够更轻松地在 Python 中处理地理空间数据。它目前有两种数据类型结构：GeoSeries 和 GeoDataFrame，它们分别是 pandas.Series 和 pandas.DataFrame 的子类。</p>
<p><strong>GeoDataFrame</strong> 是 <strong>GeoPandas</strong> 中的核心数据结构，可以存储几何列并执行空间操作。GeoSeries 数据结构可以包含任何几何类型，例如点、线、多边形等。</p>
<p>总体上，GeoDataFrame 可以看作 pandas.DataFrame 与 geopandas.GeoSeries 的组合：表格部分是普通的 DataFrame，几何列则是一个 GeoSeries。</p>
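<p>下面用一个最小示例（城市名称与坐标均为示意数据）演示 GeoDataFrame 的构成：普通属性列加上一个几何列：</p>

```python
import geopandas as gpd
from shapely.geometry import Point

# 属性列 'city' + geometry 列（两个点：经度、纬度），并指定 CRS
gdf = gpd.GeoDataFrame(
    {'city': ['Beijing', 'Shanghai']},
    geometry=[Point(116.40, 39.90), Point(121.47, 31.23)],
    crs='EPSG:4326',
)

print(type(gdf.geometry))  # 几何列本身是 geopandas.GeoSeries
print(gdf.shape)           # (2, 2)
```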
<p>为了按照本文中的方法进行操作，您需要从 ArcGIS Hub 下载一个世界国家的 Shapefile (<a href="https://hub.arcgis.com/datasets/esri::world-countries-generalized/">https://hub.arcgis.com/datasets/esri::world-countries-generalized/</a>)。 如果您已经有自己的 Shapefile 数据，也可以使用您自己的数据。</p>
<p><img loading="lazy" src="img/01-arcgis.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="一安装geopandas">一、安装GeoPandas</h2>
<p>GeoPandas库是纯 Python 编写的，但是它的一些依赖库是用 C 编写的，比如 GEOS、GDAL、PROJ。有时在 Windows 上安装这些 C 库并不容易。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">pyogrio</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">pyproj</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">rtree</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">shapely</span>
<span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">geopandas</span>
</code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">geopandas</span> <span class="k">as</span> <span class="nn">gpd</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Geopandas版本号: &#39;</span><span class="p">,</span> <span class="n">gpd</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
</code></pre></div><pre><code>Geopandas版本号:  0.13.2
</code></pre>
<p><br><br></p>
<h2 id="二读写数据">二、读写数据</h2>
<h3 id="21-读入数据">2.1 读入数据</h3>
<p>geopandas库支持多种数据格式</p>
<ul>
<li>shp</li>
<li>geojson</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>

<span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">&#39;data&#39;</span><span class="p">)</span>
</code></pre></div><pre><code>['.DS_Store',
 'World_Countries_Generalized',
 'World_Countries_Generalized.geojson',
 'world-population.geo.json']
</code></pre>
<h4 id="211-shp">2.1.1 shp</h4>
<p><img loading="lazy" src="img/02-shp_data.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">geopandas</span> <span class="k">as</span> <span class="nn">gpd</span>

<span class="c1">#shp必须与shx同处于一个文件夹内</span>
<span class="n">gdf</span> <span class="o">=</span> <span class="n">gpd</span><span class="o">.</span><span class="n">read_file</span><span class="p">(</span><span class="s1">&#39;data/World_Countries_Generalized/World_Countries_Generalized.shp&#39;</span><span class="p">)</span>
<span class="n">gdf</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<h3 id="212-geojson">2.1.2 geojson</h3>
<p>GeoJSON 本质上是 JSON 文件，所以该类数据文件的后缀名一般为<code>.geojson</code> 或 <code>.json</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">geopandas</span> <span class="k">as</span> <span class="nn">gpd</span>

<span class="n">gdf2</span> <span class="o">=</span> <span class="n">gpd</span><span class="o">.</span><span class="n">read_file</span><span class="p">(</span><span class="s1">&#39;data/world-population.geo.json&#39;</span><span class="p">)</span>
<span class="n">gdf2</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="22-保存数据">2.2 保存数据</h3>
<p>我们可以使用 GeoDataFrame.to_file() 将切片或修改后的 GeoDataFrame 写回文件。</p>
<ul>
<li>gdf.to_file(&lsquo;shp文件路径&rsquo;)</li>
<li>gdf.to_file(&lsquo;GeoJson文件路径&rsquo;, driver=&lsquo;GeoJSON&rsquo;)</li>
</ul>
<p>默认的文件格式是 Shapefile，但我们可以使用 driver 关键字指定其他格式。例如，让我们将 DataFrame 保存为 GeoJSON 格式。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s1">&#39;output/World_Countries_Generalized.shp&#39;</span><span class="p">)</span>
<span class="n">gdf2</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s1">&#39;output/World_Countries_Generalized.geojson&#39;</span><span class="p">,</span> <span class="n">driver</span><span class="o">=</span><span class="s1">&#39;GeoJSON&#39;</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="三geodataframe数据类型">三、GeoDataFrame数据类型</h2>
<p>让我们以 gdf GeoDataFrame 为例。大多数用于 pandas 的方法在 GeoPandas 中仍然适用。在本节中，我们只会看到一些示例。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#数据形状</span>
<span class="n">gdf2</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div><pre><code>(211, 10)
</code></pre>
<p>211行，10列，最后一列是多边形几何数据</p>
<br>
<h3 id="31-坐标参考系统crs">3.1 坐标参考系统（CRS）</h3>
<p>通常我们使用一个二维坐标系统，其中经度（垂直的南北线）和纬度（东西方向的水平线）用于标识地球表面上的位置。</p>
<p>GeoDataFrame 包含了将几何列中定义的多边形映射到地球表面的CRS信息。要查看CRS，访问 .crs 属性即可。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">crs</span>
</code></pre></div><pre><code>&lt;Geographic 2D CRS: EPSG:4326&gt;
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
</code></pre>
<br>
<h3 id="32-筛选">3.2 筛选</h3>
<p>筛选出中国的数据</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="p">[</span><span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;NAME&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;China&#39;</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<h3 id="32-中心">3.3 中心</h3>
<p>每条记录所代表实体（国家、省/州、城市）的地理中心</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">centroid</span>
</code></pre></div><pre><code>0       POINT (66.02695 33.83959)
1       POINT (20.06466 41.14350)
2        POINT (2.63167 28.16258)
3        POINT (1.58730 42.54147)
4      POINT (17.54495 -12.29359)
                  ...            
206     POINT (47.59134 15.77731)
207     POINT (20.80471 44.02662)
208     POINT (23.65690 -2.87535)
209    POINT (27.79925 -13.45302)
210    POINT (29.87045 -19.00312)
Length: 211, dtype: geometry
</code></pre>
<br>
<h3 id="33-投影">3.4 投影</h3>
<p>几何数据（多边形）转换到 EPSG 3857 坐标参考系统，并计算每个多边形的中心点（centroid）。 EPSG 3857 通常被称为 「Web Mercator 投影」，用于在 Web 地图上呈现地理数据。转换为此坐标参考系统可以用于生成更适合在 Web 地图上显示的数据。</p>
<p>根据代码运行提示， 更改代码。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">to_crs</span><span class="p">(</span><span class="mi">3857</span><span class="p">)</span><span class="o">.</span><span class="n">centroid</span>
</code></pre></div><pre><code>0       POINT (7354486.896 4017736.603)
1       POINT (2233478.912 5035365.825)
2        POINT (292615.786 3302567.699)
3        POINT (176697.417 5242446.985)
4      POINT (1953402.929 -1385745.990)
                     ...               
206     POINT (5299033.934 1780605.357)
207     POINT (2315093.566 5473720.440)
208     POINT (2633886.156 -323258.251)
209    POINT (3092719.338 -1515282.005)
210    POINT (3325201.834 -2158042.585)
Length: 211, dtype: geometry
</code></pre>
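<p>为了直观理解 EPSG 3857，下面用标准库 math 演示 Web Mercator 的正向投影公式（球面近似，仅作示意；实际转换仍应使用 to_crs）：</p>

```python
import math

R = 6378137.0  # WGS 84 长半轴（米），Web Mercator 按球体近似处理

def web_mercator(lon_deg, lat_deg):
    """经纬度（度）转换为 EPSG 3857 平面坐标（米）的球面近似公式"""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# 东经 180 度对应约 20037508 米，即该投影 x 方向的边界
print(round(web_mercator(180, 0)[0], 2))  # 20037508.34
```

<p>可以把上面 gdf2.to_crs(3857).centroid 输出中的坐标量级与这个边界值对照，验证单位确实是米。</p>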
<br>
<h3 id="34-计算区域面积">3.5 计算区域面积</h3>
<p>数据已经包含了一个 SHAPE_Area 列。即使没有这样的列，我们也可以根据几何数据自行计算面积。要获得正确的面积，必须使用<strong>等面积投影</strong>，这里选用 EPSG 6933（一种圆柱等面积投影）。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">to_crs</span><span class="p">(</span><span class="mi">6933</span><span class="p">)</span><span class="o">.</span><span class="n">area</span>
</code></pre></div><pre><code>0      6.419639e+11
1      2.875576e+10
2      2.318240e+12
3      4.702513e+08
4      1.247851e+12
           ...     
206    3.990873e+11
207    8.856234e+10
208    2.325712e+12
209    7.521495e+11
210    3.892279e+11
Length: 211, dtype: float64
</code></pre>
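<p>为什么必须使用等面积投影？以 Web Mercator（EPSG 3857）为例，其面积放大系数约为 1/cos²(纬度)，纬度越高面积被夸大得越厉害（以下为球面近似的示意计算）：</p>

```python
import math

def mercator_area_factor(lat_deg):
    """Web Mercator 在给定纬度处的面积放大系数，约为 1/cos²(lat)（球面近似）"""
    return 1 / math.cos(math.radians(lat_deg)) ** 2

print(round(mercator_area_factor(0), 2))   # 1.0：赤道处面积不变形
print(round(mercator_area_factor(60), 2))  # 4.0：北纬 60 度处面积被放大约 4 倍
```

<p>这正是网页地图上格陵兰看起来和非洲差不多大的原因，而 EPSG 6933 这类等面积投影则保证各处面积比例正确。</p>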
<p><br><br></p>
<h2 id="四可视化">四、可视化</h2>
<p>因为geopandas绘图功能是在matplotlib的基础上实现的，gdf.plot()一行代码就能绘图</p>
<h3 id="41-最简地图">4.1 最简地图</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_26_1.png" alt=""  />
</p>
<br>
<h3 id="42-更改颜色">4.2 更改颜色</h3>
<p>图的颜色和边界的颜色的更改</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;green&#39;</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_28_1.png" alt=""  />
</p>
<br>
<h3 id="43-colormap">4.3 colormap</h3>
<p>使用matplotlib的colormaps配色</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">cmap</span><span class="o">=</span><span class="s1">&#39;jet&#39;</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span> <span class="n">column</span><span class="o">=</span><span class="s1">&#39;NAME&#39;</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
</code></pre></div><p><img loading="lazy" src="img/output_30_1.png" alt=""  />
</p>
<br>
<h3 id="44-legend-colorbar">4.4 Legend Colorbar</h3>
<p>图例配色</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>

<span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;POP2005&#39;</span><span class="p">]</span><span class="o">=</span><span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;POP2005&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>

<span class="n">gdf2</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">cmap</span><span class="o">=</span><span class="s1">&#39;hot&#39;</span><span class="p">,</span><span class="n">linewidth</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">&#39;gray&#39;</span><span class="p">,</span><span class="n">column</span><span class="o">=</span><span class="s1">&#39;POP2005&#39;</span><span class="p">,</span><span class="n">legend</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;World Population&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_32_0.png" alt=""  />
</p>
<br>
<h3 id="45-局部">4.5 局部</h3>
<p>使用gdf2除了可以绘制全世界地图，还可以绘制局部地图，如中国地图</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">gdf2</span><span class="p">[</span><span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;NAME&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;China&#39;</span><span class="p">]</span>
</code></pre></div><p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">china_mainland</span> <span class="o">=</span> <span class="n">gdf2</span><span class="p">[</span><span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;NAME&#39;</span><span class="p">]</span> <span class="o">==</span><span class="s1">&#39;China&#39;</span><span class="p">]</span>
<span class="n">china_taiwan</span> <span class="o">=</span> <span class="n">gdf2</span><span class="p">[</span><span class="n">gdf2</span><span class="p">[</span><span class="s1">&#39;NAME&#39;</span><span class="p">]</span> <span class="o">==</span><span class="s1">&#39;Taiwan&#39;</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>

<span class="n">china_mainland</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span> 
<span class="n">china_taiwan</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Map of China&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_35_1.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | ggparliament包绘制议会图</title>
      <link>https://textdata.cn/blog/2023-08-29-ggparliament/</link>
      <pubDate>Tue, 29 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-29-ggparliament/</guid>
      <description>&lt;h2 id=&#34;ggparliament&#34;&gt;ggparliament&lt;/h2&gt;
&lt;p&gt;ggparliament 包是为了使用 &lt;code&gt;ggplot2&lt;/code&gt; 创建议会图表而开发的。该库还提供了一个示例数据集，其中包含了多个国家的选举数据。&lt;/p&gt;
&lt;p&gt;在本教程中，我们将使用来自2016年俄罗斯国家杜马选举的数据来进行所有示例。需要注意的是，根据每个国家的不同，议会的类型也会不同，因此您应该根据您想要显示的数据使用相应的类型。可用的类型包括 &amp;ldquo;semicircle&amp;rdquo;（美国、法国、西班牙等）、&amp;ldquo;circle&amp;rdquo;、&amp;ldquo;opposing_benches&amp;rdquo;（英国）、&amp;ldquo;classroom&amp;rdquo; 和 &amp;ldquo;horseshoe&amp;rdquo;（澳大利亚、新西兰）。&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;ggparliament.ipynb&#34;&gt;&lt;strong&gt;点击下载本文代码&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h2 id=&#34;一准备数据&#34;&gt;一、准备数据&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tidyverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Data&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;election_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;%&amp;gt;%&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;filter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;country&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Russia&amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2016&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;table class=&#34;dataframe&#34;&gt;
&lt;caption&gt;A data.frame: 7 × 8&lt;/caption&gt;
&lt;thead&gt;
	&lt;tr&gt;&lt;th scope=col&gt;year&lt;/th&gt;&lt;th scope=col&gt;country&lt;/th&gt;&lt;th scope=col&gt;house&lt;/th&gt;&lt;th scope=col&gt;party_long&lt;/th&gt;&lt;th scope=col&gt;party_short&lt;/th&gt;&lt;th scope=col&gt;seats&lt;/th&gt;&lt;th scope=col&gt;government&lt;/th&gt;&lt;th scope=col&gt;colour&lt;/th&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;th scope=col&gt;&amp;lt;int&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;chr&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;chr&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;chr&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;chr&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;int&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;int&amp;gt;&lt;/th&gt;&lt;th scope=col&gt;&amp;lt;chr&amp;gt;&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;Communist                         &lt;/td&gt;&lt;td&gt;CPRF  &lt;/td&gt;&lt;td&gt; 42&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#D50000&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;Liberal Democratic Party of Russia&lt;/td&gt;&lt;td&gt;LDPR  &lt;/td&gt;&lt;td&gt; 39&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#2862B3&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;A Just Russia                     &lt;/td&gt;&lt;td&gt;JR    &lt;/td&gt;&lt;td&gt; 23&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#FAB512&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;Rodina                            &lt;/td&gt;&lt;td&gt;Rodina&lt;/td&gt;&lt;td&gt;  1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#EA484A&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;Civic Platform                    &lt;/td&gt;&lt;td&gt;CPI   &lt;/td&gt;&lt;td&gt;  1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#641263&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;Independent                       &lt;/td&gt;&lt;td&gt;Ind   &lt;/td&gt;&lt;td&gt;  1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;#B4B4B4&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;2016&lt;/td&gt;&lt;td&gt;Russia&lt;/td&gt;&lt;td&gt;Duma&lt;/td&gt;&lt;td&gt;United Russia                     &lt;/td&gt;&lt;td&gt;UR    &lt;/td&gt;&lt;td&gt;343&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;#0C2C84&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二议会图&#34;&gt;2. Parliament Charts&lt;/h2&gt;
&lt;h3 id=&#34;21-半圆形议会图&#34;&gt;2.1 Semicircle Parliament Chart&lt;/h3&gt;
&lt;p&gt;To create a parliament chart with ggparliament in ggplot2, you first need to convert your data into a format the package understands. To do so, use the parliament_data function, where you can specify the original dataset, the type of parliament and its number of rows, the number of seats for each party, and other parameters.&lt;/p&gt;
&lt;p&gt;Then pass the data to ggplot2 and use the geom_parliament_seats() function.&lt;/p&gt;
&lt;p&gt;Note that the package provides a custom theme named theme_ggparliament.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tidyverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;parliament_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;election_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#39;semicircle&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;# semicircle layout&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;parl_rows&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# number of rows of seats&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;party_seats&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seats&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# seats for each party&lt;/span&gt;
                                &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;geom_parliament_seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; 
  &lt;span class=&#34;nf&#34;&gt;theme_ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_totalseats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;450&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Russia, 2016&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;scale_colour_manual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                      &lt;span class=&#34;n&#34;&gt;limits&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;coord_fixed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the aspect ratio to 1:1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_3_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-圆形议会图&#34;&gt;2.2 Circular Parliament Chart&lt;/h3&gt;
&lt;p&gt;To create another type of parliament, just pass a different value to the type argument of the parliament_data function. In the example below we create a circular parliament, which is used in some countries.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# install.packages(&amp;#34;tidyverse&amp;#34;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tidyverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ru_circle&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;parliament_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;election_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                             &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;circle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                             &lt;span class=&#34;n&#34;&gt;parl_rows&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                             &lt;span class=&#34;n&#34;&gt;party_seats&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ru_circle&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;geom_parliament_seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; 
  &lt;span class=&#34;nf&#34;&gt;theme_ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_totalseats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;450&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Russia, 2016&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;scale_colour_manual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_circle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                      &lt;span class=&#34;n&#34;&gt;limits&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_circle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;coord_fixed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the aspect ratio to 1:1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_6_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三进一步自定义&#34;&gt;3. Further Customization&lt;/h2&gt;
&lt;p&gt;The package provides additional functions to customize parliament charts, such as labeling parties, drawing majority-threshold lines, and highlighting the governing party.&lt;/p&gt;
&lt;p&gt;The following examples use the semicircle chart, but the same functions can be applied to other parliament types.&lt;/p&gt;
&lt;h3 id=&#34;31-突出显示执政党并绘制多数门槛&#34;&gt;3.1 Highlight the Governing Party and Draw the Majority Threshold&lt;/h3&gt;
&lt;p&gt;The geom_highlight_government function highlights the government, that is, the party or parties controlling the legislature. In addition, the &lt;strong&gt;draw_majoritythreshold&lt;/strong&gt; function adds a line marking the majority threshold.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tidyverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;parliament_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;election_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;parl_rows&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;party_seats&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;geom_parliament_seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; 
  &lt;span class=&#34;nf&#34;&gt;geom_highlight_government&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;government&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_totalseats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;450&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_majoritythreshold&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;225&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;TRUE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;theme_ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Russia, 2016&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;scale_colour_manual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                      &lt;span class=&#34;n&#34;&gt;limits&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
   &lt;span class=&#34;nf&#34;&gt;coord_fixed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_8_0.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;h3 id=&#34;32-议会柱状图&#34;&gt;3.2 Parliament Bar&lt;/h3&gt;
&lt;p&gt;You can also add a parliament bar showing the share of seats held by each party, using the geom_parliament_bar function, as shown below.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tidyverse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;parliament_data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;election_data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;parl_rows&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                                 &lt;span class=&#34;n&#34;&gt;party_seats&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;geom_parliament_seats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; 
  &lt;span class=&#34;nf&#34;&gt;geom_highlight_government&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;government&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;geom_parliament_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;party_long&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;TRUE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_totalseats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;450&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;draw_majoritythreshold&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;225&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;TRUE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;semicircle&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;theme_ggparliament&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;labs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Russia, 2016&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;scale_colour_manual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;colour&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                      &lt;span class=&#34;n&#34;&gt;limits&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ru_semicircle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;party_short&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;
  &lt;span class=&#34;nf&#34;&gt;coord_fixed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_10_0.png&#34; alt=&#34;&#34;  /&gt;

&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="ggparliament">ggparliament</h2>
<p>The ggparliament package was developed to create parliament charts with <code>ggplot2</code>. The package also provides a sample dataset containing election data from several countries.</p>
<p>All the examples in this tutorial use data from the 2016 Russian State Duma election. Note that parliament layouts differ from country to country, so you should use the type that matches the data you want to display. The available types are &ldquo;semicircle&rdquo; (United States, France, Spain, etc.), &ldquo;circle&rdquo;, &ldquo;opposing_benches&rdquo; (United Kingdom), &ldquo;classroom&rdquo;, and &ldquo;horseshoe&rdquo; (Australia, New Zealand).</p>
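<p>As a quick illustration of switching layouts, the sketch below draws the same chart with the horseshoe type. It is a minimal sketch, not code from the original tutorial: it assumes the <code>ru</code> data frame prepared in the data-preparation section and simply mirrors the semicircle example with a different <code>type</code>.</p>

```r
library(ggparliament)
library(tidyverse)

# Lay the same 2016 Duma seats out as a horseshoe (the shape used in
# Australia and New Zealand); only the type argument changes.
ru_horseshoe <- parliament_data(election_data = ru,
                                type = "horseshoe",
                                parl_rows = 10,
                                party_seats = ru$seats)

ggplot(ru_horseshoe, aes(x = x, y = y, colour = party_short)) +
  geom_parliament_seats() +
  theme_ggparliament() +
  labs(title = "Russia, 2016") +
  scale_colour_manual(values = ru_horseshoe$colour,
                      limits = ru_horseshoe$party_short) +
  coord_fixed()
```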
<p><a href="ggparliament.ipynb"><strong>Download the code for this article</strong></a></p>
<br>
<h2 id="一准备数据">1. Preparing the Data</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggparliament</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>

<span class="c1"># Data</span>
<span class="n">ru</span> <span class="o">&lt;-</span> <span class="n">election_data</span> <span class="o">%&gt;%</span>
  <span class="nf">filter</span><span class="p">(</span><span class="n">country</span> <span class="o">==</span> <span class="s">&#34;Russia&#34;</span> <span class="o">&amp;</span> <span class="n">year</span> <span class="o">==</span> <span class="m">2016</span><span class="p">)</span>

<span class="n">ru</span>
</code></pre></div><table class="dataframe">
<caption>A data.frame: 7 × 8</caption>
<thead>
	<tr><th scope=col>year</th><th scope=col>country</th><th scope=col>house</th><th scope=col>party_long</th><th scope=col>party_short</th><th scope=col>seats</th><th scope=col>government</th><th scope=col>colour</th></tr>
	<tr><th scope=col>&lt;int&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th><th scope=col>&lt;int&gt;</th><th scope=col>&lt;chr&gt;</th></tr>
</thead>
<tbody>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>Communist                         </td><td>CPRF  </td><td> 42</td><td>0</td><td>#D50000</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>Liberal Democratic Party of Russia</td><td>LDPR  </td><td> 39</td><td>0</td><td>#2862B3</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>A Just Russia                     </td><td>JR    </td><td> 23</td><td>0</td><td>#FAB512</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>Rodina                            </td><td>Rodina</td><td>  1</td><td>0</td><td>#EA484A</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>Civic Platform                    </td><td>CPI   </td><td>  1</td><td>0</td><td>#641263</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>Independent                       </td><td>Ind   </td><td>  1</td><td>0</td><td>#B4B4B4</td></tr>
	<tr><td>2016</td><td>Russia</td><td>Duma</td><td>United Russia                     </td><td>UR    </td><td>343</td><td>1</td><td>#0C2C84</td></tr>
</tbody>
</table>
<p><br><br></p>
<h2 id="二议会图">2. Parliament Charts</h2>
<h3 id="21-半圆形议会图">2.1 Semicircle Parliament Chart</h3>
<p>To create a parliament chart with ggparliament in ggplot2, you first need to convert your data into a format the package understands. To do so, use the parliament_data function, where you can specify the original dataset, the type of parliament and its number of rows, the number of seats for each party, and other parameters.</p>
<p>Then pass the data to ggplot2 and use the geom_parliament_seats() function.</p>
<p>Note that the package provides a custom theme named theme_ggparliament.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggparliament</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>

<span class="n">ru_semicircle</span> <span class="o">&lt;-</span> <span class="nf">parliament_data</span><span class="p">(</span><span class="n">election_data</span> <span class="o">=</span> <span class="n">ru</span><span class="p">,</span>
                                 <span class="n">type</span> <span class="o">=</span> <span class="s">&#39;semicircle&#39;</span><span class="p">,</span><span class="c1"># semicircle layout</span>
                                 <span class="n">parl_rows</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span> <span class="c1"># number of rows of seats</span>
                                 <span class="n">party_seats</span> <span class="o">=</span> <span class="n">ru</span><span class="o">$</span><span class="n">seats</span> <span class="c1"># seats for each party</span>
                                <span class="p">)</span>

<span class="nf">ggplot</span><span class="p">(</span><span class="n">ru_semicircle</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">party_short</span><span class="p">))</span> <span class="o">+</span>
  <span class="nf">geom_parliament_seats</span><span class="p">()</span> <span class="o">+</span> 
  <span class="nf">theme_ggparliament</span><span class="p">()</span> <span class="o">+</span>
  <span class="nf">draw_totalseats</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">450</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Russia, 2016&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">scale_colour_manual</span><span class="p">(</span><span class="n">values</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">colour</span><span class="p">,</span> 
                      <span class="n">limits</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">party_short</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">coord_fixed</span><span class="p">()</span>  <span class="c1"># set the aspect ratio to 1:1</span>
</code></pre></div><p><img loading="lazy" src="img/output_3_0.png" alt=""  />
</p>
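<p>To export the chart above, ggplot2&rsquo;s standard <code>ggsave()</code> can be used; this is a generic ggplot2 step rather than part of ggparliament, and the file name and dimensions below are illustrative.</p>

```r
library(ggplot2)

# ggsave() saves the last plot displayed; the output format is
# inferred from the file extension.
ggsave("ru_semicircle.png", width = 8, height = 5, dpi = 300)
```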
<br>
<h3 id="22-圆形议会图">2.2 Circular Parliament Chart</h3>
<p>To create another type of parliament, just pass a different value to the type argument of the parliament_data function. In the example below we create a circular parliament, which is used in some countries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggparliament</span><span class="p">)</span>
<span class="c1"># install.packages(&#34;tidyverse&#34;)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>

<span class="n">ru_circle</span> <span class="o">&lt;-</span> <span class="nf">parliament_data</span><span class="p">(</span><span class="n">election_data</span> <span class="o">=</span> <span class="n">ru</span><span class="p">,</span>
                             <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;circle&#34;</span><span class="p">,</span>
                             <span class="n">parl_rows</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span>
                             <span class="n">party_seats</span> <span class="o">=</span> <span class="n">ru</span><span class="o">$</span><span class="n">seats</span><span class="p">)</span>

<span class="nf">ggplot</span><span class="p">(</span><span class="n">ru_circle</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">party_short</span><span class="p">))</span> <span class="o">+</span>
  <span class="nf">geom_parliament_seats</span><span class="p">()</span> <span class="o">+</span> 
  <span class="nf">theme_ggparliament</span><span class="p">()</span> <span class="o">+</span>
  <span class="nf">draw_totalseats</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">450</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;circle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Russia, 2016&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">scale_colour_manual</span><span class="p">(</span><span class="n">values</span> <span class="o">=</span> <span class="n">ru_circle</span><span class="o">$</span><span class="n">colour</span><span class="p">,</span> 
                      <span class="n">limits</span> <span class="o">=</span> <span class="n">ru_circle</span><span class="o">$</span><span class="n">party_short</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">coord_fixed</span><span class="p">()</span>  <span class="c1"># fix the aspect ratio at 1:1</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_6_0.png" alt=""  />

​</p>
<p><br><br></p>
<h2 id="三进一步自定义">3. Further customization</h2>
<p>The package offers additional functions for customizing parliament charts, such as labelling parties, drawing a majority-threshold line, and highlighting the governing party.</p>
<p>The examples below use the semicircle chart, but the same functions can be applied to other parliament types.</p>
<h3 id="31-突出显示执政党并绘制多数门槛">3.1 Highlight the governing party and draw the majority threshold</h3>
<p>The geom_highlight_government function highlights the government, i.e. the party or parties controlling the legislature. In addition, the <strong>draw_majoritythreshold</strong> function adds a line marking the majority threshold.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggparliament</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>

<span class="n">ru_semicircle</span> <span class="o">&lt;-</span> <span class="nf">parliament_data</span><span class="p">(</span><span class="n">election_data</span> <span class="o">=</span> <span class="n">ru</span><span class="p">,</span>
                                 <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">,</span>
                                 <span class="n">parl_rows</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span>
                                 <span class="n">party_seats</span> <span class="o">=</span> <span class="n">ru</span><span class="o">$</span><span class="n">seats</span><span class="p">)</span>

<span class="nf">ggplot</span><span class="p">(</span><span class="n">ru_semicircle</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">party_short</span><span class="p">))</span> <span class="o">+</span>
  <span class="nf">geom_parliament_seats</span><span class="p">()</span> <span class="o">+</span> 
  <span class="nf">geom_highlight_government</span><span class="p">(</span><span class="n">government</span> <span class="o">==</span> <span class="m">1</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">draw_totalseats</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">450</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">draw_majoritythreshold</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">225</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">theme_ggparliament</span><span class="p">()</span> <span class="o">+</span>
  <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Russia, 2016&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">scale_colour_manual</span><span class="p">(</span><span class="n">values</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">colour</span><span class="p">,</span> 
                      <span class="n">limits</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">party_short</span><span class="p">)</span> <span class="o">+</span>
   <span class="nf">coord_fixed</span><span class="p">()</span> 
</code></pre></div><p><img loading="lazy" src="img/output_8_0.png" alt=""  />
</p>
<h3 id="32-议会柱状图">3.2 Parliament bar</h3>
<p>You can also add a parliament bar with the geom_parliament_bar function, showing each party's share of seats in the legislature, as below.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">ggparliament</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>

<span class="n">ru_semicircle</span> <span class="o">&lt;-</span> <span class="nf">parliament_data</span><span class="p">(</span><span class="n">election_data</span> <span class="o">=</span> <span class="n">ru</span><span class="p">,</span>
                                 <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">,</span>
                                 <span class="n">parl_rows</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span>
                                 <span class="n">party_seats</span> <span class="o">=</span> <span class="n">ru</span><span class="o">$</span><span class="n">seats</span><span class="p">)</span>

<span class="nf">ggplot</span><span class="p">(</span><span class="n">ru_semicircle</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">party_short</span><span class="p">))</span> <span class="o">+</span>
  <span class="nf">geom_parliament_seats</span><span class="p">()</span> <span class="o">+</span> 
  <span class="nf">geom_highlight_government</span><span class="p">(</span><span class="n">government</span> <span class="o">==</span> <span class="m">1</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">geom_parliament_bar</span><span class="p">(</span><span class="n">colour</span> <span class="o">=</span> <span class="n">colour</span><span class="p">,</span> <span class="n">party</span> <span class="o">=</span> <span class="n">party_long</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">draw_totalseats</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">450</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">draw_majoritythreshold</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">225</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;semicircle&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">theme_ggparliament</span><span class="p">()</span> <span class="o">+</span>
  <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Russia, 2016&#34;</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">scale_colour_manual</span><span class="p">(</span><span class="n">values</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">colour</span><span class="p">,</span> 
                      <span class="n">limits</span> <span class="o">=</span> <span class="n">ru_semicircle</span><span class="o">$</span><span class="n">party_short</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">coord_fixed</span><span class="p">()</span> 
</code></pre></div><p><img loading="lazy" src="img/output_10_0.png" alt=""  />

<br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Visualization | Drawing Waffle Charts with PyWaffle</title>
      <link>https://textdata.cn/blog/2023-08-29-visualization-pywaffle/</link>
      <pubDate>Tue, 29 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-29-visualization-pywaffle/</guid>
      <description>A waffle chart is a data-visualization tool for showing the proportions of different categories. Its defining feature is a rectangle (usually a square) split into small blocks, each representing a unit of data such as a percentage, frequency, or count. Block colours are usually tied to the data categories, letting viewers grasp the proportions among categories at a glance.</description>
      <content:encoded><![CDATA[<p>A waffle chart is a data-visualization tool for showing the proportions of different categories. Its defining feature is a rectangle (usually a square) split into small blocks, each representing a unit of data such as a percentage, frequency, or count. Block colours are usually tied to the data categories, letting viewers grasp the proportions among categories at a glance.</p>
<p><img loading="lazy" src="img/waffle_chart_powerpoint.png" alt=""  />
</p>
<p>Uses and advantages of waffle charts include:</p>
<ol>
<li><strong>Showing proportions:</strong> Waffle charts are well suited to showing how categories compare in share. Because the rectangle is split into blocks whose counts are proportional to each category's relative share, viewers can see the relative sizes at a glance.</li>
<li><strong>Easy to understand:</strong> The visual form is highly intuitive and conveys the data without elaborate explanation, which makes waffle charts very useful for communicating simple proportions.</li>
<li><strong>Works for categories and groups:</strong> Waffle charts can show the shares of multiple categories or groups, e.g. sales share by region or by product category.</li>
<li><strong>Visual appeal:</strong> Waffle charts are usually attractive and draw the audience's attention, helping the message land.</li>
<li><strong>Simplifying complex data:</strong> They reduce complex data to a simple form that highlights the main proportions without drowning in detail.</li>
<li><strong>Colour encoding:</strong> Choosing a different colour for each block group can convey extra information, e.g. distinguishing sub-categories or signalling relationships among categories.</li>
</ol>
<br>
<p>PyWaffle is an open-source, MIT-licensed Python package for drawing waffle charts. It provides a Figure-constructor class called <code>Waffle</code>, which can be passed to <code>matplotlib.pyplot.figure</code> and produces a matplotlib Figure object.</p>
<br>
<h2 id="安装">Installation</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">pip3</span> <span class="n">install</span> <span class="n">pywaffle</span>
</code></pre></div><p><br><br></p>
<h2 id="一快速入门">1. Quick start</h2>
<p>Here is a simple example that draws a <strong>5-row, 10-column</strong> waffle chart. The three values are drawn directly as blocks, and the block counts match the numbers in values because the values sum to the total number of blocks (rows times columns).</p>
<p>The values parameter accepts numbers in several formats, including a <strong>list, dict, or pandas.DataFrame</strong>.</p>
<h3 id="11-列表数据">1.1 List data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">pywaffle</span> <span class="kn">import</span> <span class="n">Waffle</span>

<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span> <span class="o">=</span> <span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span> <span class="o">=</span> <span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">fig</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;plot1.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_3_0.png" alt=""  />

​</p>
<p>Note: in this case one of rows and columns is redundant, because the chart size and the sum of the values are both 50. Either rows or columns can therefore be omitted and will be computed automatically from the sum of the values. See &quot;Auto-sizing&quot; for details.</p>
<br>
<h3 id="12-字典数据">1.2 Dict data</h3>
<p>When a dict is passed to values, its keys are used as labels and shown in the legend.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span> <span class="o">=</span> <span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;Cat1&#39;</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span> 
              <span class="s1">&#39;Cat2&#39;</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span> 
              <span class="s1">&#39;Cat3&#39;</span><span class="p">:</span> <span class="mi">4</span><span class="p">},</span>
    <span class="n">legend</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;upper left&#39;</span><span class="p">,</span> <span class="s1">&#39;bbox_to_anchor&#39;</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)}</span>
<span class="p">)</span>

<span class="n">fig</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">&#39;plot2.png&#39;</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_5_0.png" alt=""  />

​</p>
<br>
<h3 id="13-dataframe数据">1.3 DataFrame data</h3>
<p>Unlike a dict, whose keys automatically become the labels and legend, a DataFrame does not have its index used as labels by default. If you want the row index as labels, you must pass it to the labels parameter manually.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> 
                  <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">],</span> 
                  <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Cat1&#39;</span><span class="p">,</span> <span class="s1">&#39;Cat2&#39;</span><span class="p">,</span> <span class="s1">&#39;Cat3&#39;</span><span class="p">])</span>


<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">],</span>
    <span class="n">labels</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="p">),</span>  <span class="c1"># without this line, the legend is not shown</span>
    <span class="n">legend</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;upper left&#39;</span><span class="p">,</span> <span class="s1">&#39;bbox_to_anchor&#39;</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)}</span>
<span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_7_0.png" alt=""  />

​</p>
<p>PyWaffle can also draw onto an existing matplotlib axis with <code>Waffle.make_waffle</code>, which is useful when the waffle chart is one subplot among several:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">)</span>

<span class="c1"># modify the existing axis</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s2">&#34;Axis Title&#34;</span><span class="p">)</span>
<span class="c1"># keep an equal aspect ratio in both directions</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_aspect</span><span class="p">(</span><span class="n">aspect</span><span class="o">=</span><span class="s2">&#34;equal&#34;</span><span class="p">)</span>

<span class="n">Waffle</span><span class="o">.</span><span class="n">make_waffle</span><span class="p">(</span>
    <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span>  <span class="c1"># pass axis to make_waffle</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> 
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> 
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> 
    <span class="n">title</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;label&#34;</span><span class="p">:</span> <span class="s2">&#34;Waffle Title&#34;</span><span class="p">,</span> <span class="s2">&#34;loc&#34;</span><span class="p">:</span> <span class="s2">&#34;left&#34;</span><span class="p">}</span>
<span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_8_0.png" alt=""  />

​</p>
<p><br><br></p>
<h2 id="二值缩放尺寸调整">2. Value scaling &amp; sizing</h2>
<h3 id="21-数值缩放">2.1 Value scaling</h3>
<p>In practice the chart size usually does not equal the sum of the values, so the values must be scaled to fit the chart.</p>
<p>This is done by setting the rounding_rule parameter to the preferred rounding rule. It accepts three values: <strong>floor, ceil, or nearest</strong>.</p>
<p>Note: when rounding_rule is ceil or nearest, the scaled values may sum to more than the chart size, in which case the last category's blocks are not fully shown. So although nearest is the default rounding rule, floor is in practice the most consistent choice, because it avoids block overflow.</p>
<p>In the example below, rounding_rule=floor scales the values to 24, 23, and 1 blocks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">48</span><span class="p">,</span> <span class="mi">46</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
    <span class="n">rounding_rule</span><span class="o">=</span><span class="s1">&#39;floor&#39;</span>
<span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_10_0.png" alt=""  />

​</p>
<br>
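<p>The floor scaling described above is easy to check by hand. Below is a small sketch of how values map to block counts under each rounding rule; the function name <code>scale_blocks</code> is hypothetical, not PyWaffle's internal API:</p>

```python
import math

def scale_blocks(values, total_blocks, rounding_rule="floor"):
    """Scale raw values to block counts for a fixed-size waffle chart (illustrative)."""
    total = sum(values)
    scaled = [v / total * total_blocks for v in values]
    if rounding_rule == "floor":
        return [math.floor(s) for s in scaled]
    if rounding_rule == "ceil":
        return [math.ceil(s) for s in scaled]
    return [round(s) for s in scaled]  # "nearest"

# a 5 x 10 chart has 50 blocks; the values sum to 97
print(scale_blocks([48, 46, 3], 50))               # -> [24, 23, 1]
print(sum(scale_blocks([48, 46, 3], 50, "ceil")))  # -> 51, overflowing the 50 blocks
```

<p>Note that floor leaves 50 - 48 = 2 blocks empty, while ceil and nearest overflow by one block here, which is why floor is called the most consistent rule.</p>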
<h3 id="22-自动调整尺寸">2.2 Auto-sizing</h3>
<p>To avoid scaling the values, pass an integer to only one of rows or columns. The absolute values are then used directly as block counts, and the other parameter is computed automatically.</p>
<p>In the example below we set rows to 5, values to [48, 46, 3], and leave columns unset. The block counts then equal the values; since the values sum to 97, 20 columns are needed to fit all the blocks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">48</span><span class="p">,</span> <span class="mi">46</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_12_0.png" alt=""  />

​</p>
<p><br><br></p>
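<p>The auto-sizing arithmetic above can be sketched directly: with only rows fixed, the column count is the ceiling of the value total divided by the row count.</p>

```python
import math

values = [48, 46, 3]
rows = 5

# each value becomes that many blocks; columns must fit sum(values) blocks
columns = math.ceil(sum(values) / rows)
print(columns)  # -> 20
```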
<h2 id="三标题标签图例">3. Title, labels, legend</h2>
<ul>
<li>The <code>title</code> parameter accepts the arguments of <code>matplotlib.pyplot.title</code> as a dict.</li>
<li>The <code>labels</code> parameter accepts string labels as a list. If unspecified, the keys of <code>values</code> are used as labels.</li>
<li>The <code>legend</code> parameter accepts the arguments of <code>matplotlib.pyplot.legend</code> as a dict.</li>
</ul>
<p>Note: labels can also be specified via the <code>labels</code> key inside the <code>legend</code> parameter.</p>
<p>Together these parameters configure the title, labels, and legend of a pywaffle plot.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;Cat1&#39;</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span> <span class="s1">&#39;Cat2&#39;</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span> <span class="s1">&#39;Cat3&#39;</span><span class="p">:</span> <span class="mi">4</span><span class="p">}</span>

<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="n">data</span><span class="p">,</span>
    <span class="n">title</span><span class="o">=</span><span class="p">{</span>
        <span class="s1">&#39;label&#39;</span><span class="p">:</span> <span class="s1">&#39;Example plot&#39;</span><span class="p">,</span>
        <span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span>
        <span class="s1">&#39;fontdict&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;fontsize&#39;</span><span class="p">:</span> <span class="mi">20</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="n">labels</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">k</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">v</span> <span class="o">/</span> <span class="nb">sum</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">values</span><span class="p">())</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s2">%)&#34;</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">items</span><span class="p">()],</span>
    <span class="n">legend</span><span class="o">=</span><span class="p">{</span>
        <span class="c1"># &#39;labels&#39;: [f&#34;{k} ({v}%)&#34; for k, v in data.items()],  # lebels could also be under legend instead</span>
        <span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;lower left&#39;</span><span class="p">,</span>
        <span class="s1">&#39;bbox_to_anchor&#39;</span><span class="p">:</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.4</span><span class="p">),</span>
        <span class="s1">&#39;ncol&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">),</span>
        <span class="s1">&#39;framealpha&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s1">&#39;fontsize&#39;</span><span class="p">:</span> <span class="mi">12</span>
    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_14_0.png" alt=""  />

​</p>
<p><br><br></p>
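<p>The f-string passed to labels above simply formats each category with its percentage share; run on its own it produces:</p>

```python
data = {'Cat1': 30, 'Cat2': 16, 'Cat3': 4}

# same expression as the labels argument in the example above
labels = [f"{k} ({int(v / sum(data.values()) * 100)}%)" for k, v in data.items()]
print(labels)  # -> ['Cat1 (60%)', 'Cat2 (32%)', 'Cat3 (8%)']
```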
<h2 id="四块配色">4. Block colours</h2>
<p>The colors parameter accepts a list or tuple of colours, whose length must equal that of values. Accepted formats include case-insensitive hex RGB or RGBA, RGB or RGBA tuples, single-character notation, case-insensitive X11/CSS4 colour names, and anything else Matplotlib recognizes. See the Matplotlib colors documentation for the full list.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">colors</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;#232066&#34;</span><span class="p">,</span> <span class="s2">&#34;#983D3D&#34;</span><span class="p">,</span> <span class="s2">&#34;#DCB732&#34;</span><span class="p">]</span>
<span class="p">)</span>

</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_16_0.png" alt=""  />

​</p>
<p><br> Another way to change block colours is to set them in bulk by passing a colormap to the cmap_name parameter.</p>
<p><strong>Note</strong>: sequential colormaps do not work in PyWaffle. Only qualitative colormaps are supported, namely Pastel1, Pastel2, Paired, Accent, Dark2, Set1, Set2, Set3, tab10, tab20, tab20b, and tab20c. See the list of colormaps in Matplotlib for more examples.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">columns</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">cmap_name</span><span class="o">=</span><span class="s2">&#34;tab10&#34;</span>
<span class="p">)</span>

</code></pre></div><p>​ <br>
<img loading="lazy" src="img/output_18_0.png" alt=""  />

​</p>
<p><br><br></p>
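<p>Why only qualitative colormaps? They carry a short, fixed list of distinct colours, one of which can be assigned to each category. A quick way to inspect one (plain Matplotlib, independent of PyWaffle) is:</p>

```python
import matplotlib

# qualitative colormaps like tab10 are ListedColormaps with a fixed colour count
cmap = matplotlib.colormaps["tab10"]
print(cmap.N)                         # -> 10 distinct colours

colors = [cmap(i) for i in range(3)]  # one RGBA tuple per category
print(len(colors))                    # -> 3
```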
<h2 id="五块形状">5. Block shapes</h2>
<p>Blocks default to squares, but other shapes can be used via the <strong>characters</strong> and <strong>icons</strong> parameters.</p>
<h3 id="51-characters">5.1 Characters</h3>
<p>Blocks can be rendered as Unicode characters instead of rectangles by passing the desired character to the <strong>characters</strong> parameter.</p>
<p>To give each category its own character, pass a list or tuple of characters; its length must equal that of values.</p>
<p>To specify a font, pass the absolute path of a .ttf or .otf font file to the <strong>font_file</strong> parameter.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">colors</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;#4C8CB5&#34;</span><span class="p">,</span> <span class="s2">&#34;#B7CBD7&#34;</span><span class="p">,</span> <span class="s2">&#34;#C0C0C0&#34;</span><span class="p">],</span>
    <span class="n">characters</span><span class="o">=</span><span class="s1">&#39;❤&#39;</span><span class="p">,</span>
    <span class="n">font_size</span><span class="o">=</span><span class="mi">24</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_20_0.png" alt=""  /></p>
<br>
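<p>The example above assigns a single character to every category. A minimal sketch of the per-category form, building the keyword arguments first and checking the length rule (the commented font path is hypothetical; the dict would be passed to <code>plt.figure(FigureClass=Waffle, **waffle_kwargs)</code>):</p>

```python
# Sketch of per-category characters: the characters list must have the
# same length as values. font_file (commented out) would be a
# hypothetical absolute path to a .ttf/.otf file.
waffle_kwargs = dict(
    rows=5,
    values=[30, 16, 4],
    colors=["#4C8CB5", "#B7CBD7", "#C0C0C0"],
    characters=['♥', '★', '✿'],  # one Unicode character per category
    font_size=24,
    # font_file='/path/to/SomeFont.ttf',
)
# One character per category is required:
assert len(waffle_kwargs['characters']) == len(waffle_kwargs['values'])
# Usage: fig = plt.figure(FigureClass=Waffle, **waffle_kwargs)
```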
<h3 id="52-图标">5.2 Icons</h3>
<p>A waffle chart drawn with icons is also known as a pictogram chart.</p>
<p>PyWaffle supports drawing charts with Font Awesome icons. For how Font Awesome is integrated into PyWaffle, see its "Font Awesome Integration" page; to search available icon names, visit <a href="https://fontawesome.com/search">https://fontawesome.com/search</a>.</p>
<p>When icons are used, the parameters that set block size are ignored, including interval_ratio_x, interval_ratio_y, and block_aspect_ratio. Use font_size to set the icon size instead; see FontProperties.set_size for the allowed values.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">colors</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;#232066&#34;</span><span class="p">,</span> <span class="s2">&#34;#983D3D&#34;</span><span class="p">,</span> <span class="s2">&#34;#DCB732&#34;</span><span class="p">],</span>
    <span class="n">icons</span><span class="o">=</span><span class="s1">&#39;star&#39;</span><span class="p">,</span>
    <span class="n">font_size</span><span class="o">=</span><span class="mi">24</span> <span class="c1"># icon size</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_22_0.png" alt=""  /></p>
<p>Each category can have its own icon: pass a list or tuple of icon names to the icons parameter. Its length must match the length of values.</p>
<p>The Font Awesome library ships several icon styles, including Solid, Regular, and Brands. Specify the style with the icon_style parameter; by default, icons are looked up in the Solid style.</p>
<p>With icon_legend=True, the legend markers are icons; otherwise they are color patches.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">colors</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;#FFA500&#34;</span><span class="p">,</span> <span class="s2">&#34;#4384FF&#34;</span><span class="p">,</span> <span class="s2">&#34;#C0C0C0&#34;</span><span class="p">],</span>
    <span class="n">icons</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;sun&#39;</span><span class="p">,</span> <span class="s1">&#39;cloud-showers-heavy&#39;</span><span class="p">,</span> <span class="s1">&#39;snowflake&#39;</span><span class="p">],</span>
    <span class="n">font_size</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">icon_style</span><span class="o">=</span><span class="s1">&#39;solid&#39;</span><span class="p">,</span>
    <span class="c1"># With icon_legend=True, legend markers are icons; otherwise they are color patches.</span>
    <span class="n">icon_legend</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
    <span class="n">legend</span><span class="o">=</span><span class="p">{</span>
        <span class="s1">&#39;labels&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;Sun&#39;</span><span class="p">,</span> <span class="s1">&#39;Shower&#39;</span><span class="p">,</span> <span class="s1">&#39;Snow&#39;</span><span class="p">],</span> 
        <span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;upper left&#39;</span><span class="p">,</span> 
        <span class="s1">&#39;bbox_to_anchor&#39;</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">)</span>

</code></pre></div><p><img loading="lazy" src="img/output_24_0.png" alt=""  /></p>
<p>Font Awesome locates an icon by both its name and its style, so when your icons come from different styles you must specify the style for each one. For this reason, the icon_style parameter also accepts a list or tuple of style strings, one per icon.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span>
    <span class="n">FigureClass</span><span class="o">=</span><span class="n">Waffle</span><span class="p">,</span>
    <span class="n">rows</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="n">colors</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;#FFA500&#34;</span><span class="p">,</span> <span class="s2">&#34;#4384FF&#34;</span><span class="p">,</span> <span class="s2">&#34;#C0C0C0&#34;</span><span class="p">],</span>
    <span class="n">icons</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;sun&#39;</span><span class="p">,</span> <span class="s1">&#39;cloud-showers-heavy&#39;</span><span class="p">,</span> <span class="s1">&#39;font-awesome&#39;</span><span class="p">],</span>
    <span class="n">icon_size</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">icon_style</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;regular&#39;</span><span class="p">,</span> <span class="s1">&#39;solid&#39;</span><span class="p">,</span> <span class="s1">&#39;brands&#39;</span><span class="p">],</span>
    <span class="n">icon_legend</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
    <span class="n">legend</span><span class="o">=</span><span class="p">{</span>
        <span class="s1">&#39;labels&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;Sun&#39;</span><span class="p">,</span> <span class="s1">&#39;Shower&#39;</span><span class="p">,</span> <span class="s1">&#39;Flag&#39;</span><span class="p">],</span> 
        <span class="s1">&#39;loc&#39;</span><span class="p">:</span> <span class="s1">&#39;upper left&#39;</span><span class="p">,</span> 
        <span class="s1">&#39;bbox_to_anchor&#39;</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/output_26_0.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="总结">Summary</h2>
<p>Although waffle charts excel at conveying part-to-whole relationships, they have limits: they are a poor fit for large datasets or complex relationships, because the number of blocks can grow too large to read. Decide whether and when to use a waffle chart based on the characteristics of your data and what you need to show.</p>
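<p>One way to stay within that limit is to scale raw counts down to a fixed block budget before plotting; the category names and counts below are made-up illustration data. (When both rows and columns are specified, PyWaffle performs a similar scaling of values automatically.)</p>

```python
# Scale large raw counts to a fixed number of blocks so the waffle
# stays readable. Category names and counts are made-up.
raw = {'Category A': 31200, 'Category B': 16800, 'Category C': 4100}

total_blocks = 50  # block budget for the whole chart
total = sum(raw.values())
scaled = {k: round(v / total * total_blocks) for k, v in raw.items()}
print(scaled)  # {'Category A': 30, 'Category B': 16, 'Category C': 4}
```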
<p><a href="PyWaffle.ipynb">Click here to download the code for this article</a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Visualization | Netflix Data Visualization Best Practices</title>
      <link>https://textdata.cn/blog/2023-08-28-best-practice-netflix-data-visualization/</link>
      <pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-28-best-practice-netflix-data-visualization/</guid>
      <description>The purpose of this notebook is to practice data visualization and, along the way, convey some best practices.</description>
      <content:encoded><![CDATA[<p>The purpose of this notebook is to practice data visualization and, along the way, convey some best practices.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- Author: JOSH
- 原文: https://www.kaggle.com/code/joshuaswords/netflix-data-visualization/notebook?scriptVersionId=58425238&amp;cellId=17
</code></pre></div><br>
<ul>
<li><a href="netflix_titles.csv">Download the data for this article</a></li>
<li><a href="netflix-data-visualization.ipynb">Download the code for this article</a></li>
</ul>
<p><br><br></p>
<h2 id="一数据预处理">1. Data Preprocessing</h2>
<h3 id="11-导入数据">1.1 Importing the Data</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s2">&#34;ignore&#34;</span><span class="p">)</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;netflix_titles.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="n">null_rate</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">isna</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">*</span><span class="mi">100</span>
    <span class="k">if</span> <span class="n">null_rate</span><span class="o">&gt;</span><span class="mi">0</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;</span><span class="si">{}</span><span class="s2"> null rate: </span><span class="si">{}</span><span class="s2">%&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">col</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">null_rate</span><span class="p">,</span><span class="mi">2</span><span class="p">)))</span>
</code></pre></div><pre><code>director null rate: 29.91%
cast null rate: 9.37%
country null rate: 9.44%
date_added null rate: 0.11%
rating null rate: 0.05%
duration null rate: 0.03%
</code></pre>
<p>Six fields in the dataset contain missing values, and nearly 30% of the records are missing the director field.</p>
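<p>The per-column loop above can also be written as a pandas one-liner; a tiny made-up frame stands in for netflix_titles.csv here:</p>

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame for netflix_titles.csv
df = pd.DataFrame({'director': [np.nan, 'X', np.nan, 'Y'],
                   'title': ['a', 'b', 'c', 'd']})

# isna().mean() is the fraction of nulls per column
null_rate = (df.isna().mean() * 100).round(2)
print(null_rate['director'])  # 50.0
```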
<p><br><br></p>
<h3 id="12-缺失值字段处理">1.2 Handling Fields with Missing Values</h3>
<p>This always depends on the scenario, but in this case I would:</p>
<ul>
<li>Replace missing country values with the mode (most common) country</li>
<li>Keep the director field, since it can be interesting to watch films by a particular director</li>
<li>Keep the cast field, since it can be interesting to watch films featuring a particular actor</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1"># Replacements</span>
<span class="c1"># Replace missing country values with the mode (most common) country</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;country&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mode</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># Keep cast: watching films featuring a particular actor can be interesting</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;cast&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s1">&#39;No Data&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Keep director: watching films by a particular director can be interesting</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;director&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s1">&#39;No Data&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>


<span class="c1"># Drop remaining rows with nulls</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1"># Drop duplicate rows</span>
<span class="n">df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span> <span class="kc">True</span><span class="p">)</span>


<span class="c1"># Check the null count of each column</span>
<span class="n">df</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</code></pre></div><pre><code>show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
</code></pre>
<p><br><br></p>
<h3 id="13-日期处理">1.3 Date Handling</h3>
<p>Convert date_added to the datetime type (format='mixed' requires pandas 2.0+), then add three new fields: month_added, month_name_added, and year_added.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date_added&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;date_added&#39;</span><span class="p">],</span> <span class="nb">format</span><span class="o">=</span><span class="s1">&#39;mixed&#39;</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;month_added&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">month</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;month_name_added&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">month_name</span><span class="p">()</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;year_added&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;date_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二可视化">2. Visualization</h2>
<h3 id="21-配色">2.1 Color Palette</h3>
<p>All visualizations use Netflix's signature brand colors, which looks professional and keeps readers engaged.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">plt</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">&#39;figure.dpi&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">140</span>

<span class="c1"># Palette</span>
<span class="n">sns</span><span class="o">.</span><span class="n">palplot</span><span class="p">([</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">,</span> <span class="s1">&#39;#b20710&#39;</span><span class="p">,</span> <span class="s1">&#39;#e50914&#39;</span><span class="p">,</span><span class="s1">&#39;#f5f5f1&#39;</span><span class="p">])</span>

<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">&#34;Netflix brand palette &#34;</span><span class="p">,</span><span class="n">loc</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span><span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mf">1.2</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_8_0.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="22-netflix发展时间线可视化">2.2 Visualizing the Netflix Timeline</h3>
<p>Netflix started out renting DVDs and now reaches an audience of more than 150 million viewers - this is their story.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Timeline code from Subin An&#39;s awesome notebook</span>
<span class="c1"># https://www.kaggle.com/subinium/awesome-visualization-with-titanic-dataset</span>


<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="c1">## these go on the numbers below</span>
<span class="n">tl_dates</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">&#34;1997</span><span class="se">\n</span><span class="s2">Founded&#34;</span><span class="p">,</span>
    <span class="s2">&#34;1998</span><span class="se">\n</span><span class="s2">Mail Service&#34;</span><span class="p">,</span>
    <span class="s2">&#34;2003</span><span class="se">\n</span><span class="s2">Goes Public&#34;</span><span class="p">,</span>
    <span class="s2">&#34;2007</span><span class="se">\n</span><span class="s2">Streaming service&#34;</span><span class="p">,</span>
    <span class="s2">&#34;2016</span><span class="se">\n</span><span class="s2">Goes Global&#34;</span><span class="p">,</span>
    <span class="s2">&#34;2021</span><span class="se">\n</span><span class="s2">Netflix &amp; Chill&#34;</span>
<span class="p">]</span>

<span class="n">tl_x</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mf">5.3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]</span>

<span class="c1">## these go on the numbers</span>
<span class="n">tl_sub_x</span> <span class="o">=</span> <span class="p">[</span><span class="mf">1.5</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mf">6.5</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>


<span class="n">tl_sub_times</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">&#34;1998&#34;</span><span class="p">,</span><span class="s2">&#34;2000&#34;</span><span class="p">,</span><span class="s2">&#34;2006&#34;</span><span class="p">,</span><span class="s2">&#34;2010&#34;</span><span class="p">,</span><span class="s2">&#34;2012&#34;</span>
<span class="p">]</span>

<span class="n">tl_text</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">&#34;Netflix.com launched&#34;</span><span class="p">,</span>
    <span class="s2">&#34;Starts</span><span class="se">\n</span><span class="s2">Personal</span><span class="se">\n</span><span class="s2">Recommendations&#34;</span><span class="p">,</span><span class="s2">&#34;Billionth DVD Delivery&#34;</span><span class="p">,</span><span class="s2">&#34;Canadian</span><span class="se">\n</span><span class="s2">Launch&#34;</span><span class="p">,</span><span class="s2">&#34;UK Launch</span><span class="se">\n</span><span class="s2">(my birthplace)&#34;</span><span class="p">]</span>



<span class="c1"># Set figure &amp; Axes</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">constrained_layout</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mf">1.75</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>


<span class="c1"># Timeline : line</span>
<span class="n">ax</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">xmin</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">xmax</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>


<span class="c1"># Timeline : Date Points</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">tl_x</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_x</span><span class="p">)),</span> <span class="n">s</span><span class="o">=</span><span class="mi">120</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">tl_x</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_x</span><span class="p">)),</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">&#39;#fafafa&#39;</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c1"># Timeline : Time Points</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">)),</span> <span class="n">s</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">,</span><span class="n">zorder</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>

<span class="c1"># Date Text</span>
<span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">date</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">tl_x</span><span class="p">,</span> <span class="n">tl_dates</span><span class="p">):</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.55</span><span class="p">,</span> <span class="n">date</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span> 
            <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span>
            <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    

<span class="c1"># Stemplot : vertical line</span>
<span class="n">levels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">))</span>    
<span class="n">levels</span><span class="p">[::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.3</span>
<span class="n">levels</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.3</span>
<span class="n">markerline</span><span class="p">,</span> <span class="n">stemline</span><span class="p">,</span> <span class="n">baseline</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">stem</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">,</span> <span class="n">levels</span><span class="p">)</span>  <span class="c1"># use_line_collection was removed in matplotlib 3.8</span>
<span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">baseline</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">markerline</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">&#39;,&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">stemline</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>

<span class="c1"># Text</span>
<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">txt</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">tl_sub_x</span><span class="p">,</span> <span class="n">tl_sub_times</span><span class="p">,</span> <span class="n">tl_text</span><span class="p">):</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mf">1.3</span><span class="o">*</span><span class="p">(</span><span class="n">idx</span><span class="o">%</span><span class="mi">2</span><span class="p">)</span><span class="o">-</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span> 
            <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span>
            <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span> <span class="k">if</span> <span class="n">idx</span><span class="o">!=</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">)</span> <span class="k">else</span> <span class="s1">&#39;#b20710&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
    
    <span class="n">ax</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mf">1.3</span><span class="o">*</span><span class="p">(</span><span class="n">idx</span><span class="o">%</span><span class="mi">2</span><span class="p">)</span><span class="o">-</span><span class="mf">0.6</span><span class="p">,</span> <span class="n">txt</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span> 
        <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span> <span class="k">if</span> <span class="n">idx</span><span class="o">!=</span><span class="nb">len</span><span class="p">(</span><span class="n">tl_sub_x</span><span class="p">)</span> <span class="k">else</span> <span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>



<span class="c1"># Spine</span>
<span class="k">for</span> <span class="n">spine</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&#34;left&#34;</span><span class="p">,</span> <span class="s2">&#34;top&#34;</span><span class="p">,</span> <span class="s2">&#34;right&#34;</span><span class="p">,</span> <span class="s2">&#34;bottom&#34;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">spine</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="c1"># Ticks    </span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">([])</span> 
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticks</span><span class="p">([])</span> 

<span class="c1"># Title</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s2">&#34;Netflix through the years&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">2.4</span><span class="p">,</span><span class="mf">1.57</span><span class="p">,</span><span class="s2">&#34;From DVD rentals to a global audience of over 150m people - is it time for Netflix to Chill?&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_10_0.png" alt="png"  />
</p>
<p><br><br></p>
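<p>The timeline labels in the block above alternate above and below the axis via the <code>1.3*(idx%2)-0.5</code> offset. A minimal self-contained sketch of that alternating-offset trick, using a hypothetical <code>events</code> list rather than the notebook's variables:</p>

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical milestones, standing in for the notebook's timeline data
events = [("1997", "Founded"), ("2007", "Streaming launched"),
          ("2013", "First originals"), ("2021", "200m+ subscribers")]

fig, ax = plt.subplots(figsize=(8, 2))
ax.axhline(0, color="black", lw=1)

for idx, (year, label) in enumerate(events):
    y = 1.3 * (idx % 2) - 0.5  # even idx -> -0.5 (below axis), odd idx -> 0.8 (above)
    ax.scatter(idx, 0, s=60, color="#b20710", zorder=3)
    ax.text(idx, y, year, ha="center", fontweight="bold",
            color="#b20710" if idx == len(events) - 1 else "#4a4a4a")
    ax.text(idx, y - 0.1, label, ha="center", va="top", fontsize=9)

ax.set_ylim(-1.5, 1.5)
ax.axis("off")
fig.savefig("timeline_demo.png")
```

<p>The modulo keeps neighbouring labels from colliding without hand-picking a y coordinate for each one.</p>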
<h3 id="23-内容分布">2.3 Content Distribution</h3>
<p>Now that we've seen how Netflix came to dominate our TV screens, let's take a look at the content it actually offers&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;type&#39;</span><span class="p">])[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="n">y</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span><span class="o">/</span><span class="n">y</span><span class="p">))</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>

<span class="n">mf_ratio</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">r</span><span class="p">)</span><span class="o">.</span><span class="n">T</span>


<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mf">6.5</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">))</span>

<span class="n">ax</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">mf_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">],</span> 
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;Movie&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">mf_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">],</span> <span class="n">left</span><span class="o">=</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">],</span> 
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;TV Show&#39;</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">([])</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticks</span><span class="p">([])</span>
<span class="c1">#ax.set_yticklabels(mf_ratio.index, fontfamily=&#39;serif&#39;, fontsize=11)</span>


<span class="c1"># movie percentage</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">mf_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>

    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.25</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
    
    
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">mf_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">mf_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.25</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>

    
<span class="c1"># Title &amp; Subtitle</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.125</span><span class="p">,</span><span class="mf">1.03</span><span class="p">,</span><span class="s1">&#39;Movie &amp; TV Show distribution&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.125</span><span class="p">,</span><span class="mf">0.92</span><span class="p">,</span><span class="s1">&#39;We see vastly more movies than TV shows on Netflix.&#39;</span><span class="p">,</span><span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>  

<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span> <span class="s1">&#39;bottom&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
    


<span class="c1">#ax.legend(loc=&#39;lower center&#39;, ncol=3, bbox_to_anchor=(0.5, -0.06))</span>

<span class="c1"># Removing legend due to labelled plot</span>
<span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_12_0.png" alt=""  />
</p>
<p><br><br></p>
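<p>The Movie/TV split above is computed with <code>groupby</code> + <code>count()</code> divided by <code>len(df)</code>; <code>value_counts(normalize=True)</code> yields the same ratios in a single call. A small sketch on hypothetical toy data (the ten-row frame below stands in for the Netflix titles frame):</p>

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix titles data
toy = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

# One call replaces groupby(['type'])['type'].count() / len(df)
ratio = toy["type"].value_counts(normalize=True).round(2)
print(ratio["Movie"], ratio["TV Show"])  # 0.7 0.3
```

<p><code>normalize=True</code> divides each count by the total for you, so the intermediate <code>x</code>, <code>y</code>, <code>r</code> variables become unnecessary.</p>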
<h3 id="24-国别内容制作量">2.4 Content Production by Country</h3>
<p>So we now know that Netflix carries far more movies than TV shows (which surprised me!).</p>
<p>What if we break the content down by country?</p>
<p>I'd guess the United States holds the most content. I wonder how my country, the UK, compares?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Quick feature engineering</span>

<span class="c1"># Helper column for various plots</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="c1"># Many productions list several countries, which would skew our results, so we&#39;ll grab only the first one mentioned</span>

<span class="c1"># Let&#39;s retrieve just the first country</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;,&#34;</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>

<span class="c1"># Rating ages from this notebook: https://www.kaggle.com/andreshg/eda-beginner-to-expert-plotly (thank you!)</span>

<span class="n">ratings_ages</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;TV-PG&#39;</span><span class="p">:</span> <span class="s1">&#39;Older Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-MA&#39;</span><span class="p">:</span> <span class="s1">&#39;Adults&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-Y7-FV&#39;</span><span class="p">:</span> <span class="s1">&#39;Older Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-Y7&#39;</span><span class="p">:</span> <span class="s1">&#39;Older Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-14&#39;</span><span class="p">:</span> <span class="s1">&#39;Teens&#39;</span><span class="p">,</span>
    <span class="s1">&#39;R&#39;</span><span class="p">:</span> <span class="s1">&#39;Adults&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-Y&#39;</span><span class="p">:</span> <span class="s1">&#39;Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;NR&#39;</span><span class="p">:</span> <span class="s1">&#39;Adults&#39;</span><span class="p">,</span>
    <span class="s1">&#39;PG-13&#39;</span><span class="p">:</span> <span class="s1">&#39;Teens&#39;</span><span class="p">,</span>
    <span class="s1">&#39;TV-G&#39;</span><span class="p">:</span> <span class="s1">&#39;Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;PG&#39;</span><span class="p">:</span> <span class="s1">&#39;Older Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;G&#39;</span><span class="p">:</span> <span class="s1">&#39;Kids&#39;</span><span class="p">,</span>
    <span class="s1">&#39;UR&#39;</span><span class="p">:</span> <span class="s1">&#39;Adults&#39;</span><span class="p">,</span>
    <span class="s1">&#39;NC-17&#39;</span><span class="p">:</span> <span class="s1">&#39;Adults&#39;</span>
<span class="p">}</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;target_ages&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;rating&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">ratings_ages</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;target_ages&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>

<span class="c1"># Genre</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;genre&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;listed_in&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span>  <span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39; ,&#39;</span><span class="p">,</span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;, &#39;</span><span class="p">,</span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;,&#39;</span><span class="p">))</span> 

<span class="c1"># Reducing name length</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;United States&#39;</span><span class="p">,</span> <span class="s1">&#39;USA&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;United Kingdom&#39;</span><span class="p">,</span> <span class="s1">&#39;UK&#39;</span><span class="p">,</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;South Korea&#39;</span><span class="p">,</span> <span class="s1">&#39;S. Korea&#39;</span><span class="p">,</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;first_country&#39;</span><span class="p">)[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">10</span><span class="p">]</span>

<span class="c1"># Plot</span>

<span class="n">color_map</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;#f5f5f1&#39;</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
<span class="n">color_map</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">color_map</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">color_map</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span>  <span class="s1">&#39;#b20710&#39;</span> <span class="c1"># color highlight</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> 
       <span class="n">edgecolor</span><span class="o">=</span><span class="s1">&#39;darkgray&#39;</span><span class="p">,</span>
       <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="n">color_map</span><span class="p">)</span>

<span class="c1">#annotations</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">150</span><span class="p">),</span> <span class="c1"># I like to set this offset to roughly 5% of the tallest bar</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>



<span class="c1"># Remove border from plot</span>

<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
    
<span class="c1"># Tick labels</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">index</span><span class="p">)))</span> <span class="c1"># pin tick positions before setting labels</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="c1"># Title and sub-title</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.09</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;Top 10 countries on Netflix&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.09</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">,</span> <span class="s1">&#39;The three most frequent countries have been highlighted.&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">1.01</span><span class="p">,</span> <span class="s1">&#39;Insight&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">0.67</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;
</span><span class="s1">The most prolific producers of
</span><span class="s1">content for Netflix are, primarily,
</span><span class="s1">the USA, with India and the UK
</span><span class="s1">a significant distance behind.
</span><span class="s1">
</span><span class="s1">It makes sense that the USA produces 
</span><span class="s1">the most content as, after all, 
</span><span class="s1">Netflix is a US company.
</span><span class="s1">&#39;&#39;&#39;</span>
         <span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span><span class="p">)</span>   

<span class="n">grid_y_ticks</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4000</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span> <span class="c1"># y ticks, min, max, then step</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticks</span><span class="p">(</span><span class="n">grid_y_ticks</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_axisbelow</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>

<span class="c1">#Axis labels</span>

<span class="c1">#plt.xlabel(&#34;Country&#34;, fontsize=12, fontweight=&#39;light&#39;, fontfamily=&#39;serif&#39;,loc=&#39;left&#39;,y=-1.5)</span>
<span class="c1">#plt.ylabel(&#34;Count&#34;, fontsize=12, fontweight=&#39;light&#39;, fontfamily=&#39;serif&#39;)</span>
 <span class="c1">#plt.legend(loc=&#39;upper right&#39;)</span>
    
<span class="c1"># thicken the bottom line if you want to</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.7</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">&#39;major&#39;</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>


<span class="kn">import</span> <span class="nn">matplotlib.lines</span> <span class="k">as</span> <span class="nn">lines</span>
<span class="n">l1</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">Line2D</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">transform</span><span class="o">=</span><span class="n">fig</span><span class="o">.</span><span class="n">transFigure</span><span class="p">,</span> <span class="n">figure</span><span class="o">=</span><span class="n">fig</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span><span class="n">lw</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">([</span><span class="n">l1</span><span class="p">])</span>

<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_15_0.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="25-各国家不同类别内容占比">2.5 Movie vs. TV Show share by country</h3>
<p>As predicted, the United States dominates.</p>
<p>The United Kingdom is also a top contender, though it still trails India somewhat.</p>
<p>How does the content differ across countries and regions?</p>
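<p>Before reproducing the full chart below, note that the per-country Movie/TV ratio can also be computed in one step with <code>pd.crosstab(..., normalize='index')</code>. A minimal sketch on toy data (the column names <code>first_country</code> and <code>type</code> follow the dataset used in this post; the rows are made up):</p>

```python
import pandas as pd

# Toy stand-in for the Netflix dataframe used in this post
df = pd.DataFrame({
    'first_country': ['India', 'India', 'India', 'South Korea', 'South Korea'],
    'type':          ['Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show'],
})

# Row-normalized cross-tabulation: each country's row sums to 1.0
ratio = pd.crosstab(df['first_country'], df['type'], normalize='index')
print(ratio)
```

<p>This is equivalent to the groupby/value_counts/unstack/divide chain used below, just more compact; the longer form is kept in the post because it also exposes the raw counts.</p>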
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">country_order</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;first_country&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()[:</span><span class="mi">11</span><span class="p">]</span><span class="o">.</span><span class="n">index</span>
<span class="n">data_q2q3</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;first_country&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;first_country&#39;</span><span class="p">)[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">country_order</span><span class="p">]</span>
<span class="n">data_q2q3</span><span class="p">[</span><span class="s1">&#39;sum&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_q2q3</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">data_q2q3_ratio</span> <span class="o">=</span> <span class="p">(</span><span class="n">data_q2q3</span><span class="o">.</span><span class="n">T</span> <span class="o">/</span> <span class="n">data_q2q3</span><span class="p">[</span><span class="s1">&#39;sum&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">T</span><span class="p">[[</span><span class="s1">&#39;Movie&#39;</span><span class="p">,</span> <span class="s1">&#39;TV Show&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s1">&#39;Movie&#39;</span><span class="p">,</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>




<span class="c1">###</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">8</span><span class="p">),)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">data_q2q3_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">],</span> 
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;Movie&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">data_q2q3_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">],</span> <span class="n">left</span><span class="o">=</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">],</span> 
        <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;TV Show&#39;</span><span class="p">)</span>


<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">([])</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">data_q2q3_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>

<span class="c1"># Movie percentage labels</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">data_q2q3_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="s2">.3</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">data_q2q3_ratio</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="s2">.3</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">data_q2q3_ratio</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
    

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.93</span><span class="p">,</span> <span class="s1">&#39;Top 10 countries Movie &amp; TV Show split&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>   
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.131</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">,</span> <span class="s1">&#39;Percent Stacked Bar Chart&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span><span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>   

<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span> <span class="s1">&#39;bottom&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
    
<span class="c1">#ax.legend(loc=&#39;lower center&#39;, ncol=3, bbox_to_anchor=(0.5, -0.06))</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.75</span><span class="p">,</span><span class="mf">0.9</span><span class="p">,</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.81</span><span class="p">,</span><span class="mf">0.9</span><span class="p">,</span><span class="s2">&#34;|&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.82</span><span class="p">,</span><span class="mf">0.9</span><span class="p">,</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">)</span>


<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">0.93</span><span class="p">,</span> <span class="s1">&#39;Insight&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">0.44</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;
</span><span class="s1">Interestingly, Netflix in India
</span><span class="s1">is made up nearly entirely of Movies. 
</span><span class="s1">
</span><span class="s1">Bollywood is big business, and perhaps
</span><span class="s1">the main focus of this industry is Movies
</span><span class="s1">and not TV Shows.
</span><span class="s1">
</span><span class="s1">South Korean Netflix on the other hand is 
</span><span class="s1">almost entirely TV Shows.
</span><span class="s1">
</span><span class="s1">The underlying reasons for the difference 
</span><span class="s1">in content must be due to market research
</span><span class="s1">conducted by Netflix.
</span><span class="s1">&#39;&#39;&#39;</span>
         <span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>



<span class="kn">import</span> <span class="nn">matplotlib.lines</span> <span class="k">as</span> <span class="nn">lines</span>
<span class="n">l1</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">Line2D</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">transform</span><span class="o">=</span><span class="n">fig</span><span class="o">.</span><span class="n">transFigure</span><span class="p">,</span> <span class="n">figure</span><span class="o">=</span><span class="n">fig</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">,</span><span class="n">lw</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">([</span><span class="n">l1</span><span class="p">])</span>




<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">&#39;major&#39;</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_17_0.png" alt=""  />
</p>
<br>
<p>As I noted in the insight next to the plot, it is genuinely interesting to see how the Movie/TV Show split varies by country.</p>
<p>South Korean Netflix is dominated by TV Shows. I'm a huge fan of Korean cinema, and I know they also have an excellent selection of films.</p>
<p>Likewise, India is dominated by Movies. I suspect this is down to Bollywood; if you have other ideas, please comment below!</p>
<p><br><br></p>
<h3 id="26-评分">2.6 Ratings</h3>
<p>Let's take a quick look at how the ratings are distributed.</p>
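<p>The chart below uses a small trick to draw a diverging ("pyramid") bar chart: TV Show counts are negated so their bars hang below the shared baseline while Movie bars rise above it. A minimal sketch of just that step, with hypothetical counts (the structure mirrors the <code>mf</code> table built below):</p>

```python
import pandas as pd

# Hypothetical rating counts per content type (made-up numbers)
mf = pd.DataFrame(
    {'TV-MA': [2000, 1100], 'TV-14': [1200, 700], 'R': [700, 0]},
    index=['Movie', 'TV Show'],
)

movie = mf.loc['Movie']    # plotted upwards as positive bars
tv = -mf.loc['TV Show']    # negated so the bars point downwards

# Passing both to ax.bar() on the same axes yields the diverging chart:
#   ax.bar(movie.index, movie); ax.bar(tv.index, tv)
print(tv)
```

<p>When annotating the negative bars, the sign has to be flipped back for display, which is why the code below formats the label as <code>-tv[i]</code>.</p>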
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">order</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;rating&#39;</span><span class="p">)[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">())</span>
<span class="n">rating_order</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">order</span><span class="p">[</span><span class="s1">&#39;rating&#39;</span><span class="p">])</span>

<span class="n">mf</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)[</span><span class="s1">&#39;rating&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)[</span><span class="n">rating_order</span><span class="p">]</span>

<span class="n">movie</span> <span class="o">=</span> <span class="n">mf</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">]</span>
<span class="n">tv</span> <span class="o">=</span> <span class="o">-</span> <span class="n">mf</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">]</span>


<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">movie</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">movie</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;Movie&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">tv</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">tv</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">&#39;TV Show&#39;</span><span class="p">)</span>
<span class="c1">#ax.set_ylim(-35, 50)</span>

<span class="c1"># Annotations</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tv</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="o">-</span><span class="n">tv</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">tv</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">60</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>   

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">movie</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">,</span> 
                   <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">movie</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">60</span><span class="p">),</span>
                   <span class="n">va</span> <span class="o">=</span> <span class="s1">&#39;center&#39;</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span><span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
                   <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#4a4a4a&#39;</span><span class="p">)</span>
    
 

<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span> <span class="s1">&#39;bottom&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">mf</span><span class="o">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticks</span><span class="p">([])</span>    

<span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.16</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;Rating distribution by Film &amp; TV Show&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.16</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">,</span> 
<span class="s1">&#39;&#39;&#39;We observe that some ratings are only applicable to Movies. 
</span><span class="s1">The most common for both Movies &amp; TV Shows are TV-MA and TV-14.
</span><span class="s1">&#39;&#39;&#39;</span>

<span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>


<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.755</span><span class="p">,</span><span class="mf">0.924</span><span class="p">,</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.815</span><span class="p">,</span><span class="mf">0.924</span><span class="p">,</span><span class="s2">&#34;|&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.825</span><span class="p">,</span><span class="mf">0.924</span><span class="p">,</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_19_0.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="27-这些年来内容是如何添加的">2.7 How has content been added over the years?</h3>
<p>As the timeline at the start of this analysis showed, Netflix went global in 2016, and the growth in movie content since then has been striking.</p>
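<p>The plot below relies on <code>year_added</code> and <code>month_name_added</code> columns, presumably derived earlier in the notebook from <code>date_added</code>. As a reminder, a minimal sketch of that derivation (assuming <code>date_added</code> holds strings like &quot;September 9, 2019&quot;, as in the Kaggle Netflix dataset; the tiny frame here is illustrative only):</p>

```python
import pandas as pd

# Toy frame standing in for the Netflix data (column names assumed)
df_demo = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie'],
    'date_added': ['September 9, 2019', 'January 1, 2020', 'March 15, 2019'],
})

# Parse the date strings, then pull out the year and the month name
dates = pd.to_datetime(df_demo['date_added'])
df_demo['year_added'] = dates.dt.year
df_demo['month_name_added'] = dates.dt.month_name()

# Per-year counts, in chronological order -- the series the chart plots
print(df_demo['year_added'].value_counts().sort_index())
```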
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">color</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;#b20710&#34;</span><span class="p">,</span> <span class="s2">&#34;#221f1f&#34;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">mtv</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">index</span><span class="p">):</span>
    <span class="n">mtv_rel</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">mtv</span><span class="p">][</span><span class="s1">&#39;year_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">mtv_rel</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">mtv_rel</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">mtv</span><span class="p">)</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">mtv_rel</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">mtv_rel</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
    
<span class="n">ax</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">tick_right</span><span class="p">()</span>
    
<span class="n">ax</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.7</span><span class="p">)</span>

<span class="c1">#ax.set_ylim(0, 50)</span>
<span class="c1">#ax.legend(loc=&#39;upper left&#39;)</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span><span class="s1">&#39;left&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">2008</span><span class="p">,</span><span class="mi">2020</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">2008</span><span class="p">,</span> <span class="mi">2021</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="s1">&#39;Movies &amp; TV Shows added over time&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.59</span><span class="p">,</span> 
<span class="s1">&#39;&#39;&#39;We see a slow start for Netflix over several years. 
</span><span class="s1">Things begin to pick up in 2015 and then there is a 
</span><span class="s1">rapid increase from 2016.
</span><span class="s1">
</span><span class="s1">It looks like content additions have slowed down in 2020, 
</span><span class="s1">likely due to the COVID-19 pandemic.
</span><span class="s1">&#39;&#39;&#39;</span>

<span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>


<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.19</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;|&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_21_0.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="28-我们可以查看相同的图但作为累积总数">2.8 We can view the same plot as a cumulative total&hellip;</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data_sub</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)[</span><span class="s1">&#39;year_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">,</span><span class="s1">&#39;Movie&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">T</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">color</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;#b20710&#34;</span><span class="p">,</span> <span class="s2">&#34;#221f1f&#34;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">mtv</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">index</span><span class="p">):</span>
    <span class="n">mtv_rel</span> <span class="o">=</span> <span class="n">data_sub</span><span class="p">[</span><span class="n">mtv</span><span class="p">]</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">mtv_rel</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">mtv_rel</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">mtv</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
    

    
<span class="n">ax</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">tick_right</span><span class="p">()</span>
    
<span class="n">ax</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.7</span><span class="p">)</span>

<span class="c1">#ax.set_ylim(0, 50)</span>
<span class="c1">#ax.legend(loc=&#39;upper left&#39;)</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span><span class="s1">&#39;left&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">2008</span><span class="p">,</span><span class="mi">2020</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">2008</span><span class="p">,</span> <span class="mi">2021</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="s1">&#39;Movies &amp; TV Shows added over time [Cumulative Total]&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> 
<span class="s1">&#39;&#39;&#39;Netflix peak global content amount was in 2019.
</span><span class="s1">
</span><span class="s1">It appears that Netflix has focused more attention
</span><span class="s1">on increasing Movie content than TV Shows. 
</span><span class="s1">Movies have increased much more dramatically
</span><span class="s1">than TV shows.
</span><span class="s1">&#39;&#39;&#39;</span>

<span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>



<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.19</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;|&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span><span class="mf">0.2</span><span class="p">,</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>


<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_23_0.png" alt=""  />
</p>
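<p>The one-liner that builds <code>data_sub</code> above chains several pandas steps. A small worked example on toy data (not the real dataset) shows what each stage produces:</p>

```python
import pandas as pd

# Toy version of the Netflix frame: one row per title
df_demo = pd.DataFrame({
    'type':       ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'year_added': [2018,    2019,    2018,      2019,    2019],
})

# Count titles per (type, year) and fill missing combinations with 0 ...
counts = df_demo.groupby('type')['year_added'].value_counts().unstack().fillna(0)
# ... then accumulate down the rows (axis=0), so the 'Movie' row becomes
# TV Show + Movie, and transpose so years form the index for plotting
data_sub = counts.loc[['TV Show', 'Movie']].cumsum(axis=0).T

print(data_sub)
```

<p>Because of the <code>cumsum</code> across types, the <code>Movie</code> column holds the combined total; drawing the Movie area first and the smaller TV Show area on top of it is what produces the stacked look in the chart.</p>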
<p><br><br></p>
<h3 id="28-逐月">2.9 By month</h3>
<p>We have seen how content has been added over the years, but are there certain months that, on average, tend to see more content added?</p>
<p>I will show this in a couple of ways: a cumulative yearly view, and a radial plot&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">month_order</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;January&#39;</span><span class="p">,</span>
 <span class="s1">&#39;February&#39;</span><span class="p">,</span>
 <span class="s1">&#39;March&#39;</span><span class="p">,</span>
 <span class="s1">&#39;April&#39;</span><span class="p">,</span>
 <span class="s1">&#39;May&#39;</span><span class="p">,</span>
 <span class="s1">&#39;June&#39;</span><span class="p">,</span>
 <span class="s1">&#39;July&#39;</span><span class="p">,</span>
 <span class="s1">&#39;August&#39;</span><span class="p">,</span>
 <span class="s1">&#39;September&#39;</span><span class="p">,</span>
 <span class="s1">&#39;October&#39;</span><span class="p">,</span>
 <span class="s1">&#39;November&#39;</span><span class="p">,</span>
 <span class="s1">&#39;December&#39;</span><span class="p">]</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;month_name_added&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;month_name_added&#39;</span><span class="p">],</span> <span class="n">categories</span><span class="o">=</span><span class="n">month_order</span><span class="p">,</span> <span class="n">ordered</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>


<span class="n">data_sub</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)[</span><span class="s1">&#39;month_name_added&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">,</span><span class="s1">&#39;Movie&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">T</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">color</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;#b20710&#34;</span><span class="p">,</span> <span class="s2">&#34;#221f1f&#34;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">mtv</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">index</span><span class="p">):</span>
    <span class="n">mtv_rel</span> <span class="o">=</span> <span class="n">data_sub</span><span class="p">[</span><span class="n">mtv</span><span class="p">]</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">mtv_rel</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">mtv_rel</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">mtv</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
    

    
<span class="n">ax</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">tick_right</span><span class="p">()</span>
    
<span class="n">ax</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s1">&#39;black&#39;</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">.4</span><span class="p">)</span>

<span class="c1">#ax.set_ylim(0, 50)</span>
<span class="c1">#ax.legend(loc=&#39;upper left&#39;)</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;right&#39;</span><span class="p">,</span><span class="s1">&#39;bottom&#39;</span><span class="p">,</span><span class="s1">&#39;left&#39;</span><span class="p">]:</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>

<span class="n">ax</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">data_sub</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">margins</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># remove white spaces next to margins</span>

<span class="c1">#ax.set_xlim(2008,2020)</span>
<span class="c1">#plt.xticks(np.arange(2008, 2021, 1))</span>

<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">,</span> <span class="s1">&#39;Content added by month [Cumulative Total]&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span> <span class="mf">0.905</span><span class="p">,</span> 
<span class="s2">&#34;The end &amp; beginnings of each year seem to be Netflix&#39;s preference for adding content.&#34;</span>

<span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;light&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">)</span>



<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.13</span><span class="p">,</span><span class="mf">0.855</span><span class="p">,</span><span class="s2">&#34;Movie&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#b20710&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.19</span><span class="p">,</span><span class="mf">0.855</span><span class="p">,</span><span class="s2">&#34;|&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;black&#39;</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span><span class="mf">0.855</span><span class="p">,</span><span class="s2">&#34;TV Show&#34;</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s2">&#34;bold&#34;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">)</span>


<span class="n">ax</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="sa">u</span><span class="s1">&#39;both&#39;</span><span class="p">,</span><span class="n">length</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_25_0.png" alt=""  />
</p>
<p><br><br></p>
<h3 id="29-有没有一种更有趣的方式来查看全年内容的添加情况">2.9 Is there a more engaging way to view content additions across the year?</h3>
<p>Sometimes a visualization should be eye-catching, and I think this one achieves that even if it is not the most precise: by highlighting certain months, the reader's attention is drawn exactly where we want it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">data_sub2</span> <span class="o">=</span> <span class="n">data_sub</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>  <span class="c1"># work on a copy so data_sub is not mutated</span>

<span class="n">data_sub2</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_sub2</span><span class="p">[</span><span class="s1">&#39;Movie&#39;</span><span class="p">]</span> <span class="o">+</span> <span class="n">data_sub2</span><span class="p">[</span><span class="s1">&#39;TV Show&#39;</span><span class="p">]</span>
<span class="n">data_sub2</span> <span class="o">=</span> <span class="n">data_sub2</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

<span class="n">df_polar</span> <span class="o">=</span> <span class="n">data_sub2</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s1">&#39;month_name_added&#39;</span><span class="p">,</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>


<span class="n">color_map</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;#221f1f&#39;</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">12</span><span class="p">)]</span>
<span class="n">color_map</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">color_map</span><span class="p">[</span><span class="mi">11</span><span class="p">]</span> <span class="o">=</span>  <span class="s1">&#39;#b20710&#39;</span> <span class="c1"># color highlight</span>


<span class="c1"># initialize the figure</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">polar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s1">&#39;off&#39;</span><span class="p">)</span>

<span class="c1"># Constants = parameters controlling the plot layout:</span>
<span class="n">upperLimit</span> <span class="o">=</span> <span class="mi">30</span>
<span class="n">lowerLimit</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">labelPadding</span> <span class="o">=</span> <span class="mi">30</span>

<span class="c1"># Compute the max in the dataset</span>
<span class="n">max_val</span> <span class="o">=</span> <span class="n">df_polar</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>  <span class="c1"># avoid shadowing the built-in max()</span>

<span class="c1"># Compute heights: a linear conversion of each value into the new coordinates.</span>
<span class="c1"># 0 in the dataset maps to the lowerLimit (1),</span>
<span class="c1"># and the dataset maximum maps to itself</span>
<span class="n">slope</span> <span class="o">=</span> <span class="p">(</span><span class="n">max_val</span> <span class="o">-</span> <span class="n">lowerLimit</span><span class="p">)</span> <span class="o">/</span> <span class="n">max_val</span>
<span class="n">heights</span> <span class="o">=</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">df_polar</span><span class="o">.</span><span class="n">Value</span> <span class="o">+</span> <span class="n">lowerLimit</span>

<span class="c1"># Compute the width of each bar. In total we have 2*Pi = 360°</span>
<span class="n">width</span> <span class="o">=</span> <span class="mi">2</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_polar</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>

<span class="c1"># Compute the angle each bar is centered on:</span>
<span class="n">indexes</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_polar</span><span class="o">.</span><span class="n">index</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="n">angles</span> <span class="o">=</span> <span class="p">[</span><span class="n">element</span> <span class="o">*</span> <span class="n">width</span> <span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">indexes</span><span class="p">]</span>

<span class="c1"># Draw bars</span>
<span class="n">bars</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span>
    <span class="n">x</span><span class="o">=</span><span class="n">angles</span><span class="p">,</span> 
    <span class="n">height</span><span class="o">=</span><span class="n">heights</span><span class="p">,</span> 
    <span class="n">width</span><span class="o">=</span><span class="n">width</span><span class="p">,</span> 
    <span class="n">bottom</span><span class="o">=</span><span class="n">lowerLimit</span><span class="p">,</span>
    <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> 
    <span class="n">edgecolor</span><span class="o">=</span><span class="s2">&#34;white&#34;</span><span class="p">,</span>
    <span class="n">color</span><span class="o">=</span><span class="n">color_map</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span>
<span class="p">)</span>

<span class="c1"># Add labels</span>
<span class="k">for</span> <span class="n">bar</span><span class="p">,</span> <span class="n">angle</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">bars</span><span class="p">,</span><span class="n">angles</span><span class="p">,</span> <span class="n">heights</span><span class="p">,</span> <span class="n">df_polar</span><span class="p">[</span><span class="s2">&#34;month_name_added&#34;</span><span class="p">]):</span>

    <span class="c1"># Labels are rotated. Rotation must be specified in degrees :(</span>
    <span class="n">rotation</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">rad2deg</span><span class="p">(</span><span class="n">angle</span><span class="p">)</span>

    <span class="c1"># Flip some labels upside down</span>
    <span class="n">alignment</span> <span class="o">=</span> <span class="s2">&#34;&#34;</span>
    <span class="k">if</span> <span class="n">angle</span> <span class="o">&gt;=</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">/</span><span class="mi">2</span> <span class="ow">and</span> <span class="n">angle</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">/</span><span class="mi">2</span><span class="p">:</span>
        <span class="n">alignment</span> <span class="o">=</span> <span class="s2">&#34;right&#34;</span>
        <span class="n">rotation</span> <span class="o">=</span> <span class="n">rotation</span> <span class="o">+</span> <span class="mi">180</span>
    <span class="k">else</span><span class="p">:</span> 
        <span class="n">alignment</span> <span class="o">=</span> <span class="s2">&#34;left&#34;</span>

    <span class="c1"># Finally add the labels</span>
    <span class="n">ax</span><span class="o">.</span><span class="n">text</span><span class="p">(</span>
        <span class="n">x</span><span class="o">=</span><span class="n">angle</span><span class="p">,</span> 
        <span class="n">y</span><span class="o">=</span><span class="n">lowerLimit</span> <span class="o">+</span> <span class="n">bar</span><span class="o">.</span><span class="n">get_height</span><span class="p">()</span> <span class="o">+</span> <span class="n">labelPadding</span><span class="p">,</span> 
        <span class="n">s</span><span class="o">=</span><span class="n">label</span><span class="p">,</span> 
        <span class="n">ha</span><span class="o">=</span><span class="n">alignment</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span>
        <span class="n">va</span><span class="o">=</span><span class="s1">&#39;center&#39;</span><span class="p">,</span> 
        <span class="n">rotation</span><span class="o">=</span><span class="n">rotation</span><span class="p">,</span> 
        <span class="n">rotation_mode</span><span class="o">=</span><span class="s2">&#34;anchor&#34;</span><span class="p">)</span> 
</code></pre></div><p><img loading="lazy" src="img/output_27_0.png" alt=""  />
</p>
<p>Yes, December and January are clearly the best months for new content. Perhaps Netflix knows people have a lot of time off in this period, making it a good window to hook viewers?</p>
<p>February is the worst month. Why might that be? Ideas are welcome!</p>
<p><br><br></p>
<h3 id="210-电影类型">2.10 Movie genres</h3>
<p>Now let's explore movie genres a little&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Genres</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MultiLabelBinarizer</span> 

<span class="kn">import</span> <span class="nn">matplotlib.colors</span>


<span class="c1"># Custom colour map based on Netflix palette</span>
<span class="n">cmap</span> <span class="o">=</span> <span class="n">matplotlib</span><span class="o">.</span><span class="n">colors</span><span class="o">.</span><span class="n">LinearSegmentedColormap</span><span class="o">.</span><span class="n">from_list</span><span class="p">(</span><span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;#221f1f&#39;</span><span class="p">,</span> <span class="s1">&#39;#b20710&#39;</span><span class="p">,</span><span class="s1">&#39;#f5f5f1&#39;</span><span class="p">])</span>



<span class="k">def</span> <span class="nf">genre_heatmap</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="n">df</span><span class="p">[</span><span class="s1">&#39;genre&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;listed_in&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span>  <span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39; ,&#39;</span><span class="p">,</span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;, &#39;</span><span class="p">,</span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;,&#39;</span><span class="p">))</span> 
    <span class="n">Types</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;genre&#39;</span><span class="p">]:</span> <span class="n">Types</span> <span class="o">+=</span> <span class="n">i</span>
    <span class="n">Types</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">Types</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;There are </span><span class="si">{}</span><span class="s2"> types in the Netflix </span><span class="si">{}</span><span class="s2"> Dataset&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">Types</span><span class="p">),</span><span class="n">title</span><span class="p">))</span>    
    <span class="n">test</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;genre&#39;</span><span class="p">]</span>
    <span class="n">mlb</span> <span class="o">=</span> <span class="n">MultiLabelBinarizer</span><span class="p">()</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">mlb</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">test</span><span class="p">),</span> <span class="n">columns</span><span class="o">=</span><span class="n">mlb</span><span class="o">.</span><span class="n">classes_</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">test</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
    <span class="n">corr</span> <span class="o">=</span> <span class="n">res</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span>
    <span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">)</span>
    <span class="n">mask</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">triu_indices_from</span><span class="p">(</span><span class="n">mask</span><span class="p">)]</span> <span class="o">=</span> <span class="kc">True</span>
    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
    <span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">.54</span><span class="p">,</span><span class="mf">.88</span><span class="p">,</span><span class="s1">&#39;Genre correlation&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">fontweight</span><span class="o">=</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
    <span class="n">fig</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">.75</span><span class="p">,</span><span class="mf">.665</span><span class="p">,</span>
            <span class="s1">&#39;&#39;&#39;
</span><span class="s1">             It is interesting that Independent Movies
</span><span class="s1">             tend to be Dramas. 
</span><span class="s1">             
</span><span class="s1">             Another observation is that 
</span><span class="s1">             International Movies are rarely
</span><span class="s1">             in the Children&#39;s genre.
</span><span class="s1">             &#39;&#39;&#39;</span><span class="p">,</span> <span class="n">fontfamily</span><span class="o">=</span><span class="s1">&#39;serif&#39;</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span><span class="n">ha</span><span class="o">=</span><span class="s1">&#39;right&#39;</span><span class="p">)</span>
    <span class="n">pl</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">,</span> <span class="n">vmax</span><span class="o">=</span><span class="mf">.3</span><span class="p">,</span> <span class="n">vmin</span><span class="o">=-</span><span class="mf">.3</span><span class="p">,</span> <span class="n">center</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">square</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidths</span><span class="o">=</span><span class="mf">2.5</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
    
    
<span class="n">df_tv</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">&#34;type&#34;</span><span class="p">]</span> <span class="o">==</span> <span class="s2">&#34;TV Show&#34;</span><span class="p">]</span>
<span class="n">df_movies</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">&#34;type&#34;</span><span class="p">]</span> <span class="o">==</span> <span class="s2">&#34;Movie&#34;</span><span class="p">]</span>


<span class="n">genre_heatmap</span><span class="p">(</span><span class="n">df_movies</span><span class="p">,</span> <span class="s1">&#39;Movie&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><pre><code>There are 20 types in the Netflix Movie Dataset
</code></pre>
<p><img loading="lazy" src="img/output_29_1.png" alt=""  />
</p>
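The `genre_heatmap` function above relies on `MultiLabelBinarizer` to one-hot encode the multi-genre lists before correlating them. A minimal sketch on made-up titles shows what that intermediate table looks like:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy multi-label genre lists (one list per title; titles are invented)
genres = pd.Series([
    ["Dramas", "Independent Movies"],
    ["Dramas", "International Movies"],
    ["Children & Family", "Comedies"],
])

mlb = MultiLabelBinarizer()
# One row per title, one 0/1 column per genre label
onehot = pd.DataFrame(mlb.fit_transform(genres),
                      columns=mlb.classes_, index=genres.index)
print(onehot)
# Pairwise correlations between the 0/1 columns feed the heatmap
print(onehot.corr())
```

The masked upper triangle in the real plot simply hides the mirror half of this symmetric correlation matrix.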
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>CAR2023 | Text Analysis in Accounting</title>
      <link>https://textdata.cn/blog/2023-08-26-text-analysis-in-accounting/</link>
      <pubDate>Sat, 26 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-26-text-analysis-in-accounting/</guid>
      <description>&lt;h2 id=&#34;一文本分析在会计领域中的应用&#34;&gt;1. Text Analysis in Accounting&lt;/h2&gt;
&lt;p&gt;Bochkay, Khrystyna, Stephen V. Brown, Andrew J. Leone, and Jennifer Wu Tucker. &amp;ldquo;Textual analysis in accounting: What&amp;rsquo;s next?.&amp;rdquo; &lt;em&gt;Contemporary accounting research&lt;/em&gt; 40, no. 2 (2023): 765-805.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;11-摘要&#34;&gt;1.1 Abstract&lt;/h3&gt;
&lt;p&gt;Natural language is a key form of business communication. Textual analysis refers to processing text data with natural language processing (NLP) techniques to obtain measures (information) of interest. We survey publications in top accounting journals and describe the trends and current state of accounting textual analysis. We organize the available NLP methods in a unified framework. Accounting researchers often use textual analysis to measure disclosure sentiment, readability, and disclosure quantity; to compare disclosures for similarities or differences; to identify forward-looking information; and to detect topics. For each task we explain both traditional approaches and newer approaches based on machine learning, especially deep learning. We discuss how to establish the construct validity of text-based measures and the typical decisions researchers face when implementing NLP models. Finally, we discuss opportunities for future research. We conclude that (i) textual analysis has grown into an important research method, and (ii) accounting researchers should increase their knowledge and use of machine learning, especially deep learning, for textual analysis.&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-发文趋势&#34;&gt;1.2 Publication Trends&lt;/h3&gt;
&lt;p&gt;Text-analysis publication counts in the top accounting journals are shown below&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Accounting Review (TAR),&lt;/p&gt;
&lt;p&gt;Journal of Accounting Research (JAR),&lt;/p&gt;
&lt;p&gt;Journal of Accounting and Economics (JAE),&lt;/p&gt;
&lt;p&gt;Contemporary Accounting Research (CAR),&lt;/p&gt;
&lt;p&gt;Review of Accounting Studies (RAST),&lt;/p&gt;
&lt;p&gt;Accounting, Organizations, and Society (AOS),&lt;/p&gt;
&lt;p&gt;Management Science (MS).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/fig-1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/fig-2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-数据源所用指标&#34;&gt;1.3 Data Sources &amp;amp; Measures&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/fig-3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二文本分析操作方法&#34;&gt;2. How to Conduct Text Analysis&lt;/h2&gt;
&lt;p&gt;Step-by-step guides for each text-analysis approach&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Data acquisition &amp;amp; preprocessing&lt;/li&gt;
&lt;li&gt;Dictionary selection (construction)&lt;/li&gt;
&lt;li&gt;Supervised machine learning&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;h3 id=&#34;21-数据获取预处理&#34;&gt;2.1 Data Acquisition &amp;amp; Preprocessing&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data acquisition&lt;/strong&gt; Collect documents manually or with crawlers from sources such as EDGAR, company websites, and social media&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data cleaning&lt;/strong&gt; Remove HTML tags, non-text characters, and special characters (such as &amp;amp; ￥ $)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tokenization&lt;/strong&gt; Split the text into word-level units&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document filtering&lt;/strong&gt; Drop documents whose character count is too low&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stopwords&lt;/strong&gt; Remove stopwords (Chinese examples: 的 他 呢 了 地; English examples: the, in, a)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stemming &amp;amp; lemmatization&lt;/strong&gt; Normalize variants such as increasing, increases, and increased to increase&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-词典选择构建步骤&#34;&gt;2.2 Dictionary Selection (Construction) Steps&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Choose a dictionary&lt;/strong&gt; Pick one that fits the research goal; for sentiment analysis, for example, use positive and negative word lists.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Word counting&lt;/strong&gt; Decide whether to record a word&#39;s presence or its number of occurrences&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Word weighting&lt;/strong&gt; Decide whether all counts carry equal weight, or whether some words or phrases should be weighted higher or lower (e.g., more common words get lower weight).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dictionary validation&lt;/strong&gt; Compare the dictionary&#39;s performance at identifying the relevant content against human annotators.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Define the measure&lt;/strong&gt; Determine the scalar for the final variable of interest (e.g., the total word count of the document)&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-监督机器学习步骤&#34;&gt;2.3 Supervised Machine Learning Steps&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Outcome variable&lt;/strong&gt; Decide how to represent the variable of interest: (i) continuous or (ii) categorical&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Annotated dataset&lt;/strong&gt; Collect labeled samples (e.g., labeled words, sentences, paragraphs, or articles). Annotation platforms include Prodigy, Amazon Mechanical Turk, TagEditor, SMART, and piaf&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train/validation/test split&lt;/strong&gt; Split the annotated data into subsamples for training, validation, and testing, making sure every class of the variable of interest is well represented&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model selection&lt;/strong&gt; For deep learning, choose the model (e.g., BERT) and whether to fine-tune it; for traditional machine learning, choose a specific model (e.g., NB, SVM, or RF)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Text feature engineering&lt;/strong&gt; Use bag-of-words or word embeddings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model evaluation&lt;/strong&gt; Choose metrics for model performance, such as accuracy, precision, recall, F-score, and ROC-AUC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model fitting&lt;/strong&gt; Fit the model on the annotated data, check performance on the validation data, and decide whether more annotated examples are needed. This is an iterative process&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Define the measure&lt;/strong&gt; Determine the scalar for the final variable of interest (e.g., the total word count of the document)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一文本分析在会计领域中的应用">1. Text Analysis in Accounting</h2>
<p>Bochkay, Khrystyna, Stephen V. Brown, Andrew J. Leone, and Jennifer Wu Tucker. &ldquo;Textual analysis in accounting: What&rsquo;s next?.&rdquo; <em>Contemporary accounting research</em> 40, no. 2 (2023): 765-805.</p>
<br>
<h3 id="11-摘要">1.1 Abstract</h3>
<p>Natural language is a key form of business communication. Textual analysis refers to processing text data with natural language processing (NLP) techniques to obtain measures (information) of interest. We survey publications in top accounting journals and describe the trends and current state of accounting textual analysis. We organize the available NLP methods in a unified framework. Accounting researchers often use textual analysis to measure disclosure sentiment, readability, and disclosure quantity; to compare disclosures for similarities or differences; to identify forward-looking information; and to detect topics. For each task we explain both traditional approaches and newer approaches based on machine learning, especially deep learning. We discuss how to establish the construct validity of text-based measures and the typical decisions researchers face when implementing NLP models. Finally, we discuss opportunities for future research. We conclude that (i) textual analysis has grown into an important research method, and (ii) accounting researchers should increase their knowledge and use of machine learning, especially deep learning, for textual analysis.</p>
<br>
<h3 id="12-发文趋势">1.2 Publication Trends</h3>
<p>Text-analysis publication counts in the top accounting journals are shown below</p>
<blockquote>
<p>The Accounting Review (TAR),</p>
<p>Journal of Accounting Research (JAR),</p>
<p>Journal of Accounting and Economics (JAE),</p>
<p>Contemporary Accounting Research (CAR),</p>
<p>Review of Accounting Studies (RAST),</p>
<p>Accounting, Organizations, and Society (AOS),</p>
<p>Management Science (MS).</p>
</blockquote>
<p><img loading="lazy" src="img/fig-1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/fig-2.png" alt=""  />
</p>
<br>
<h3 id="13-数据源所用指标">1.3 Data Sources &amp; Measures</h3>
<p><img loading="lazy" src="img/fig-3.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="二文本分析操作方法">2. How to Conduct Text Analysis</h2>
<p>Step-by-step guides for each text-analysis approach</p>
<ol>
<li>Data acquisition &amp; preprocessing</li>
<li>Dictionary selection (construction)</li>
<li>Supervised machine learning</li>
</ol>
<br>
<h3 id="21-数据获取预处理">2.1 Data Acquisition &amp; Preprocessing</h3>
<ol>
<li><strong>Data acquisition</strong> Collect documents manually or with crawlers from sources such as EDGAR, company websites, and social media</li>
<li><strong>Data cleaning</strong> Remove HTML tags, non-text characters, and special characters (such as &amp; ￥ $)</li>
<li><strong>Tokenization</strong> Split the text into word-level units</li>
<li><strong>Document filtering</strong> Drop documents whose character count is too low</li>
<li><strong>Stopwords</strong> Remove stopwords (Chinese examples: 的 他 呢 了 地; English examples: the, in, a)</li>
<li><strong>Stemming &amp; lemmatization</strong> Normalize variants such as increasing, increases, and increased to increase</li>
</ol>
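The six steps above can be sketched in a few lines of Python. This is a minimal illustration only: the stopword set, the sample document, and the `preprocess` helper are invented for demonstration, and real studies would use jieba for Chinese or NLTK/spaCy for English, plus a proper stemmer for step 6.

```python
import html
import re

STOPWORDS = {"the", "in", "a", "of", "and"}  # toy stopword list

def preprocess(raw_html, min_tokens=3):
    """Clean -> tokenize -> filter short docs -> remove stopwords."""
    text = html.unescape(raw_html)               # &amp; -> & etc.
    text = re.sub(r"<[^>]+>", " ", text)         # step 2: strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # step 2: drop non-text characters
    tokens = text.lower().split()                # step 3: word-level tokens
    if len(tokens) < min_tokens:                 # step 4: drop short documents
        return None
    return [t for t in tokens if t not in STOPWORDS]  # step 5: stopwords

doc = "<p>Revenue increased in the fourth quarter &amp; margins improved.</p>"
print(preprocess(doc))
# Step 6 (stemming/lemmatization) would then map increased -> increase,
# e.g. with nltk.stem.PorterStemmer
```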
<br>
<h3 id="22-词典选择构建步骤">2.2 Dictionary Selection (Construction) Steps</h3>
<ol>
<li><strong>Choose a dictionary</strong> Pick one that fits the research goal; for sentiment analysis, for example, use positive and negative word lists.</li>
<li><strong>Word counting</strong> Decide whether to record a word's presence or its number of occurrences</li>
<li><strong>Word weighting</strong> Decide whether all counts carry equal weight, or whether some words or phrases should be weighted higher or lower (e.g., more common words get lower weight).</li>
<li><strong>Dictionary validation</strong> Compare the dictionary's performance at identifying the relevant content against human annotators.</li>
<li><strong>Define the measure</strong> Determine the scalar for the final variable of interest (e.g., the total word count of the document)</li>
</ol>
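The counting and scaling steps can be sketched with toy word lists. The lists and the length-scaling choice below are illustrative only; published accounting work typically uses curated dictionaries such as the Loughran-McDonald word lists.

```python
# Illustrative positive/negative word lists (not a real research dictionary)
POS = {"growth", "improve", "gain", "strong"}
NEG = {"loss", "decline", "risk", "weak"}

def tone(tokens):
    """Net tone scaled by document length (the 'scalar' choice)."""
    pos = sum(t in POS for t in tokens)   # occurrence counts
    neg = sum(t in NEG for t in tokens)   # (equal weights here; weighting could
    total = len(tokens)                   #  instead down-weight common words)
    return (pos - neg) / total if total else 0.0

doc = "strong growth offset a small decline in margins".split()
print(tone(doc))  # (2 positive - 1 negative) / 8 tokens = 0.125
```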
<br>
<h3 id="23-监督机器学习步骤">2.3 Supervised Machine Learning Steps</h3>
<ol>
<li><strong>Outcome variable</strong> Decide how to represent the variable of interest: (i) continuous or (ii) categorical</li>
<li><strong>Annotated dataset</strong> Collect labeled samples (e.g., labeled words, sentences, paragraphs, or articles). Annotation platforms include Prodigy, Amazon Mechanical Turk, TagEditor, SMART, and piaf</li>
<li><strong>Train/validation/test split</strong> Split the annotated data into subsamples for training, validation, and testing, making sure every class of the variable of interest is well represented</li>
<li><strong>Model selection</strong> For deep learning, choose the model (e.g., BERT) and whether to fine-tune it; for traditional machine learning, choose a specific model (e.g., NB, SVM, or RF)</li>
<li><strong>Text feature engineering</strong> Use bag-of-words or word embeddings</li>
<li><strong>Model evaluation</strong> Choose metrics for model performance, such as accuracy, precision, recall, F-score, and ROC-AUC</li>
<li><strong>Model fitting</strong> Fit the model on the annotated data, check performance on the validation data, and decide whether more annotated examples are needed. This is an iterative process</li>
<li><strong>Define the measure</strong> Determine the scalar for the final variable of interest (e.g., the total word count of the document)</li>
</ol>
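With traditional machine learning, the workflow above can be sketched with scikit-learn. Everything here is invented for illustration: the toy labeled sentences and the 1/0 tone labels stand in for a real annotated dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A (toy) annotated dataset -- 1 = positive tone, 0 = negative
texts = ["revenue grew strongly", "profit declined sharply",
         "record earnings this quarter", "heavy losses hurt demand",
         "margins improved again", "sales dropped amid concerns"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# Train/test split, stratified so both classes are represented
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

# A traditional model (NB) on bag-of-words (TF-IDF) features
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Fit on the annotated data
clf.fit(X_train, y_train)

# Evaluate (accuracy here; precision/recall/F-score/ROC-AUC also apply)
print(accuracy_score(y_test, clf.predict(X_test)))
```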
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Pandas | Text Analysis of the Bio Field in the Executive xlsx Dataset</title>
      <link>https://textdata.cn/blog/2023-08-07-using-str-contains-method-to-judge-some-specific-content-in-excel/</link>
      <pubDate>Mon, 07 Aug 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-08-07-using-str-contains-method-to-judge-some-specific-content-in-excel/</guid>
      <description>&lt;h2 id=&#34;一高管数据集&#34;&gt;1. The Executive Dataset&lt;/h2&gt;
&lt;h3 id=&#34;11-介绍&#34;&gt;1.1 Introduction&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://textdata.cn/blog/2022-11-25-senior-manager-resume-dataset/&#34;&gt;Dataset | 900k executive records of Chinese listed companies&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;900k resumes of executives at Chinese listed companies, sourced from Sina Finance, covering the years &lt;strong&gt;1990-2021&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&#34;12-字段&#34;&gt;1.2 Fields&lt;/h3&gt;
&lt;p&gt;Most of the dataset fields are computed or derived from the 个人简历 (bio) field.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- ID
- 姓名
- 证券代码
- 统计截止日期
- 个人简历
- 国籍
- 籍贯
- 籍贯所在地区代码
- 出生地
- 出生地所在地区代码
- 性别
- 年龄
- 毕业院校
- 学历  1=中专及中专以下； 2=大专； 3=本科； 4=硕士研究生； 5=博士研究生； 6=其他（以其他形式公布的学历，如荣誉博士、函授等）； 7=MBA/EMBA
- 专业
- 职称
- 是否领取薪酬
- 报告期报酬总额
- 年末持股数
- 是否高管团队成员
- 是否董事会成员
- 是否独立董事
- 是否兼任董事长和CEO
- 是否监事
- 具体职务
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;13-应用价值&#34;&gt;1.3 Research Applications&lt;/h3&gt;
&lt;p&gt;Selected papers that use executive data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;何瑛,于文蕾,戴逸驰,王砚羽.高管职业经历与企业创新[J].管理世界,2019,35(11):174-192.&lt;/li&gt;
&lt;li&gt;杨林,和欣,顾红芳.高管团队经验、动态能力与企业战略突变：管理自主权的调节效应[J].管理世界,2020,36(06):168-188+201+252.&lt;/li&gt;
&lt;li&gt;周楷唐,麻志明,吴联生.高管学术经历与公司债务融资成本[J].经济研究,2017,52(07):169-183.&lt;/li&gt;
&lt;li&gt;陆瑶,张叶青,黎波,赵浩宇.高管个人特征与公司业绩——基于机器学习的经验证据[J].管理科学学报,2020,23(02):120-140.&lt;/li&gt;
&lt;li&gt;柳光强,孔高文.高管经管教育背景与企业内部薪酬差距[J].会计研究,2021,(03):110-121.&lt;/li&gt;
&lt;li&gt;郑建明,孙诗璐,李金甜.高管文化背景与企业债务成本——基于劳模文化的视角[J].会计研究,2021,(03):137-145.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二代码案例&#34;&gt;2. Code Walkthrough&lt;/h2&gt;
&lt;p&gt;Use Python to solve the following tasks, operating mainly on the executive bios&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the xlsx file (900k executive records)&lt;/li&gt;
&lt;li&gt;Check whether a bio contains a given word (e.g., find executives who studied at 清华大学)&lt;/li&gt;
&lt;li&gt;Rank universities by executive count&lt;/li&gt;
&lt;li&gt;Count how often a given word appears in each bio (e.g., occurrences of 大学)&lt;/li&gt;
&lt;li&gt;Extract each executive&#39;s birth year (with regular expressions)&lt;/li&gt;
&lt;li&gt;Count the number of dates mentioned in each bio
&amp;hellip;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;21-导入数据&#34;&gt;2.1 Load the Data&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;高管数据.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# drop rows whose 个人简历 (bio) field is missing&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dropna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;subset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;22-简介文本长度&#34;&gt;2.2 Bio Text Length&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# add a length column storing the bio text length&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df[&amp;#39;length&amp;#39;] = df[&amp;#39;个人简历&amp;#39;].str.len()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;0         161
1         154
2         395
3         306
4         335
         ... 
900882     40
900883     54
900884     71
900885     41
900886     62
Name: 个人简历, Length: 736970, dtype: int64
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-简介文本中是否含指定词语&#34;&gt;2.3 简介文本中是否含指定词语&lt;/h3&gt;
&lt;p&gt;例如找出有【清华大学】求学经历的高管。这里使用 &lt;code&gt;Series.str.contains()&lt;/code&gt; 方法检索某字段(Series)是否含有某个词&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;len(df[df[&#39;个人简历&#39;].str.contains(&#39;清华大学&#39;)])&lt;/code&gt; 保留有「清华大学」学习经历的高管&lt;/li&gt;
&lt;li&gt;&lt;code&gt;len(df[df[&#39;个人简历&#39;].str.contains(&#39;北京大学&#39;)])&lt;/code&gt; 保留有「北京大学」学习经历的高管&lt;/li&gt;
&lt;li&gt;&lt;code&gt;len(df[df[&#39;个人简历&#39;].str.contains(&#39;清华大学|北京大学&#39;)])&lt;/code&gt; 保留有「清华大学」或「北京大学」学习经历的高管&lt;/li&gt;
&lt;li&gt;&lt;code&gt;len(df[df[&#39;个人简历&#39;].str.contains(&#39;清华大学&#39;) &amp;amp; df[&#39;个人简历&#39;].str.contains(&#39;北京大学&#39;)])&lt;/code&gt; 保留同时有「清华大学」和「北京大学」学习经历的高管&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;第三个(北大清华)表达式的数量最多，等于前两者之和减去两者的交集(即第四个表达式的数量)；第四个表达式的数量最少。注意，逻辑【或|】【且&amp;amp;】条件可以串联任意多个&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#统计有【清华大学】学习经历的高管人数&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;清华大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;10377
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;北京大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;8709
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;清华大学|北京大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;18647
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;清华大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;北京大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;439
&lt;/code&gt;&lt;/pre&gt;
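上面四个计数正好可以用容斥原理互相核对：并集人数等于两者人数之和减去交集人数。下面的小函数是一个核对草稿（函数名为本文示意，沿用上文的 df；regex=False 表示按字面匹配）。

```python
def check_inclusion_exclusion(series, word_a, word_b):
    """核对 |A∪B| = |A| + |B| - |A∩B|，返回 (|A|, |B|, |A∪B|, |A∩B|)"""
    a = set(series.index[series.str.contains(word_a, regex=False)])
    b = set(series.index[series.str.contains(word_b, regex=False)])
    union = a | b
    both = a.intersection(b)
    assert len(union) == len(a) + len(b) - len(both)
    return len(a), len(b), len(union), len(both)
```

按正文数据，check_inclusion_exclusion(df['个人简历'], '清华大学', '北京大学') 应返回 (10377, 8709, 18647, 439)，即 18647 = 10377 + 8709 - 439。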
&lt;br&gt;
&lt;h3 id=&#34;24-大学高管数量排行榜&#34;&gt;2.4 大学高管数量排行榜&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#测试列表(凭记忆手动输入的大学，各位可以自己设计测试列表；注意「大连理工大大学」多打了一个「大」字，因此下文它的匹配人数为 0)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;test_universitys&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;清华大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;北京大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;中国人民大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;浙江大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                    &lt;span class=&#34;s1&#34;&gt;&amp;#39;上海交通大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;西安交通大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;同济大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;南开大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;天津大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                    &lt;span class=&#34;s1&#34;&gt;&amp;#39;武汉大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;华中科技大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;中国科学技术大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;南京大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                    &lt;span class=&#34;s1&#34;&gt;&amp;#39;中山大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;中南大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;四川大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;重庆大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;兰州大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;湖南大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                    &lt;span class=&#34;s1&#34;&gt;&amp;#39;山东大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;吉林大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;大连理工大大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;东北大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;北京航空航天大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;中国地质大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;


&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;大学高管人数排行&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;uni_infos&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;university&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;test_universitys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;num&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;contains&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;university&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;uni_infos&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;((&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;university&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;num&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    
&lt;span class=&#34;n&#34;&gt;uni_infos&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;sorted&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;uni_infos&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;key&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reverse&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;uni_infos&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;大学高管人数排行

[(&#39;清华大学&#39;, 10377),
 (&#39;北京大学&#39;, 8709),
 (&#39;中国人民大学&#39;, 7012),
 (&#39;浙江大学&#39;, 5816),
 (&#39;中山大学&#39;, 4065),
 (&#39;上海交通大学&#39;, 3844),
 (&#39;武汉大学&#39;, 3578),
 (&#39;南京大学&#39;, 3272),
 (&#39;西安交通大学&#39;, 2972),
 (&#39;南开大学&#39;, 2716),
 (&#39;湖南大学&#39;, 2502),
 (&#39;华中科技大学&#39;, 2356),
 (&#39;同济大学&#39;, 2089),
 (&#39;吉林大学&#39;, 2044),
 (&#39;四川大学&#39;, 1934),
 (&#39;山东大学&#39;, 1847),
 (&#39;中南大学&#39;, 1615),
 (&#39;天津大学&#39;, 1598),
 (&#39;重庆大学&#39;, 1440),
 (&#39;北京航空航天大学&#39;, 1334),
 (&#39;东北大学&#39;, 1241),
 (&#39;中国科学技术大学&#39;, 842),
 (&#39;兰州大学&#39;, 745),
 (&#39;中国地质大学&#39;, 437),
 (&#39;大连理工大大学&#39;, 0)]
&lt;/code&gt;&lt;/pre&gt;
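排行榜也可以封装成一个通用的小函数（函数名为本文示意，沿用上文的 df 与 test_universitys）。这里额外传入 regex=False 按字面匹配，即使关键词中出现正则特殊符号也不会被误解析。

```python
import pandas as pd


def rank_by_keyword(series, keywords):
    """统计每个关键词出现的行数(人数)，按人数降序返回一个 Series"""
    counts = {w: int(series.str.contains(w, regex=False).sum()) for w in keywords}
    return pd.Series(counts).sort_values(ascending=False)
```

rank_by_keyword(df['个人简历'], test_universitys) 得到的排序应与上文 uni_infos 一致。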
&lt;br&gt;
&lt;h3 id=&#34;25-统计文本中指定词语出现次数&#34;&gt;2.5 统计文本中指定词语出现次数&lt;/h3&gt;
&lt;p&gt;例如统计每位高管简历中【大学】出现的次数&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0         0
1         2
2         0
3         0
4         0
         ..
900882    0
900883    0
900884    0
900885    0
900886    0
Name: 个人简历, Length: 736970, dtype: int64
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;高管总人数: &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#简历中无「大学」字眼&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;无大学经历高管人数:&amp;#39;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]))&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#简历中有「大学」字眼&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;有大学经历高管人数:&amp;#39;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;高管总人数:  736970
无大学经历高管人数: 515172
有大学经历高管人数: 221798
&lt;/code&gt;&lt;/pre&gt;
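两类人数之和应当恰好等于总人数，可以用一个小函数顺手核对（函数名为本文示意，沿用上文的 df）。

```python
def split_by_word(series, word):
    """返回 (不含 word 的行数, 含 word 的行数)，并核对两者之和等于总行数"""
    counts = series.str.count(word)
    no_word = int((counts == 0).sum())
    has_word = int((counts != 0).sum())
    assert no_word + has_word == len(series)
    return no_word, has_word
```

按正文数据，split_by_word(df['个人简历'], '大学') 应返回 (515172, 221798)，而 515172 + 221798 = 736970。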
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#有些企业单位名字中带有「大学」，但这类企业非常少。&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#「大学」词语出现次数可以近似看做学习经历次数&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#如此， 1可以看做本科学历，2看做研究生学历， 3看做博士学历&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;大学&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value_counts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;normalize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kind&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;bar&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;output_14_1.png&#34; alt=&#34;png&#34;  /&gt;&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;26-找出每位高管的出生年份用正则表达式&#34;&gt;2.6 找出每位高管的出生年份(用正则表达式)&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;r&amp;#39;\d&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{4}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0                                                    [1969]
1                      [1965, 1984, 1986, 1990, 1994, 1995]
2         [1972, 1998, 1999, 2000, 2015, 2002, 2016, 200...
3         [1960, 1982, 1989, 1990, 1991, 1991, 2002, 200...
4         [1962, 2009, 1985, 1996, 1996, 2008, 1993, 200...
                                ...                        
900882                                                   []
900883                                                   []
900884                                                   []
900885                                                   []
900886                                                   []
Name: 个人简历, Length: 736970, dtype: object
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;birth_year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#返回出生年份&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;years&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;except&lt;/span&gt; &lt;span class=&#34;ne&#34;&gt;IndexError&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#没有年份的，返回0&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;
    
    
&lt;span class=&#34;c1&#34;&gt;#高管出生年份&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;r&amp;#39;\d&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{4}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;birth_year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0         1969
1         1965
2         1972
3         1960
4         1962
          ... 
900882       0
900883       0
900884       0
900885       0
900886       0
Name: 个人简历, Length: 736970, dtype: object
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
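findall 加 apply 的写法也可以用 str.extract 一步完成：直接抽取简历中出现的第一个四位数字，无匹配得到 NaN，再填 0 并转为整数（函数名与新列名 出生年份 均为本文示意）。

```python
def first_year(series):
    """抽取每段文本中出现的第一个四位数字，没有数字的记为 0"""
    years = series.str.extract(r'(\d{4})', expand=False)
    return years.fillna(0).astype(int)

# 用法示意：df['出生年份'] = first_year(df['个人简历'])
```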
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#高管时间点个数(感觉可以看做经历的个数)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;个人简历&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;findall&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;r&amp;#39;\d&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{4}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;lambda&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;set&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ys&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0          1
1          6
2         10
3         10
4          8
          ..
900882     0
900883     0
900884     0
900885     0
900886     0
Name: 个人简历, Length: 736970, dtype: int64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;获取代码&#34;&gt;获取代码&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;top-ceo.ipynb&#34;&gt;&lt;strong&gt;点击下载ipynb代码文件&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一高管数据集">一、高管数据集</h2>
<h3 id="11-介绍">1.1 介绍</h3>
<p><a href="https://textdata.cn/blog/2022-11-25-senior-manager-resume-dataset/">数据集 | 90w条中国上市公司高管数据</a></p>
<p>90w 条中国上市公司高管简历，数据来源为新浪财经，统计的日期范围为<strong>1990-2021</strong>年。</p>
<h3 id="12-字段">1.2 字段</h3>
<p>数据集包含的字段如下，大多是从「个人简历」中计算衍生出来的。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- ID
- 姓名
- 证券代码
- 统计截止日期
- 个人简历
- 国籍
- 籍贯
- 籍贯所在地区代码
- 出生地
- 出生地所在地区代码
- 性别
- 年龄
- 毕业院校
- 学历  1=中专及中专以下； 2=大专； 3=本科； 4=硕士研究生； 5=博士研究生； 6=其他（以其他形式公布的学历，如荣誉博士、函授等）； 7=MBA/EMBA
- 专业
- 职称
- 是否领取薪酬
- 报告期报酬总额
- 年末持股数
- 是否高管团队成员
- 是否董事会成员
- 是否独立董事
- 是否兼任董事长和CEO
- 是否监事
- 具体职务
</code></pre></div><br>
<h3 id="13-应用价值">1.3 应用价值</h3>
<p>这里粘贴部分应用高管数据论文</p>
<ul>
<li>何瑛,于文蕾,戴逸驰,王砚羽.高管职业经历与企业创新[J].管理世界,2019,35(11):174-192.</li>
<li>杨林,和欣,顾红芳.高管团队经验、动态能力与企业战略突变：管理自主权的调节效应[J].管理世界,2020,36(06):168-188+201+252.</li>
<li>周楷唐,麻志明,吴联生.高管学术经历与公司债务融资成本[J].经济研究,2017,52(07):169-183.</li>
<li>陆瑶,张叶青,黎波,赵浩宇.高管个人特征与公司业绩——基于机器学习的经验证据[J].管理科学学报,2020,23(02):120-140.</li>
<li>柳光强,孔高文.高管经管教育背景与企业内部薪酬差距[J].会计研究,2021,(03):110-121.</li>
<li>郑建明,孙诗璐,李金甜.高管文化背景与企业债务成本——基于劳模文化的视角[J].会计研究,2021,(03):137-145.</li>
</ul>
<p><br><br></p>
<h2 id="二代码案例">二、代码案例</h2>
<p>用Python实现以下六个技术点，主要对高管简介文本进行操作</p>
<ol>
<li>读取xlsx文件（90w高管数据）</li>
<li>简介文本中是否含指定词语(例如找出有【清华大学】求学经历的高管)</li>
<li>大学高管数量排行榜</li>
<li>统计文本中指定词语出现次数(例如统计每位高管内【大学】出现次数)</li>
<li>找出每位高管的出生年份(用正则表达式)</li>
<li>统计每位高管经历的时间点个数
&hellip;</li>
</ol>
<h3 id="21-导入数据">2.1 导入数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;高管数据.xlsx&#39;</span><span class="p">)</span>
<span class="c1">#剔除「个人简历」字段中的缺失值</span>
<span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><br>
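90 万行的 xlsx 文件首次读取会比较慢。下面是一个可选的缓存加载思路的小示例（函数名 load_resumes 与缓存文件名 高管数据.pkl 均为本文示意的假设写法）：首次读取并清洗后保存为 pickle，之后直接加载缓存。

```python
import os

import pandas as pd


def load_resumes(xlsx_path='高管数据.xlsx', cache_path='高管数据.pkl'):
    """读取高管数据；若缓存存在则直接加载，免去重复解析 xlsx 的开销"""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_excel(xlsx_path)
    # 与正文一致：剔除「个人简历」字段中的缺失值
    df = df.dropna(subset=['个人简历'])
    df.to_pickle(cache_path)
    return df
```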
<h3 id="22-简介文本长度">2.2 简介文本长度</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>

<span class="c1">#新增一个字段length，将简介文本长度保存到length中</span>
<span class="c1">#df[&#39;length&#39;] = df[&#39;个人简历&#39;].str.len()</span>
</code></pre></div><pre><code>0         161
1         154
2         395
3         306
4         335
         ... 
900882     40
900883     54
900884     71
900885     41
900886     62
Name: 个人简历, Length: 736970, dtype: int64
</code></pre>
<br>
<h3 id="23-简介文本中是否含指定词语">2.3 简介文本中是否含指定词语</h3>
<p>例如找出有【清华大学】求学经历的高管。这里使用 <code>Series.str.contains()</code> 方法检索某字段(Series)是否含有某个词</p>
<ul>
<li><code>len(df[df['个人简历'].str.contains('清华大学')])</code> 保留有「清华大学」学习经历的高管</li>
<li><code>len(df[df['个人简历'].str.contains('北京大学')])</code> 保留有「北京大学」学习经历的高管</li>
<li><code>len(df[df['个人简历'].str.contains('清华大学|北京大学')])</code> 保留有「清华大学」或「北京大学」学习经历的高管</li>
<li><code>len(df[df['个人简历'].str.contains('清华大学') &amp; df['个人简历'].str.contains('北京大学')])</code> 保留同时有「清华大学」和「北京大学」学习经历的高管</li>
</ul>
<p>第三个(北大清华)表达式的数量最多，等于前两者之和减去两者的交集(即第四个表达式的数量)；第四个表达式的数量最少。注意，逻辑【或|】【且&amp;】条件可以串联任意多个</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#统计有【清华大学】学习经历的高管人数</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;清华大学&#39;</span><span class="p">)])</span>
</code></pre></div><p>Run</p>
<pre><code>10377
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;北京大学&#39;</span><span class="p">)])</span>
</code></pre></div><p>Run</p>
<pre><code>8709
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;清华大学|北京大学&#39;</span><span class="p">)])</span>
</code></pre></div><p>Run</p>
<pre><code>18647
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;清华大学&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;北京大学&#39;</span><span class="p">)])</span>
</code></pre></div><p>Run</p>
<pre><code>439
</code></pre>
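上面四个计数正好可以用容斥原理互相核对：并集人数等于两者人数之和减去交集人数。下面的小函数是一个核对草稿（函数名为本文示意，沿用上文的 df；regex=False 表示按字面匹配）。

```python
def check_inclusion_exclusion(series, word_a, word_b):
    """核对 |A∪B| = |A| + |B| - |A∩B|，返回 (|A|, |B|, |A∪B|, |A∩B|)"""
    a = set(series.index[series.str.contains(word_a, regex=False)])
    b = set(series.index[series.str.contains(word_b, regex=False)])
    union = a | b
    both = a.intersection(b)
    assert len(union) == len(a) + len(b) - len(both)
    return len(a), len(b), len(union), len(both)
```

按正文数据，check_inclusion_exclusion(df['个人简历'], '清华大学', '北京大学') 应返回 (10377, 8709, 18647, 439)，即 18647 = 10377 + 8709 - 439。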
<br>
<h3 id="24-大学高管数量排行榜">2.4 大学高管数量排行榜</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#测试列表(凭记忆手动输入的大学，各位可以自己设计测试列表；注意「大连理工大大学」多打了一个「大」字，它的匹配人数会是 0)</span>
<span class="n">test_universitys</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;清华大学&#39;</span><span class="p">,</span> <span class="s1">&#39;北京大学&#39;</span><span class="p">,</span> <span class="s1">&#39;中国人民大学&#39;</span><span class="p">,</span> <span class="s1">&#39;浙江大学&#39;</span><span class="p">,</span> 
                    <span class="s1">&#39;上海交通大学&#39;</span><span class="p">,</span> <span class="s1">&#39;西安交通大学&#39;</span><span class="p">,</span> <span class="s1">&#39;同济大学&#39;</span><span class="p">,</span> <span class="s1">&#39;南开大学&#39;</span><span class="p">,</span> <span class="s1">&#39;天津大学&#39;</span><span class="p">,</span> 
                    <span class="s1">&#39;武汉大学&#39;</span><span class="p">,</span> <span class="s1">&#39;华中科技大学&#39;</span><span class="p">,</span> <span class="s1">&#39;中国科学技术大学&#39;</span><span class="p">,</span> <span class="s1">&#39;南京大学&#39;</span><span class="p">,</span>
                    <span class="s1">&#39;中山大学&#39;</span><span class="p">,</span> <span class="s1">&#39;中南大学&#39;</span><span class="p">,</span> <span class="s1">&#39;四川大学&#39;</span><span class="p">,</span> <span class="s1">&#39;重庆大学&#39;</span><span class="p">,</span> <span class="s1">&#39;兰州大学&#39;</span><span class="p">,</span> <span class="s1">&#39;湖南大学&#39;</span><span class="p">,</span> 
                    <span class="s1">&#39;山东大学&#39;</span><span class="p">,</span> <span class="s1">&#39;吉林大学&#39;</span><span class="p">,</span> <span class="s1">&#39;大连理工大大学&#39;</span><span class="p">,</span> <span class="s1">&#39;东北大学&#39;</span><span class="p">,</span> <span class="s1">&#39;北京航空航天大学&#39;</span><span class="p">,</span> <span class="s1">&#39;中国地质大学&#39;</span><span class="p">]</span>


<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;大学高管人数排行&#39;</span><span class="p">)</span>

<span class="n">uni_infos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">university</span> <span class="ow">in</span> <span class="n">test_universitys</span><span class="p">:</span>
    <span class="n">num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">university</span><span class="p">)])</span>
    <span class="n">uni_infos</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">university</span><span class="p">,</span> <span class="n">num</span><span class="p">))</span>
    
<span class="n">uni_infos</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">uni_infos</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">uni_infos</span>
</code></pre></div><p>Run</p>
<pre><code>大学高管人数排行

[('清华大学', 10377),
 ('北京大学', 8709),
 ('中国人民大学', 7012),
 ('浙江大学', 5816),
 ('中山大学', 4065),
 ('上海交通大学', 3844),
 ('武汉大学', 3578),
 ('南京大学', 3272),
 ('西安交通大学', 2972),
 ('南开大学', 2716),
 ('湖南大学', 2502),
 ('华中科技大学', 2356),
 ('同济大学', 2089),
 ('吉林大学', 2044),
 ('四川大学', 1934),
 ('山东大学', 1847),
 ('中南大学', 1615),
 ('天津大学', 1598),
 ('重庆大学', 1440),
 ('北京航空航天大学', 1334),
 ('东北大学', 1241),
 ('中国科学技术大学', 842),
 ('兰州大学', 745),
 ('中国地质大学', 437),
 ('大连理工大大学', 0)]
</code></pre>
<p>注意：测试列表中的「大连理工大大学」多写了一个「大」字（笔误），因此匹配人数为 0。</p>
<br>
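<p>上面「循环统计 + sorted 排序」的过程也可以交给 pandas 完成。以下为一个最小示意（玩具数据为虚构，仅演示思路）：把各大学的命中人数汇成 Series 后直接 sort_values 降序排序，效果与原文的 sorted(..., reverse=True) 相同。</p>

```python
import pandas as pd

# 玩具数据为虚构，仅作演示
df = pd.DataFrame({'个人简历': ['清华大学毕业', '北京大学本科、清华大学博士', '浙江大学毕业']})
test_universitys = ['清华大学', '北京大学', '浙江大学']

# 各大学命中人数汇成 Series，再降序排序
counts = pd.Series(
    {u: df['个人简历'].str.contains(u).sum() for u in test_universitys}
).sort_values(ascending=False)
print(counts)  # 清华大学 2、北京大学 1、浙江大学 1
```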
<h3 id="25-统计文本中指定词语出现次数">2.5 统计文本中指定词语出现次数</h3>
<p>例如统计每位高管简历中「大学」一词的出现次数</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;大学&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<pre><code>0         0
1         2
2         0
3         0
4         0
         ..
900882    0
900883    0
900884    0
900885    0
900886    0
Name: 个人简历, Length: 736970, dtype: int64
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;高管总人数: &#39;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="c1">#简历中无「大学」字眼</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;无大学经历高管人数:&#39;</span> <span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;大学&#39;</span><span class="p">)</span><span class="o">==</span><span class="mi">0</span><span class="p">]))</span>
<span class="c1">#简历中有「大学」字眼</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;有大学经历高管人数:&#39;</span> <span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;大学&#39;</span><span class="p">)</span><span class="o">&gt;</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div><p>Run</p>
<pre><code>高管总人数:  736970
无大学经历高管人数: 515172
有大学经历高管人数: 221798
</code></pre>
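<p>「无大学经历」与「有大学经历」两组互斥且穷尽，人数相加应等于总人数（原文中 515172 + 221798 = 736970，恰好相等）。下面用一份虚构的玩具数据演示这一自检思路：</p>

```python
import pandas as pd

# 玩具数据为虚构，仅作演示；原文 df 为 73 万余条高管简历
df = pd.DataFrame({'个人简历': ['清华大学毕业', '无相关信息', '北京大学硕士、清华大学博士']})

n_without = (df['个人简历'].str.count('大学') == 0).sum()
n_with = (df['个人简历'].str.count('大学') > 0).sum()

# 两组互斥且穷尽，人数相加应等于总数
assert n_without + n_with == len(df)
print(n_without, n_with)  # 1 2
```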
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#有些企业单位名字中带有「大学」，但这类企业非常少。</span>
<span class="c1">#「大学」词语出现次数可以近似看做学习经历次数</span>
<span class="c1">#如此， 1可以看做本科学历，2看做研究生学历， 3看做博士学历</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;大学&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="output_14_1.png" alt="png"  /></p>
<br>
<h3 id="26-找出每位高管的出生年份用正则表达式">2.6 找出每位高管的出生年份(用正则表达式)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">r&#39;\d</span><span class="si">{4}</span><span class="s1">&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<pre><code>0                                                    [1969]
1                      [1965, 1984, 1986, 1990, 1994, 1995]
2         [1972, 1998, 1999, 2000, 2015, 2002, 2016, 200...
3         [1960, 1982, 1989, 1990, 1991, 1991, 2002, 200...
4         [1962, 2009, 1985, 1996, 1996, 2008, 1993, 200...
                                ...                        
900882                                                   []
900883                                                   []
900884                                                   []
900885                                                   []
900886                                                   []
Name: 个人简历, Length: 736970, dtype: object
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">birth_year</span><span class="p">(</span><span class="n">years</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1">#返回出生年份</span>
        <span class="k">return</span> <span class="n">years</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
        <span class="c1">#没有年份的，返回0</span>
        <span class="k">return</span> <span class="mi">0</span>
    
    
<span class="c1">#高管出生年份</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">r&#39;\d</span><span class="si">{4}</span><span class="s1">&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">birth_year</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<pre><code>0         1969
1         1965
2         1972
3         1960
4         1962
          ... 
900882       0
900883       0
900884       0
900885       0
900886       0
Name: 个人简历, Length: 736970, dtype: object
</code></pre>
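<p>原文用 findall + 自定义 birth_year 函数取首个年份；pandas 也内置了等价的向量化写法 <code>str.extract</code>：它只返回第一个匹配（即简历中最早出现的年份），无匹配处为 NaN，再 fillna(0) 即与上面的结果对应。以下为基于虚构玩具数据的最小示意：</p>

```python
import pandas as pd

# 玩具数据为虚构，仅作演示
df = pd.DataFrame({'个人简历': ['1969年出生，1992年毕业于北京大学', '无日期信息']})

# str.extract 只取第一个匹配，无匹配处为 NaN，fillna(0) 与原文 birth_year 的兜底值一致
birth = df['个人简历'].str.extract(r'(\d{4})')[0].fillna(0)
print(birth.tolist())  # ['1969', 0]
```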
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#高管时间点个数(感觉可以看做经历的个数)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;个人简历&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">r&#39;\d</span><span class="si">{4}</span><span class="s1">&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">ys</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">ys</span><span class="p">)))</span>
</code></pre></div><p>Run</p>
<pre><code>0          1
1          6
2         10
3         10
4          8
          ..
900882     0
900883     0
900884     0
900885     0
900886     0
Name: 个人简历, Length: 736970, dtype: int64
</code></pre>
<p><br><br></p>
<h2 id="获取代码">获取代码</h2>
<p><a href="top-ceo.ipynb"><strong>点击下载ipynb代码文件</strong></a></p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>免费下载 | 进阶Python学习资料</title>
      <link>https://textdata.cn/blog/2023-07-19-advanced-python-mastery/</link>
      <pubDate>Wed, 19 Jul 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-07-19-advanced-python-mastery/</guid>
      <description>&lt;h2 id=&#34;一advanced-python-mastery&#34;&gt;一、Advanced Python Mastery&lt;/h2&gt;
&lt;h3 id=&#34;11-概要&#34;&gt;1.1 概要&lt;/h3&gt;
&lt;p&gt;这是一门&lt;strong&gt;以练习为导向的高级 Python 编程课程&lt;/strong&gt;，十多年来在企业培训循环中经过了数百次实战测试。 由 David Beazley 撰写，Python Cookbook 第三版 (O&amp;rsquo;Reilly) 和 Python Distilled (Addison-Wesley) 的作者。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-目标受众&#34;&gt;1.2 目标受众&lt;/h3&gt;
&lt;p&gt;本课程适合那些想要超越简短脚本而编写更复杂程序的 Python 程序员。 主题重点关注流行库和框架中使用的编程技术。 主要目标是更好地理解 Python 语言本身，以便您能够理解其他人的代码，并将新发现的知识应用到您自己的项目中。&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;13-如何参加课程&#34;&gt;1.3 如何参加课程&lt;/h3&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/resource.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;要学习本课程，您应该首先将 GitHub 存储库分叉/克隆到您自己的计算机上。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;假设您在适当的 Python 开发环境中本地工作。 这意味着正确安装 Python、编辑器/IDE 以及您通常安装以在 Python 上工作的任何其他工具。 由于使用多个文件和模块导入，不建议使用Notebooks。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PythonMastery.pdf&lt;/strong&gt; 文件包含详细的演示幻灯片(共548页)。 课程练习和建议的时间安排都有明确的说明。 您需要将其保留在身边（我建议您下载并使用本地 PDF 查看器进行查看）。 从这里开始！&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Excercise/&lt;/strong&gt; 包含所有课程练习。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solutions/&lt;/strong&gt;  已完全制定出解决方案代码。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data/&lt;/strong&gt; 包含课程中使用的一些数据文件。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;成功完成该课程可能需要 &lt;strong&gt;30-50 小时&lt;/strong&gt;的工作。&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二获取学习资料&#34;&gt;二、获取学习资料&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Github下载  &lt;a href=&#34;https://github.com/dabeaz-course/python-mastery.git&#34;&gt;https://github.com/dabeaz-course/python-mastery.git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;百度网盘 链接: &lt;a href=&#34;https://pan.baidu.com/s/1bwWM33rM37a2Uq0lUbSqnA&#34;&gt;https://pan.baidu.com/s/1bwWM33rM37a2Uq0lUbSqnA&lt;/a&gt; 提取码: 7z87&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一advanced-python-mastery">一、Advanced Python Mastery</h2>
<h3 id="11-概要">1.1 概要</h3>
<p>这是一门<strong>以练习为导向的高级 Python 编程课程</strong>，十多年来在企业培训中经过了数百次实战检验。作者是 David Beazley，他也是《Python Cookbook》第三版 (O&rsquo;Reilly) 与《Python Distilled》(Addison-Wesley) 的作者。</p>
<br>
<h3 id="12-目标受众">1.2 目标受众</h3>
<p>本课程适合那些想要超越简短脚本而编写更复杂程序的 Python 程序员。 主题重点关注流行库和框架中使用的编程技术。 主要目标是更好地理解 Python 语言本身，以便您能够理解其他人的代码，并将新发现的知识应用到您自己的项目中。</p>
<br>
<h3 id="13-如何参加课程">1.3 如何参加课程</h3>
<p><img loading="lazy" src="img/resource.png" alt=""  />
</p>
<ul>
<li>
<p>要学习本课程，您应该首先将 GitHub 存储库分叉/克隆到您自己的计算机上。</p>
</li>
<li>
<p>假设您在适当的 Python 开发环境中本地工作。 这意味着正确安装 Python、编辑器/IDE 以及您通常安装以在 Python 上工作的任何其他工具。 由于使用多个文件和模块导入，不建议使用Notebooks。</p>
</li>
<li>
<p><strong>PythonMastery.pdf</strong> 文件包含详细的演示幻灯片(共548页)。 课程练习和建议的时间安排都有明确的说明。 您需要将其保留在身边（我建议您下载并使用本地 PDF 查看器进行查看）。 从这里开始！</p>
</li>
<li>
<p><strong>Excercise/</strong> 包含所有课程练习。</p>
</li>
<li>
<p><strong>Solutions/</strong> 包含完整的参考解答代码。</p>
</li>
<li>
<p><strong>Data/</strong> 包含课程中使用的一些数据文件。</p>
</li>
<li>
<p>成功完成该课程可能需要 <strong>30-50 小时</strong>的工作。</p>
</li>
</ul>
<p><br><br></p>
<h2 id="二获取学习资料">二、获取学习资料</h2>
<ul>
<li>Github下载  <a href="https://github.com/dabeaz-course/python-mastery.git">https://github.com/dabeaz-course/python-mastery.git</a></li>
<li>百度网盘 链接: <a href="https://pan.baidu.com/s/1bwWM33rM37a2Uq0lUbSqnA">https://pan.baidu.com/s/1bwWM33rM37a2Uq0lUbSqnA</a> 提取码: 7z87</li>
</ul>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>2个免费可用的chatGPT替代产品</title>
      <link>https://textdata.cn/blog/2023-07-12-free-alternative-solution-for-chatgpt-product/</link>
      <pubDate>Wed, 12 Jul 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-07-12-free-alternative-solution-for-chatgpt-product/</guid>
      <description>&lt;p&gt;最近半年，由于注册繁琐，且网络不稳定，国内不能很好的利用chatGPT辅助我们进行生产力的提升。现在(未来）分享几个替代方案&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;百度&lt;strong&gt;文心一言&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;科大讯飞&lt;strong&gt;讯飞星火&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;阿里&lt;strong&gt;通义千问&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;new Bing&lt;/strong&gt; 底层用的也是chatGPT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Claude&lt;/strong&gt;  chatGPT的竞品&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;国产的这几个，大家需要申请和等待， 如果着急，可以直接跳过看new bing和Claude&lt;/strong&gt;。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-文心一言&#34;&gt;1. 文心一言&lt;/h2&gt;
&lt;p&gt;百度是国内目前功能最强大的生成式AI供应商，但我至今没得到体验的机会。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/yiyan.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-讯飞星火&#34;&gt;2. 讯飞星火&lt;/h2&gt;
&lt;p&gt;科大讯飞是在自然语言处理领域有着不错的技术积累， 在语音类场景，有着中文最大的语音数据积累。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/xinghuo.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;经过测试发现， 任意的回答都有对应的语音服务， 听起来非常舒服，人声优美。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/xinghuo2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;3-通义千问&#34;&gt;3. 通义千问&lt;/h2&gt;
&lt;p&gt;阿里旗下产品， 大厂实力有保证，可惜依然需要等待，但我至今没得到体验的机会。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/qianwen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;4-new-bing&#34;&gt;4. New Bing&lt;/h2&gt;
&lt;p&gt;安装Microsoft Edge， 点击打开 &lt;a href=&#34;https://www.bing.com/new&#34;&gt;新必应 - 了解详细信息 (bing.com)&lt;/a&gt;， 进入页面。&lt;/p&gt;
&lt;p&gt;new bing底层用的也是chatGPT， 结合搜索结果， 这是目前new bing的一大优点。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/bing-1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/bing-2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/bing-3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;5-claude&#34;&gt;5. Claude&lt;/h2&gt;
&lt;p&gt;Claude注册简单，使用方便， 简单的问答实验效果不错。支持文件上传，最大可上传5个文件 ，每个文件10M。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/Claude-home.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/Claude.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
</description>
<content:encoded><![CDATA[<p>最近半年，由于注册繁琐且网络不稳定，国内用户难以很好地利用chatGPT辅助我们提升生产力。现在（以及未来）分享几个替代方案</p>
<ol>
<li>百度<strong>文心一言</strong></li>
<li>科大讯飞<strong>讯飞星火</strong></li>
<li>阿里<strong>通义千问</strong></li>
<li><strong>new Bing</strong> 底层用的也是chatGPT</li>
<li><strong>Claude</strong>  chatGPT的竞品</li>
</ol>
<p><strong>国产的这几个，大家需要申请和等待， 如果着急，可以直接跳过看new bing和Claude</strong>。</p>
<p><br><br></p>
<h2 id="1-文心一言">1. 文心一言</h2>
<p>百度是国内目前功能最强大的生成式AI供应商，但我至今没得到体验的机会。</p>
<p><img loading="lazy" src="img/yiyan.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="2-讯飞星火">2. 讯飞星火</h2>
<p>科大讯飞是在自然语言处理领域有着不错的技术积累， 在语音类场景，有着中文最大的语音数据积累。</p>
<p><img loading="lazy" src="img/xinghuo.png" alt=""  />
</p>
<p>经过测试发现，每条回答都配有对应的语音播报，听起来非常舒服，人声优美。</p>
<p><img loading="lazy" src="img/xinghuo2.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="3-通义千问">3. 通义千问</h2>
<p>阿里旗下产品，大厂实力有保证，可惜依然需要排队等待，我至今没得到体验的机会。</p>
<p><img loading="lazy" src="img/qianwen.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="4-new-bing">4. New Bing</h2>
<p>安装Microsoft Edge， 点击打开 <a href="https://www.bing.com/new">新必应 - 了解详细信息 (bing.com)</a>， 进入页面。</p>
<p>new bing底层用的也是chatGPT，并能结合搜索结果作答，这是目前new bing的一大优点。</p>
<p><img loading="lazy" src="img/bing-1.png" alt=""  />
</p>
<p><img loading="lazy" src="img/bing-2.png" alt=""  />
</p>
<p><img loading="lazy" src="img/bing-3.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="5-claude">5. Claude</h2>
<p>Claude注册简单，使用方便，简单的问答实验效果不错。支持文件上传，最多可上传5个文件，每个文件10M。</p>
<p><img loading="lazy" src="img/Claude-home.png" alt=""  />
</p>
<p><img loading="lazy" src="img/Claude.png" alt=""  />
</p>
<br>
<br>
]]></content:encoded>
    </item>
    
    <item>
      <title>mercury | 在jupyter notebook中创建Web应用程序</title>
      <link>https://textdata.cn/blog/2023-06-12-mercury-fast-webapp/</link>
      <pubDate>Mon, 12 Jun 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-06-12-mercury-fast-webapp/</guid>
      <description>&lt;h2 id=&#34;一介绍&#34;&gt;一、介绍&lt;/h2&gt;
&lt;p&gt;Mercury允许您在Python笔记本中添加交互式小部件，因此您可以将笔记本共享为Web应用程序。&lt;/p&gt;
&lt;h3 id=&#34;11-功能&#34;&gt;1.1 功能&lt;/h3&gt;
&lt;p&gt;Mercury提供了一套带有简单单元格重新执行的小部件,您可以使用Mercury构建以下内容：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;将您的笔记本转化为漂亮的Web应用程序，&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;创建带有小部件的交互式演示文稿，您可以在展示过程中重新计算幻灯片，&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;将笔记本作为静态网站进行共享，&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;使用小部件构建数据丰富的仪表板，&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;创建具有PDF导出、自动调度和电子邮件通知功能的报告（即将推出），&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;将Python笔记本作为REST API端点提供服务（即将推出）。&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3 id=&#34;12-特点&#34;&gt;1.2 特点&lt;/h3&gt;
&lt;p&gt;Mercury的特点包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;使用Python代码添加小部件-无需前端经验！&lt;/li&gt;
&lt;li&gt;隐藏或显示笔记本的代码，&lt;/li&gt;
&lt;li&gt;将已执行的笔记本导出为PDF或HTML，&lt;/li&gt;
&lt;li&gt;共享多个笔记本-没有限制！&lt;/li&gt;
&lt;li&gt;将笔记本嵌入到任何网站中，&lt;/li&gt;
&lt;li&gt;轻松在笔记本中上传和下载文件，&lt;/li&gt;
&lt;li&gt;为笔记本添加身份验证（即将推出），&lt;/li&gt;
&lt;li&gt;计划自动笔记本执行（即将推出）。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;13-安装&#34;&gt;1.3 安装&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mercury&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;14-运行&#34;&gt;1.4 运行&lt;/h3&gt;
&lt;p&gt;命令行执行&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;mercury&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;请访问 &lt;strong&gt;127.0.0.1:8000&lt;/strong&gt; 查看演示笔记本。&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二示例&#34;&gt;二、示例&lt;/h2&gt;
&lt;p&gt;下面是一个简单的代码示例，创建一个小部件并显示其值。您可以在Jupyter Notebook中与小部件进行交互。小部件的值将会被更新。但是，要在其他单元格中看到更新，您需要手动执行它们。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;mercury&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;mr&lt;/span&gt; 

&lt;span class=&#34;c1&#34;&gt;#创建一个文本小部件：&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;name&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Piotr&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;label&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;What is your name?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 



&lt;span class=&#34;c1&#34;&gt;# 打印小部件的值：&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Hello &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;value&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Jupyter Notebook中的代码截图&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/hello-world-notebook-ola.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三mercury-应用程序&#34;&gt;三、Mercury 应用程序&lt;/h2&gt;
&lt;p&gt;使用 Mercury 将笔记本作为 Web 应用程序运行。小部件更改后，单元格会自动重新执行。Mercury 仅重新执行具有小部件定义及其下方的单元格。在示例中，小部件更新后，单元格 2 和 3 会被重新执行。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/hello-world-app-ola.gif&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一介绍">一、介绍</h2>
<p>Mercury允许您在Python笔记本中添加交互式小部件，因此您可以将笔记本共享为Web应用程序。</p>
<h3 id="11-功能">1.1 功能</h3>
<p>Mercury提供了一套支持单元格重新执行的交互式小部件。您可以使用Mercury构建以下内容：</p>
<ul>
<li>
<p>将您的笔记本转化为漂亮的Web应用程序，</p>
</li>
<li>
<p>创建带有小部件的交互式演示文稿，您可以在展示过程中重新计算幻灯片，</p>
</li>
<li>
<p>将笔记本作为静态网站进行共享，</p>
</li>
<li>
<p>使用小部件构建数据丰富的仪表板，</p>
</li>
<li>
<p>创建具有PDF导出、自动调度和电子邮件通知功能的报告（即将推出），</p>
</li>
<li>
<p>将Python笔记本作为REST API端点提供服务（即将推出）。</p>
</li>
</ul>
<br>
<h3 id="12-特点">1.2 特点</h3>
<p>Mercury的特点包括：</p>
<ul>
<li>使用Python代码添加小部件-无需前端经验！</li>
<li>隐藏或显示笔记本的代码，</li>
<li>将已执行的笔记本导出为PDF或HTML，</li>
<li>共享多个笔记本-没有限制！</li>
<li>将笔记本嵌入到任何网站中，</li>
<li>轻松在笔记本中上传和下载文件，</li>
<li>为笔记本添加身份验证（即将推出），</li>
<li>计划自动笔记本执行（即将推出）。</li>
</ul>
<p><br><br></p>
<h3 id="13-安装">1.3 安装</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip install mercury
</code></pre></div><br>
<h3 id="14-运行">1.4 运行</h3>
<p>命令行执行</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">mercury run
</code></pre></div><p>请访问 <strong>127.0.0.1:8000</strong> 查看演示笔记本。</p>
<p><br><br></p>
<h2 id="二示例">二、示例</h2>
<p>下面是一个简单的代码示例，创建一个小部件并显示其值。您可以在Jupyter Notebook中与小部件进行交互。小部件的值将会被更新。但是，要在其他单元格中看到更新，您需要手动执行它们。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">mercury</span> <span class="k">as</span> <span class="nn">mr</span> 

<span class="c1">#创建一个文本小部件：</span>

<span class="n">name</span> <span class="o">=</span> <span class="n">mr</span><span class="o">.</span><span class="n">Text</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s2">&#34;Piotr&#34;</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s2">&#34;What is your name?&#34;</span><span class="p">)</span> 



<span class="c1"># 打印小部件的值：</span>

<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Hello </span><span class="si">{</span><span class="n">name</span><span class="o">.</span><span class="n">value</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span> 
</code></pre></div><p>Jupyter Notebook中的代码截图</p>
<p><img loading="lazy" src="img/hello-world-notebook-ola.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="三mercury-应用程序">三、Mercury 应用程序</h2>
<p>使用 Mercury 将笔记本作为 Web 应用程序运行。小部件更改后，单元格会自动重新执行。Mercury 仅重新执行具有小部件定义及其下方的单元格。在示例中，小部件更新后，单元格 2 和 3 会被重新执行。</p>
<p><img loading="lazy" src="img/hello-world-app-ola.gif" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 使用ggdag包绘制有向图</title>
      <link>https://textdata.cn/blog/2023-06-02-r-ggdag/</link>
      <pubDate>Fri, 02 Jun 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-06-02-r-ggdag/</guid>
      <description>在tidyverse环境中使用ggdag可以轻松地使用dagitty。您可以直接整理dagitty对象，或使用方便的函数使用更接近R语言风格的语法创建DAGs。</description>
<content:encoded><![CDATA[<p>在tidyverse环境中，ggdag让您可以轻松地使用dagitty：既可以直接整理(tidy)dagitty对象，也可以借助便捷函数、以更贴近R语言风格的语法创建DAG。</p>
<p><br><br></p>
<h2 id="安装">安装</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">install.packages(&#34;ggdag&#34;)
</code></pre></div><p><br><br></p>
<h2 id="代码">代码</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">dagitty</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">ggdag</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>

<span class="n">dag</span> <span class="o">&lt;-</span> <span class="nf">dagitty</span><span class="p">(</span><span class="s">&#34;dag{y &lt;- z -&gt; x}&#34;</span><span class="p">)</span>
<span class="nf">tidy_dagitty</span><span class="p">(</span><span class="n">dag</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="nf">ggdag</span><span class="p">(</span><span class="n">dag</span><span class="p">,</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">&#34;circle&#34;</span><span class="p">)</span>
</code></pre></div><p><img loading="lazy" src="img/1.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="nf">library</span><span class="p">(</span><span class="n">ggdag</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>

<span class="c1">#  example from the dagitty package</span>
<span class="n">dag</span> <span class="o">&lt;-</span> <span class="n">dagitty</span><span class="o">::</span><span class="nf">dagitty</span><span class="p">(</span><span class="s">&#34;dag {
</span><span class="s">    y &lt;- x &lt;- z1 &lt;- v -&gt; z2 -&gt; y
</span><span class="s">    z1 &lt;- w1 &lt;-&gt; w2 -&gt; z2
</span><span class="s">    x &lt;- w1 -&gt; y
</span><span class="s">    x &lt;- w2 -&gt; y
</span><span class="s">    x [exposure]
</span><span class="s">    y [outcome]
</span><span class="s">  }&#34;</span><span class="p">)</span>

<span class="n">tidy_dag</span> <span class="o">&lt;-</span> <span class="nf">tidy_dagitty</span><span class="p">(</span><span class="n">dag</span><span class="p">)</span>
<span class="nf">ggdag</span><span class="p">(</span><span class="n">tidy_dag</span><span class="p">)</span> <span class="o">+</span>
  <span class="nf">theme_dag</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/2.png" alt=""  />
</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 使用groupby或resample按月份分组绘制高管违规量趋势图</title>
      <link>https://textdata.cn/blog/2023-05-31-resample-groupby-in-pandas/</link>
      <pubDate>Wed, 31 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-31-resample-groupby-in-pandas/</guid>
      <description>&lt;p&gt;在数据分析和处理中，经常需要按照月份对时间序列数据进行分组和聚合。今天以高管违规数据为例， 想根据这份数据绘制月度高管违规量趋势，需要按照月份对数据进行分组，可以使用resample或groupby。本文知识点&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;resample实现&lt;/li&gt;
&lt;li&gt;groupby实现&lt;/li&gt;
&lt;li&gt;resample和groupby运算结果是什么数据类型&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一resample实现步骤&#34;&gt;一、resample实现步骤&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;导入xlsx数据 (&lt;a href=&#34;https://mp.weixin.qq.com/s/IFarSFd7v22PL2EjJdicXw&#34;&gt;点击跳转获取数据&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;使用set_index函数将&amp;rsquo;公告日期&amp;rsquo;列设置为索引，以便能够使用时间序列的功能。&lt;/li&gt;
&lt;li&gt;使用resample函数并指定频率为&amp;rsquo;M&#39;（表示按照月份）来对时间序列数据进行分组。使用了size函数获取每组的记录数&lt;/li&gt;
&lt;li&gt;打印了分组结果df_resampled，其中每一行代表一个月份的总和。&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;上市公司高管违规-原始数据.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df.png&#34; alt=&#34;&#34;  /&gt;
&lt;br&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Setting the &amp;#39;公告日期&amp;#39; column as the index (already done above)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df.set_index(&amp;#39;公告日期&amp;#39;, inplace=True)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Group the time series by month&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_resampled&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df_resampled = df.resample(&amp;#39;30D&amp;#39;).size()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Print the grouped result&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_resampled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    公告日期
    1997-01-31      1
    1997-02-28      0
    1997-03-31      0
    1997-04-30      0
    1997-05-31      0
                 ... 
    2022-08-31    453
    2022-09-30    479
    2022-10-31    216
    2022-11-30    525
    2022-12-31    343
    Freq: M, Length: 312, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# detect the operating system&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the global font&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;月度违规量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;月度上市公司高管违规量(1997-2022)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_resampled&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_3_1.svg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二groupby实现步骤&#34;&gt;二、Steps with groupby&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Load the xlsx data&lt;/li&gt;
&lt;li&gt;Use set_index to make the &amp;#39;公告日期&amp;#39; column the index, so that pandas time-series functionality becomes available.&lt;/li&gt;
&lt;li&gt;Use groupby with pd.Grouper and frequency &amp;#39;M&amp;#39; (i.e., by month) to group the time-series data, then call size to get the number of records in each group.&lt;/li&gt;
&lt;li&gt;Print the grouped result df2_grouped, where each row is the record count for one month.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;上市公司高管违规-原始数据.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Set the &amp;#39;公告日期&amp;#39; column as the index&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Group the time series by month&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2_grouped&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Print the grouped result&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2_grouped&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    公告日期
    1997-01-31      1
    1997-02-28      0
    1997-03-31      0
    1997-04-30      0
    1997-05-31      0
                 ... 
    2022-08-31    453
    2022-09-30    479
    2022-10-31    216
    2022-11-30    525
    2022-12-31    343
    Freq: M, Length: 312, dtype: int64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# detect the operating system&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# set the global font&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;月度违规量&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;月度上市公司高管违规量(1997-2022)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2_grouped&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_3_1.svg&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三深入理解&#34;&gt;三、A closer look&lt;/h2&gt;
&lt;p&gt;What kind of object do df2.resample(&amp;#39;M&amp;#39;) and df2.groupby(pd.Grouper(freq=&amp;#39;M&amp;#39;)) actually return, and what are its characteristics?&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    &amp;lt;pandas.core.resample.DatetimeIndexResampler object at 0x7fece05fb160&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    &amp;lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fed31571df0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;Whenever you run into an opaque repr such as &amp;lt; object at 0x7fece05fb160&amp;gt; and have no idea what it contains, a for loop is an easy way to take the black box apart.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
......
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Grouper&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;freq&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
&amp;lt;class &amp;#39;tuple&amp;#39;&amp;gt; 2
......
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Inspection shows that both df2.resample(&amp;#39;M&amp;#39;) and df2.groupby(pd.Grouper(freq=&amp;#39;M&amp;#39;)) are made up of tuples, and each tuple consists of a date (a Timestamp) and the corresponding DataFrame of records for that period.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]),&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
&amp;lt;class &amp;#39;pandas._libs.tslibs.timestamps.Timestamp&amp;#39;&amp;gt; &amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
......
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
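Since each element turns out to be a (Timestamp, DataFrame) tuple, the size() counts can be rebuilt by hand, which is a handy sanity check. Below is a minimal sketch on a hypothetical toy dataset (the rows and the extra column are made up, not the real violation data); note that pandas 2.2+ renames the month-end alias 'M' to 'ME'.

```python
import pandas as pd

# Hypothetical toy data standing in for the violation records
df2 = pd.DataFrame(
    {'公告日期': pd.to_datetime(['2022-01-05', '2022-01-20', '2022-02-11']),
     '类型': ['A', 'B', 'A']}  # made-up extra column
).set_index('公告日期')

try:
    monthly = df2.groupby(pd.Grouper(freq='ME'))  # pandas >= 2.2 spells month-end 'ME'
except ValueError:
    monthly = df2.groupby(pd.Grouper(freq='M'))   # older pandas

# Rebuild the size() result by hand from the (Timestamp, DataFrame) tuples
counts = {month_end: len(sub_df) for month_end, sub_df in monthly}
manual = pd.Series(counts)
print(manual)
```

The comprehension is where the tuple structure pays off: each loop step unpacks directly into the period label and the rows belonging to it.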
</description>
<content:encoded><![CDATA[<p>In data analysis and processing we often need to group and aggregate time-series data by month. Taking executive-violation data of listed companies as an example: to plot the monthly trend of executive violations, the records have to be grouped by month, which can be done with either resample or groupby. This post covers:</p>
<ol>
<li>Implementation with resample</li>
<li>Implementation with groupby</li>
<li>What type of object resample and groupby return</li>
</ol>
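Before touching the real xlsx file, the two approaches can be previewed on a tiny synthetic dataset (the rows and the extra column below are made up for illustration); note that pandas 2.2+ renames the month-end frequency alias 'M' to 'ME', so the sketch tries both:

```python
import pandas as pd

# Hypothetical stand-in for the violation records; the real data comes from the xlsx file
df = pd.DataFrame(
    {'公告日期': pd.to_datetime(['2022-01-05', '2022-01-20', '2022-02-11', '2022-04-02']),
     '类型': ['A', 'B', 'A', 'C']}  # made-up extra column
).set_index('公告日期')

try:
    monthly_resample = df.resample('ME').size()                     # pandas >= 2.2
    monthly_groupby = df.groupby(pd.Grouper(freq='ME')).size()
except ValueError:
    monthly_resample = df.resample('M').size()                      # older pandas
    monthly_groupby = df.groupby(pd.Grouper(freq='M')).size()

# Both bin by calendar month and include empty months (here 2022-03) with a count of 0
print(monthly_resample.equals(monthly_groupby))
```

On this toy index the two Series are identical, which is exactly what the real-data runs below show as well.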
<p><br><br></p>
<h2 id="一resample实现步骤">一、Steps with resample</h2>
<ol>
<li>Load the xlsx data (<a href="https://mp.weixin.qq.com/s/IFarSFd7v22PL2EjJdicXw">click here to get the data</a>)</li>
<li>Use set_index to make the &#39;公告日期&#39; column the index, so that pandas time-series functionality becomes available.</li>
<li>Use resample with frequency &#39;M&#39; (i.e., by month) to group the time-series data, then call size to get the number of records in each group.</li>
<li>Print the grouped result df_resampled, where each row is the record count for one month.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;上市公司高管违规-原始数据.xlsx&#39;</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;公告日期&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df.png" alt=""  />
<br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1"># Setting the &#39;公告日期&#39; column as the index (already done above)</span>
<span class="c1">#df.set_index(&#39;公告日期&#39;, inplace=True)</span>

<span class="c1"># Group the time series by month</span>
<span class="n">df_resampled</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">&#39;M&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="c1">#df_resampled = df.resample(&#39;30D&#39;).size()</span>

<span class="c1"># Print the grouped result</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df_resampled</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    公告日期
    1997-01-31      1
    1997-02-28      0
    1997-03-31      0
    1997-04-30      0
    1997-05-31      0
                 ... 
    2022-08-31    453
    2022-09-30    479
    2022-10-31    216
    2022-11-30    525
    2022-12-31    343
    Freq: M, Length: 312, dtype: int64
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>


<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># detect the operating system</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>


<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;日期&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;月度违规量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;月度上市公司高管违规量(1997-2022)&#39;</span><span class="p">)</span>
<span class="n">df_resampled</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_3_1.svg" alt=""  />
</p>
<p><br><br></p>
<h2 id="二groupby实现步骤">二、groupby实现步骤</h2>
<ol>
<li>导入xlsx数据</li>
<li>使用set_index函数将&lsquo;公告日期&rsquo;列设置为索引，以便能够使用时间序列的功能。</li>
<li>使用groupby函数并指定频率为&lsquo;M&rsquo;（表示按照月份）对时间序列数据进行分组，再用size函数获取每组的记录数。</li>
<li>打印分组结果df2_grouped，其中每一行代表一个月份的违规记录数。</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;上市公司高管违规-原始数据.xlsx&#39;</span><span class="p">)</span>

<span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">])</span>

<span class="c1"># 将&#39;date&#39;列设置为索引</span>
<span class="n">df2</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;公告日期&#39;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="c1"># 按月份对时间序列数据进行分组</span>
<span class="n">df2_grouped</span> <span class="o">=</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>

<span class="c1"># 打印分组结果</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df2_grouped</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    公告日期
    1997-01-31      1
    1997-02-28      0
    1997-03-31      0
    1997-04-30      0
    1997-05-31      0
                 ... 
    2022-08-31    453
    2022-09-30    479
    2022-10-31    216
    2022-11-30    525
    2022-12-31    343
    Freq: M, Length: 312, dtype: int64
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>


<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>


<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;日期&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;月度违规量&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;月度上市公司高管违规量(1997-2022)&#39;</span><span class="p">)</span>
<span class="n">df2_grouped</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_3_1.svg" alt=""  />
</p>
<p><br><br></p>
<h2 id="三深入理解">三、深入理解</h2>
<p>df2.resample(&lsquo;M&rsquo;) 或 df2.groupby(pd.Grouper(freq=&lsquo;M&rsquo;)) 返回的结果是什么类型的数据？有什么特点？</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">&#39;M&#39;</span><span class="p">)</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    &lt;pandas.core.resample.DatetimeIndexResampler object at 0x7fece05fb160&gt;
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">))</span> 
</code></pre></div><br>
<p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    &lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fed31571df0&gt;
</code></pre></div><br>
<p>只要遇到形如 &lt;... object at 0x7fece05fb160&gt; 的输出，就说明返回的是一个不易直接查看内部结构的对象。可以使用for循环拆解这个黑盒子。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">&#39;M&#39;</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
......
</code></pre></div><br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">&#39;M&#39;</span><span class="p">)):</span>
    <span class="nb">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
&lt;class &#39;tuple&#39;&gt; 2
......
</code></pre></div><p>经过检查发现，df2.resample(&lsquo;M&rsquo;) 或 df2.groupby(pd.Grouper(freq=&lsquo;M&rsquo;)) 内部都是由tuple组成的，而每个tuple又由「日期」（Timestamp）和对应的「DataFrame」组成。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df2</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">&#39;M&#39;</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
&lt;class &#39;pandas._libs.tslibs.timestamps.Timestamp&#39;&gt; &lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
......
</code></pre></div><p><br><br></p>
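上面的拆解可以用一段可独立运行的小例子复现：下面用几条合成的假想数据（仅作演示），同时验证 resample 与 groupby(pd.Grouper) 两种写法给出完全相同的月度计数。

```python
import pandas as pd

# 合成的假想数据：4 条带日期索引的记录，仅作演示
dates = pd.to_datetime(['1997-01-05', '1997-01-20', '1997-03-01', '1997-03-15'])
df = pd.DataFrame({'违规类型': ['A', 'B', 'A', 'C']}, index=dates)

# 两种写法：resample 与 groupby(pd.Grouper) 都按月末分箱并计数
# ('M' 在较新版本的 pandas 中建议写作 'ME')
by_resample = df.resample('M').size()
by_groupby = df.groupby(pd.Grouper(freq='M')).size()

# 两者结果逐项相同，没有记录的月份(2 月)都会补 0
print(by_resample.tolist())   # [2, 0, 2]
print(by_resample.index.tolist() == by_groupby.index.tolist())
```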
]]></content:encoded>
    </item>
    
    <item>
      <title>文本分析 | 使用「软余弦相似度」测量业绩说明会「答非所问程度」</title>
      <link>https://textdata.cn/blog/2023-05-23-soft-cosine-similarity/</link>
      <pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-23-soft-cosine-similarity/</guid>
      <description>&lt;h2 id=&#34;一答非所问相关文献&#34;&gt;一、「答非所问」相关文献&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;问:“公司的核心竞争力?”
答:“企业未来的发力肯定是围绕品牌和渠道发力，品牌又是重中之重．”
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;回答与问题之间的相似度越高，则回答与问题就越契合，回答质量越高。因此，在会计财经领域的研究中，&lt;strong&gt;答非所问程度&lt;/strong&gt;是一个很有使用价值的指标。&lt;/p&gt;
&lt;p&gt;卞世博,管之凡,阎志鹏.&lt;strong&gt;答非所问与市场反应:基于业绩说明会的研究&lt;/strong&gt;[J].管理科学学报,2021,24(04):109-126.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;摘要:对上市公司业绩说明会中投资者与管理层问答互动中管理层答非所问的现象进行了研究.本文以中小板和创业板上市公司召开的业绩说明会作为研究样本,利用文本分析方法对业绩说明会中管理层在回答投资者提问时答非所问的程度进行度量,进而实证分析了管理层的答非所问与市场反应和公司未来业绩表现之间的可能关联.结果发现:在控制其它因素之后,管理层的答非所问与市场反应之间呈现显著的负相关关系,即公司管理层的答非所问程度越高,随后公司股票的市场表现则就会越差,并且对于那些低分析师关注的公司尤为明显;而在公司未来业绩表现方面,管理层答非所问的程度越高,则公司未来的业绩表现则会越差.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;p&gt;郭照蕊,袁嘉浩,傅毅.&lt;strong&gt;上市公司“答非所问”程度与审计费用——基于年报问询函与回函的综合研究&lt;/strong&gt;[J].审计研究,2023,No.231(01):99-111.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;摘要:年报问询函是证券交易所向年报披露存疑的上市公司发出的函件，有问有答才构成一次有效的问询回合，因此综合考察年报问询函和回函的经济后果更具意义。本文通过对2015-2020年间年报问询函及上市公司相应回函的文本分析构建了“答非所问”程度指数并实证考察了其对审计费用的影响，结果发现，“答非所问”程度指数越高，上市公司支付的审计费用越高，进而表明，有针对性的释疑能够降低审计费用，回函质量的高低直接影响上市公司因问询函而支付的审计费用“溢价”。该现象受到一系列公司内外部特征的影响，相对于问询函回函长度越长、内部治理水平和外部制度环境越差，审计费用受“答非所问”程度影响而提升得越明显。本文从审计费用的视角证实了高质量的回函对上市公司发挥了积极作用。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二为什么使用软余弦相似度测量答非所问&#34;&gt;二、为什么使用「软余弦相似度」测量「答非所问」&lt;/h2&gt;
&lt;p&gt;「答非所问」的测量，本质上就是计算问题与回答两个文本的相似程度：相似程度越低，答非所问程度越高。但问答是一种特殊的场景，直接使用普通余弦相似度测量会很不准确，目前主要使用软余弦相似度。原因有以下几点：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;考虑语义关系：软余弦相似度能够考虑词语之间的语义关系，而在问询和业绩说明会问答环节中，问题和答案之间可能存在词语的近义词、同义词以及语义相似的情况。软余弦相似度通过使用词向量来捕捉词语的语义信息，能够更好地度量问题和答案之间的语义相似度，从而更准确地判断它们之间的相似程度。&lt;/li&gt;
&lt;li&gt;考虑词语权重：软余弦相似度通常使用TF-IDF来计算词语的权重，这能够在计算相似度时对词语进行加权，更加准确地反映词语在问题和答案中的重要性。在问询和业绩说明会问答环节中，问题和答案中的词语可能具有不同的重要性，某些关键词可能对于判断相似度起着重要作用。软余弦相似度能够考虑这种权重差异，从而更好地衡量问题和答案之间的相似度。&lt;/li&gt;
&lt;li&gt;考虑词语变体和同义词：在问询和业绩说明会问答环节中，问题和答案之间可能存在词语的变体或同义词。软余弦相似度在词向量的计算过程中，能够通过训练语料库中的上下文信息，将相似的词语映射到相似的向量表示，从而能够更好地处理词语的变体和同义词，提高相似度计算的准确性。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;综上所述，软余弦相似度在问询和业绩说明会问答环节的相似度计算中具有优势，能够更好地考虑语义关系、词语权重以及词语变体和同义词等因素，从而提高问答相似度的准确性和可靠性。&lt;/p&gt;
&lt;p&gt;需要注意， &lt;strong&gt;答非所问程度 = 1 - 软余弦相似度&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
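软余弦相似度的思路可以用一个极小的数值例子来体会（下面的词表与相似度矩阵均为假想值，仅作演示）：问题与回答没有任何共同词，普通余弦相似度为 0，但软余弦借助词间相似度矩阵 S 仍能给出一个正的相似度。

```python
import numpy as np

# 假想词表: [品牌, 渠道, 竞争力]; S[i][j] 为词 i 与词 j 的语义相似度(假想值)
S = np.array([[1.0, 0.3, 0.6],
              [0.3, 1.0, 0.2],
              [0.6, 0.2, 1.0]])

x = np.array([0.0, 0.0, 1.0])  # 问题的词频向量: 只出现"竞争力"
y = np.array([1.0, 1.0, 0.0])  # 回答的词频向量: 出现"品牌""渠道"

# 普通余弦的分子 x·y 为 0(无共同词); 软余弦用 S 捕捉词间语义关联
soft_cos = (x @ S @ y) / (np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y))
print(round(float(soft_cos), 4))  # 0.4961
```

可以看到，即便问答零词重叠，软余弦仍给出约 0.5 的相似度，这正是它比普通余弦更适合问答场景的原因。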
&lt;h2 id=&#34;三答非所问代码&#34;&gt;三、「答非所问」代码&lt;/h2&gt;
&lt;p&gt;文件树结构&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;软余弦相似度-答非所问
 |--Word2Vec
    |--mda01-23.200.6.bin
    |--mda01-23.200.6.bin.vectors.npy
    |--mda01-23.200.6.bin.syn1neg.npy
 |--问答数据.csv
 |--代码.ipynb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;p&gt;除 &lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;&lt;strong&gt;Word2Vec/mda01-23.200.6.bin&lt;/strong&gt;&lt;/a&gt; 是付费内容，其余内容均是公开的。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;100元   Word2Vec相关模型文件(mda01-23.200.6.bin)

加微信 372335839， 备注「姓名-学校-专业-word2vec」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;31-环境准备&#34;&gt;3.1 环境准备&lt;/h3&gt;
&lt;p&gt;打开命令行， 执行以下安装命令&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;pip3 install gensim==4.3.2
pip3 install jieba==0.42.1
pip3 install pandas==2.0.3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;h3 id=&#34;32-计算一个问答答非所问程度&#34;&gt;3.2 计算一个问答「答非所问程度」&lt;/h3&gt;
&lt;p&gt;谷歌搜索「&lt;strong&gt;soft cosine similarity&lt;/strong&gt;」能找到相关代码。我在gensim提供的英文文本「软余弦相似度」示例的基础上，将其修改适配为中文文本的代码。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;如果问答数据量很大，可以把所有文本汇总到一个txt中，训练出对应的word2vec模型&lt;/strong&gt;。这里大邓偷懒，直接使用一个用财经领域语料训练出的word2vec模型。之前分享过 &lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;词向量(付费) | 使用MD&amp;amp;A2001-2023语料训练Word2Vec模型&lt;/a&gt;，购买后可得到财经语料的Word2Vec模型文件 &lt;strong&gt;mda01-23.200.6.bin&lt;/strong&gt;。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.corpora&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dictionary&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TfidfModel&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.similarities&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SparseTermSimilarityMatrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;WordEmbeddingSimilarityIndex&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;KeyedVectors&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;




&lt;span class=&#34;c1&#34;&gt;#导入预训练word2vec模型&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;KeyedVectors&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word2Vec/mda01-23.200.6.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#软余弦相似度&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;soft_cosine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;question&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;answer&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc2bow&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;TFIDF&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TfidfModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;termsim_index&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;WordEmbeddingSimilarityIndex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;termsim_matrix&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SparseTermSimilarityMatrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;termsim_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TFIDF&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;similarity&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;termsim_matrix&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inner_product&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;normalized&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;similarity&lt;/span&gt;


&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;question&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公司的核心竞争力?&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
       &lt;span class=&#34;s1&#34;&gt;&amp;#39;answer&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;企业未来的发力肯定是围绕品牌和渠道发力，品牌又是重中之重．&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#该问答的软余弦相似度&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;soft_cosine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;0.17236403
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;该问答的软余弦相似度为0.17236403，则答非所问程度为 1 - 0.17236403 = 0.82763597。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;需要注意， &lt;strong&gt;答非所问程度 = 1 - 软余弦相似度&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;h3 id=&#34;33-计算多个问答答非所问程度&#34;&gt;3.3 计算多个问答「答非所问程度」&lt;/h3&gt;
&lt;p&gt;点击下载本文实验数据 &lt;a href=&#34;%E9%97%AE%E7%AD%94%E6%95%B0%E6%8D%AE.csv&#34;&gt;&lt;em&gt;&lt;strong&gt;问答数据.csv&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-Python&#34; data-lang=&#34;Python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#问答实验数据&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;问答数据.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/02-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;建议各位将问答数据整理为csv或xlsx格式：第一列为question字段，第二列为answer字段，并保证字段名与本代码一致。&lt;/p&gt;
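在运行下文基于 Word2Vec 的完整代码之前，可以先用一个不依赖模型文件的占位函数把按行 apply 的流程跑通（下面的 soft_cosine_stub 为假想的占位实现，用字符级 Jaccard 相似度代替软余弦，仅演示写法，实际计算请替换为基于 Word2Vec 的 soft_cosine）：

```python
import pandas as pd

def soft_cosine_stub(row):
    # 占位函数: 字符级 Jaccard 相似度, 仅演示按行 apply 的流程
    q = set(row['question'])
    a = set(row['answer'])
    return len(q.intersection(a)) / len(q.union(a))

# 假想的一条问答记录, 字段名与正文约定一致
df = pd.DataFrame({'question': ['公司的核心竞争力?'],
                   'answer': ['品牌和渠道是重中之重']})

# 答非所问程度 = 1 - 相似度, 按行 apply 得到新列
df['答非所问程度'] = 1 - df.apply(soft_cosine_stub, axis=1)
print(df['答非所问程度'].tolist())
```

跑通流程后，把 soft_cosine_stub 换成真正的 soft_cosine 函数即可批量得到每条问答的答非所问程度。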
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.corpora&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dictionary&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TfidfModel&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.similarities&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SparseTermSimilarityMatrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;WordEmbeddingSimilarityIndex&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;gensim.models&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;KeyedVectors&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#导入预训练word2vec模型&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;KeyedVectors&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;Word2Vec/mda01-23.200.6.bin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#软余弦相似度&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;soft_cosine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;question&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;answer&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dictionary&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc2bow&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;doc&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;docs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;TFIDF&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TfidfModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;termsim_index&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;WordEmbeddingSimilarityIndex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;w2v_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;termsim_matrix&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SparseTermSimilarityMatrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;termsim_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;DICTION&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TFIDF&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;similarity&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;termsim_matrix&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;inner_product&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;docs2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;normalized&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;similarity&lt;/span&gt;
  


&lt;span class=&#34;c1&#34;&gt;#批量计算&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_csv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;问答数据.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;similarity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;soft_cosine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;axis&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;答非所问&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;similarity&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/03-df.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;四资料获取&#34;&gt;四、资料获取&lt;/h2&gt;
&lt;p&gt;除 &lt;strong&gt;&lt;a href=&#34;https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/&#34;&gt;Word2Vec/mda01-23.200.6.bin&lt;/a&gt;&lt;/strong&gt; 是付费内容，其余内容均是公开的。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;100元   Word2Vec相关模型文件(mda01-23.200.6.bin)

加微信 372335839， 备注「姓名-学校-专业-word2vec」
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一答非所问相关文献">一、「答非所问」相关文献</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">问:“公司的核心竞争力?”
答:“企业未来的发力肯定是围绕品牌和渠道发力，品牌又是重中之重．”
</code></pre></div><p>回答与问题之间的相似度越高，回答与问题就越契合，回答质量也就越高。因此，在会计财经领域的研究中，<strong>答非所问程度</strong>是一个很有使用价值的指标。</p>
<p>卞世博,管之凡,阎志鹏.<strong>答非所问与市场反应:基于业绩说明会的研究</strong>[J].管理科学学报,2021,24(04):109-126.</p>
<blockquote>
<p>摘要:对上市公司业绩说明会中投资者与管理层问答互动中管理层答非所问的现象进行了研究.本文以中小板和创业板上市公司召开的业绩说明会作为研究样本,利用文本分析方法对业绩说明会中管理层在回答投资者提问时答非所问的程度进行度量,进而实证分析了管理层的答非所问与市场反应和公司未来业绩表现之间的可能关联.结果发现:在控制其它因素之后,管理层的答非所问与市场反应之间呈现显著的负相关关系,即公司管理层的答非所问程度越高,随后公司股票的市场表现则就会越差,并且对于那些低分析师关注的公司尤为明显;而在公司未来业绩表现方面,管理层答非所问的程度越高,则公司未来的业绩表现则会越差.</p>
</blockquote>
<br>
<p>郭照蕊,袁嘉浩,傅毅.<strong>上市公司“答非所问”程度与审计费用——基于年报问询函与回函的综合研究</strong>[J].审计研究,2023,No.231(01):99-111.</p>
<blockquote>
<p>摘要:年报问询函是证券交易所向年报披露存疑的上市公司发出的函件，有问有答才构成一次有效的问询回合，因此综合考察年报问询函和回函的经济后果更具意义。本文通过对2015-2020年间年报问询函及上市公司相应回函的文本分析构建了“答非所问”程度指数并实证考察了其对审计费用的影响，结果发现，“答非所问”程度指数越高，上市公司支付的审计费用越高，进而表明，有针对性的释疑能够降低审计费用，回函质量的高低直接影响上市公司因问询函而支付的审计费用“溢价”。该现象受到一系列公司内外部特征的影响，相对于问询函回函长度越长、内部治理水平和外部制度环境越差，审计费用受“答非所问”程度影响而提升得越明显。本文从审计费用的视角证实了高质量的回函对上市公司发挥了积极作用。</p>
</blockquote>
<p><br><br></p>
<h2 id="二为什么使用软余弦相似度测量答非所问">二、为什么使用「软余弦相似度」测量「答非所问」</h2>
<p>「软余弦相似度」衡量的本质上就是两个文本的相似程度：相似程度越低，答非所问程度越高。但问答是一种特殊场景，直接使用普通余弦相似度往往不够准确，目前主要使用软余弦相似度进行测量，原因有以下几点：</p>
<ol>
<li>考虑语义关系：软余弦相似度能够考虑词语之间的语义关系，而在问询和业绩说明会问答环节中，问题和答案之间可能存在词语的近义词、同义词以及语义相似的情况。软余弦相似度通过使用词向量来捕捉词语的语义信息，能够更好地度量问题和答案之间的语义相似度，从而更准确地判断它们之间的相似程度。</li>
<li>考虑词语权重：软余弦相似度通常使用TF-IDF来计算词语的权重，这能够在计算相似度时对词语进行加权，更加准确地反映词语在问题和答案中的重要性。在问询和业绩说明会问答环节中，问题和答案中的词语可能具有不同的重要性，某些关键词可能对于判断相似度起着重要作用。软余弦相似度能够考虑这种权重差异，从而更好地衡量问题和答案之间的相似度。</li>
<li>考虑词语变体和同义词：在问询和业绩说明会问答环节中，问题和答案之间可能存在词语的变体或同义词。软余弦相似度在词向量的计算过程中，能够通过训练语料库中的上下文信息，将相似的词语映射到相似的向量表示，从而能够更好地处理词语的变体和同义词，提高相似度计算的准确性。</li>
</ol>
<p>综上所述，软余弦相似度在问询和业绩说明会问答环节的相似度计算中具有优势，能够更好地考虑语义关系、词语权重以及词语变体和同义词等因素，从而提高问答相似度的准确性和可靠性。</p>
<p>需要注意， <strong>答非所问程度 = 1 - 软余弦相似度</strong></p>
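<p>为直观理解普通余弦相似度为何在问答场景下失效，下面给出一个纯 Python 的最小示意（分词结果为手工给出，未调用 jieba；数据即本文开头的问答对）。问句与答句几乎只共享虚词「的」，词袋余弦相似度因此接近 0，完全体现不出「竞争力」与「品牌」「渠道」之间的语义关联：</p>

```python
# 词袋 + 普通余弦相似度的最小示意（分词结果为手工假设）
from collections import Counter
import math

def bow_cosine(words1, words2):
    c1, c2 = Counter(words1), Counter(words2)
    # 只有两个文本共有的词才对内积有贡献
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

question = ['公司', '的', '核心', '竞争力']
answer = ['企业', '未来', '的', '发力', '肯定', '是', '围绕', '品牌',
          '和', '渠道', '发力', '品牌', '又', '是', '重中之重']

print(round(bow_cosine(question, answer), 4))  # 仅靠共享的「的」贡献相似度，约 0.1091
```

<p>而软余弦相似度借助词向量，把「竞争力」与「品牌」「渠道」这类语义相近的词对也计入相似度，因此能得到更合理的取值。</p>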
<p><br><br></p>
<h2 id="三答非所问代码">三、「答非所问」代码</h2>
<p>文件树结构</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">软余弦相似度-答非所问
 |--Word2Vec
    |--mda01-23.200.6.bin
    |--mda01-23.200.6.bin.vectors.npy
    |--mda01-23.200.6.bin.syn1neg.npy
 |--问答数据.csv
 |--代码.ipynb
</code></pre></div><br>
<p>除 <strong><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">Word2Vec/mda01-23.200.6.bin</a></strong> 是付费内容，其余内容均是公开的。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">100元   Word2Vec相关模型文件(mda01-23.200.6.bin)

加微信 372335839， 备注「姓名-学校-专业-word2vec」
</code></pre></div><br>
<h3 id="31-环境准备">3.1 环境准备</h3>
<p>打开命令行， 执行以下安装命令</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">pip3 install gensim==4.3.2
pip3 install jieba==0.42.1
pip3 install pandas==2.0.3
</code></pre></div><br>
<h3 id="32-计算一个问答答非所问程度">3.2 计算一个问答「答非所问程度」</h3>
<p>在谷歌搜索「<strong>soft cosine similarity</strong>」可以找到相关代码。我参考 gensim 官方面向英文文本的「软余弦相似度」示例，将其改写适配为中文代码。</p>
<p><img loading="lazy" src="img/01-screen.png" alt=""  />
</p>
<p><strong>如果问答数据量很大，可以把所有文本汇总到一个txt中，训练出对应的word2vec模型</strong>。这里大邓偷懒，直接使用一个以财经领域语料训练出的word2vec模型。之前分享过 <a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">词向量(付费) | 使用MD&amp;A2001-2023语料训练Word2Vec模型</a>，购买后可得到财经语料的 Word2Vec 模型文件 <strong>mda01-23.200.6.bin</strong>。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">gensim.corpora</span> <span class="kn">import</span> <span class="n">Dictionary</span>
<span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">TfidfModel</span>
<span class="kn">from</span> <span class="nn">gensim.similarities</span> <span class="kn">import</span> <span class="n">SparseTermSimilarityMatrix</span><span class="p">,</span> <span class="n">WordEmbeddingSimilarityIndex</span>
<span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">KeyedVectors</span>
<span class="kn">import</span> <span class="nn">jieba</span>




<span class="c1">#导入预训练word2vec模型</span>
<span class="n">w2v_model</span> <span class="o">=</span> <span class="n">KeyedVectors</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;Word2Vec/mda01-23.200.6.bin&#39;</span><span class="p">)</span>

<span class="c1">#软余弦相似度</span>
<span class="k">def</span> <span class="nf">soft_cosine</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
    <span class="n">question</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">&#39;question&#39;</span><span class="p">])</span>
    <span class="n">answer</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">&#39;answer&#39;</span><span class="p">])</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">]</span>
    <span class="n">DICTION</span> <span class="o">=</span> <span class="n">Dictionary</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
    <span class="n">docs2</span> <span class="o">=</span> <span class="p">[</span><span class="n">DICTION</span><span class="o">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">]</span>
    <span class="n">TFIDF</span> <span class="o">=</span> <span class="n">TfidfModel</span><span class="p">(</span><span class="n">docs2</span><span class="p">)</span>
    <span class="n">termsim_index</span> <span class="o">=</span> <span class="n">WordEmbeddingSimilarityIndex</span><span class="p">(</span><span class="n">w2v_model</span><span class="o">.</span><span class="n">wv</span><span class="p">)</span>
    <span class="n">termsim_matrix</span> <span class="o">=</span> <span class="n">SparseTermSimilarityMatrix</span><span class="p">(</span><span class="n">termsim_index</span><span class="p">,</span> <span class="n">DICTION</span><span class="p">,</span> <span class="n">TFIDF</span><span class="p">)</span>
    <span class="n">similarity</span> <span class="o">=</span> <span class="n">termsim_matrix</span><span class="o">.</span><span class="n">inner_product</span><span class="p">(</span><span class="n">docs2</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">docs2</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">normalized</span><span class="o">=</span><span class="p">(</span><span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">similarity</span>


<span class="n">row</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;question&#39;</span><span class="p">:</span><span class="s1">&#39;公司的核心竞争力?&#39;</span><span class="p">,</span> 
       <span class="s1">&#39;answer&#39;</span><span class="p">:</span> <span class="s1">&#39;企业未来的发力肯定是围绕品牌和渠道发力，品牌又是重中之重．&#39;</span><span class="p">}</span>


<span class="c1">#该问答的软余弦相似度</span>
<span class="n">soft_cosine</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>

</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">0.17236403
</code></pre></div><p>该问答的软余弦相似度为 0.17236403，则答非所问程度为 1 - 0.17236403 = 0.82763597。</p>
<blockquote>
<p>需要注意， <strong>答非所问程度 = 1 - 软余弦相似度</strong></p>
</blockquote>
<br>
<h3 id="33-计算多个问答答非所问程度">3.3 计算多个问答「答非所问程度」</h3>
<p>点击下载本文实验数据 <a href="%E9%97%AE%E7%AD%94%E6%95%B0%E6%8D%AE.csv"><em><strong>问答数据.csv</strong></em></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-Python" data-lang="Python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="c1">#问答实验数据</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;问答数据.csv&#39;</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/02-df.png" alt=""  />
</p>
<br>
<p>建议各位将问答数据整理为 csv 或 xlsx 格式：第一列为 question 字段，第二列为 answer 字段，并保证字段名与本代码一致。</p>
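<p>如果手头的原始数据还不是这种两列格式，可参考下面的 pandas 示意自行整理并导出（其中第二条问答为虚构的示例数据）：</p>

```python
# 示意：把问答对整理为 question、answer 两列并导出 csv（第二条问答为虚构示例）
import pandas as pd

qa_pairs = [
    ('公司的核心竞争力?', '企业未来的发力肯定是围绕品牌和渠道发力，品牌又是重中之重．'),
    ('公司今年是否有分红计划?', '感谢您的关注，请留意公司后续公告。'),  # 虚构示例
]
df = pd.DataFrame(qa_pairs, columns=['question', 'answer'])
df.to_csv('问答数据.csv', index=False, encoding='utf-8-sig')
print(df.shape)  # (2, 2)
```

<p>导出时使用 utf-8-sig 编码，用 Excel 打开中文不会乱码。</p>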
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">gensim.corpora</span> <span class="kn">import</span> <span class="n">Dictionary</span>
<span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">TfidfModel</span>
<span class="kn">from</span> <span class="nn">gensim.similarities</span> <span class="kn">import</span> <span class="n">SparseTermSimilarityMatrix</span><span class="p">,</span> <span class="n">WordEmbeddingSimilarityIndex</span>
<span class="kn">from</span> <span class="nn">gensim.models</span> <span class="kn">import</span> <span class="n">KeyedVectors</span>
<span class="kn">import</span> <span class="nn">jieba</span>

<span class="c1">#导入预训练word2vec模型</span>
<span class="n">w2v_model</span> <span class="o">=</span> <span class="n">KeyedVectors</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;Word2Vec/mda01-23.200.6.bin&#39;</span><span class="p">)</span>


<span class="c1">#软余弦相似度</span>
<span class="k">def</span> <span class="nf">soft_cosine</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
    <span class="n">question</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">&#39;question&#39;</span><span class="p">])</span>
    <span class="n">answer</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">&#39;answer&#39;</span><span class="p">])</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">]</span>
    <span class="n">DICTION</span> <span class="o">=</span> <span class="n">Dictionary</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
    <span class="n">docs2</span> <span class="o">=</span> <span class="p">[</span><span class="n">DICTION</span><span class="o">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">]</span>
    <span class="n">TFIDF</span> <span class="o">=</span> <span class="n">TfidfModel</span><span class="p">(</span><span class="n">docs2</span><span class="p">)</span>
    <span class="n">termsim_index</span> <span class="o">=</span> <span class="n">WordEmbeddingSimilarityIndex</span><span class="p">(</span><span class="n">w2v_model</span><span class="o">.</span><span class="n">wv</span><span class="p">)</span>
    <span class="n">termsim_matrix</span> <span class="o">=</span> <span class="n">SparseTermSimilarityMatrix</span><span class="p">(</span><span class="n">termsim_index</span><span class="p">,</span> <span class="n">DICTION</span><span class="p">,</span> <span class="n">TFIDF</span><span class="p">)</span>
    <span class="n">similarity</span> <span class="o">=</span> <span class="n">termsim_matrix</span><span class="o">.</span><span class="n">inner_product</span><span class="p">(</span><span class="n">docs2</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">docs2</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">normalized</span><span class="o">=</span><span class="p">(</span><span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">similarity</span>
  


<span class="c1">#批量计算</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;问答数据.csv&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;similarity&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">soft_cosine</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;答非所问&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;similarity&#39;</span><span class="p">]</span>

<span class="n">df</span>
</code></pre></div><p><img loading="lazy" src="img/03-df.png" alt=""  />
</p>
<p><br><br></p>
<h2 id="四资料获取">四、资料获取</h2>
<p>除 <strong><a href="https://textdata.cn/blog/2023-03-24-load-w2v-and-expand-your-concpet-dicitonary/">Word2Vec/mda01-23.200.6.bin</a></strong> 是付费内容，其余内容均是公开的。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">100元   Word2Vec相关模型文件(mda01-23.200.6.bin)

加微信 372335839， 备注「姓名-学校-专业-word2vec」
</code></pre></div><p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>可视化 | 微博用户群体情绪随时间变化趋势</title>
      <link>https://textdata.cn/blog/2023-05-18-weibo-sentiment-score-line-plot/</link>
      <pubDate>Thu, 18 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-18-weibo-sentiment-score-line-plot/</guid>
      <description>&lt;p&gt;DataFrame数据如何绘制按时间趋势的折线图，今天以weibo数据集为例，绘制微博文本内容折线图&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;微博文本内容「平均长度随时间变化」&lt;/li&gt;
&lt;li&gt;微博文本内容「平均情感分值随时间变化」&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_8_0.svg&#34; alt=&#34;svg&#34;  /&gt;

&lt;img loading=&#34;lazy&#34; src=&#34;img/output_12_0.svg&#34; alt=&#34;svg&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;一准备工作&#34;&gt;一、准备工作&lt;/h2&gt;
&lt;h3 id=&#34;11-下载数据集&#34;&gt;1.1 下载数据集&lt;/h3&gt;
&lt;p&gt;数据集下载链接 &lt;a href=&#34;https://www.kaggle.com/datasets/dylanli/weibo-content-during-covid19-period&#34;&gt;https://www.kaggle.com/datasets/dylanli/weibo-content-during-covid19-period&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/download.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;含8个json文件&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;user1.json、user2.json、user3.json、user4.json&lt;/li&gt;
&lt;li&gt;weibo1.json、weibo2.json、weibo3.json、weibo4.json&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;这里仅尝试读取weibo1.json  &lt;br&gt;&lt;/p&gt;
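&lt;p&gt;如果想把四个 weibo*.json 一次性读入并合并，可参考下面的 glob 示意（为了能自包含运行，示例先在临时目录写入两份微型 json 文件，其字段只是真实数据的一个假设子集）：&lt;/p&gt;

```python
# 示意：glob 匹配多个 weibo*.json，逐个读入后合并为一个 DataFrame
import glob, json, os, tempfile
import pandas as pd

# 为了让示例可独立运行，先写入两份微型示例文件（字段为假设子集）
tmp = tempfile.mkdtemp()
samples = [[{'text': '第一条微博', 'created_at': '2020-02-01'}],
           [{'text': '第二条微博', 'created_at': '2020-02-02'}]]
for i, rows in enumerate(samples, start=1):
    with open(os.path.join(tmp, f'weibo{i}.json'), 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False)

# 实际使用时把 tmp 换成数据集所在目录即可
paths = sorted(glob.glob(os.path.join(tmp, 'weibo*.json')))
weibo_df = pd.concat([pd.read_json(p) for p in paths], ignore_index=True)
print(len(weibo_df))  # 2
```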
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;listdir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;[&#39;weibo2.json&#39;,
 &#39;.DS_Store&#39;,
 &#39;weibo3.json&#39;,
 &#39;Untitled.ipynb&#39;,
 &#39;weibo4.json&#39;,
 &#39;user1.json&#39;,
 &#39;user2.json&#39;,
 &#39;说明.md&#39;,
 &#39;user3.json&#39;,
 &#39;.ipynb_checkpoints&#39;,
 &#39;user4.json&#39;,
 &#39;weibo1.json&#39;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;12-导入数据&#34;&gt;1.2 导入数据&lt;/h3&gt;
&lt;p&gt;导入 7138 个微博用户的数据后，查看&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;数据量&lt;/li&gt;
&lt;li&gt;字段的数据类型&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_json&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;weibo1.json&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#记录数&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;560840
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#字段的数据类型&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dtypes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;_id                        object
user_id                    object
screen_name                object
id                         object
bid                        object
text                       object
pics                       object
video_url                  object
location                   object
created_at         datetime64[ns]
source                     object
attitudes_count             int64
comments_count              int64
reposts_count               int64
topics                     object
at_users                   object
retweet                    object
dtype: object
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二数据分析&#34;&gt;二、数据分析&lt;/h2&gt;
&lt;p&gt;绘制微博内容&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;平均长度随时间变化&lt;/li&gt;
&lt;li&gt;平均情感分值随时间变化&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;21-平均长度随时间变化&#34;&gt;2.1 平均长度随时间变化&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;



&lt;span class=&#34;c1&#34;&gt;# 统计人均字符长度变化&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#df[&amp;#39;text_length&amp;#39;] = weibo_df[&amp;#39;text&amp;#39;].apply(lambda x: len(x))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text_length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_avg_length&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created_at&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text_length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 绘制人均字符长度变化图&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_avg_length&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created_at&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df_avg_length&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text_length&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;微博内容平均长度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;微博内容平均长度随时间变化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xticks&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rotation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;45&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_8_0.svg&#34; alt=&#34;svg&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-平均情感分值随时间变化&#34;&gt;2.2 平均情感分值随时间变化&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;设计情感计算函数senti_score&lt;/li&gt;
&lt;li&gt;测试一条文本的情感计算实验&lt;/li&gt;
&lt;li&gt;推广到所有weibo内容的情感计算&lt;/li&gt;
&lt;li&gt;参考「平均长度随时间变化」，绘制「平均情感分值随时间变化」&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#pip3 install cntext==1.9.2&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;cntext&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ct&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;jieba&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;#使用知网Hownet情感词典&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;pos_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_pkl_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HOWNET.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HOWNET&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;pos&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;neg_words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_pkl_dict&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HOWNET.pkl&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;HOWNET&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;neg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;


&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;senti_score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;pos&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;neg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;jieba&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lcut&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pos_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;pos&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pos&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;word&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;neg_words&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;neg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;neg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#(pos-neg)/(pos+neg)即可，为防止分母为0，特加1&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pos&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;neg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pos&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;neg&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    
    
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;senti_score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;我很开心！&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;senti_score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;我很难过！&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;0.5
-0.5
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;560840
&lt;/code&gt;&lt;/pre&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib_inline&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;backend_inline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;set_matplotlib_formats&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;png&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;svg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scienceplots&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;platform&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 一共有 560840 条微博，这部分代码运算量较大；文中的情感变化图是按 1% 随机抽样绘制的，如需复现可取消下面两行注释。&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#np.random.seed(666)&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#weibo_df = weibo_df.sample(frac=0.01)&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;style&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;science&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;no-latex&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;cjk-sc-font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;platform&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 获取操作系统类型&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Windows&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;SimHei&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;system&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Darwin&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Arial Unicode MS&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;font&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;family&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sans-serif&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;matplotlib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;font&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;font&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# 设置全局字体&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;# 统计平均情感分值&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;senti&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;senti_score&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df_senti_avg_length&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;weibo_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created_at&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;senti&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mean&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# 绘制平均情感分值随时间变化&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figure&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;figsize&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;plot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_senti_avg_length&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;created_at&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df_senti_avg_length&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;senti&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xlabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ylabel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;微博内容平均情感分值&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;微博内容平均情感分值随时间变化&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;xticks&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rotation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;45&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;plt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;show&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/output_12_0.svg&#34; alt=&#34;svg&#34;  /&gt;

&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
<content:encoded><![CDATA[<p>如何将 DataFrame 数据绘制成随时间变化的折线图？本文以 weibo 数据集为例，绘制微博文本内容的两类折线图：</p>
<ol>
<li>微博文本内容「平均长度随时间变化」</li>
<li>微博文本内容「平均情感分值随时间变化」</li>
</ol>
<p><img loading="lazy" src="img/output_8_0.svg" alt="svg"  />

<img loading="lazy" src="img/output_12_0.svg" alt="svg"  />
</p>
<p><br><br></p>
<h2 id="一准备工作">一、准备工作</h2>
<h3 id="11-下载数据集">1.1 下载数据集</h3>
<p>数据集下载链接 <a href="https://www.kaggle.com/datasets/dylanli/weibo-content-during-covid19-period">https://www.kaggle.com/datasets/dylanli/weibo-content-during-covid19-period</a></p>
<p><img loading="lazy" src="img/download.png" alt=""  />
</p>
<br>
<p>含8个json文件</p>
<ul>
<li>user1.json、user2.json、user3.json、user4.json</li>
<li>weibo1.json、weibo2.json、weibo3.json、weibo4.json</li>
</ul>
<p>这里仅尝试读取weibo1.json  <br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>

<span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">()</span>
</code></pre></div><pre><code>['weibo2.json',
 '.DS_Store',
 'weibo3.json',
 'Untitled.ipynb',
 'weibo4.json',
 'user1.json',
 'user2.json',
 '说明.md',
 'user3.json',
 '.ipynb_checkpoints',
 'user4.json',
 'weibo1.json']
</code></pre>
<p><br><br></p>
<h3 id="12-导入数据">1.2 导入数据</h3>
<p>导入 7138 位微博用户的微博数据后，查看</p>
<ol>
<li>数据量</li>
<li>字段的数据类型</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">weibo_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s1">&#39;weibo1.json&#39;</span><span class="p">)</span>
<span class="n">weibo_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#记录数</span>
<span class="nb">len</span><span class="p">(</span><span class="n">weibo_df</span><span class="p">)</span>
</code></pre></div><pre><code>560840
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#字段的数据类型</span>
<span class="n">weibo_df</span><span class="o">.</span><span class="n">dtypes</span>
</code></pre></div><pre><code>_id                        object
user_id                    object
screen_name                object
id                         object
bid                        object
text                       object
pics                       object
video_url                  object
location                   object
created_at         datetime64[ns]
source                     object
attitudes_count             int64
comments_count              int64
reposts_count               int64
topics                     object
at_users                   object
retweet                    object
dtype: object
</code></pre>
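<p>补充一点：dtypes 显示 created_at 是 datetime64[ns] 类型。下文会直接 groupby('created_at')；若该字段精确到时分秒，这样会按「时刻」而不是「日期」分组，可先用 dt.floor('D') 把时间戳取整到天。下面用虚构的小数据演示这一处理（字段名沿用正文，数据为示例）：</p>

```python
import pandas as pd

# 虚构三条带时间戳的微博，created_at 精确到秒
df = pd.DataFrame({
    'created_at': pd.to_datetime(['2020-01-01 08:00:05',
                                  '2020-01-01 21:30:00',
                                  '2020-01-02 09:15:42']),
    'text': ['今天天气不错', '晚安', '早上好呀'],
})
df['text_length'] = df['text'].str.len()

# dt.floor('D') 把时间戳取整到当天零点，保证按天汇总
df['date'] = df['created_at'].dt.floor('D')
daily_avg = df.groupby('date')['text_length'].mean().reset_index()
print(daily_avg)
```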
<p><br><br></p>
<h2 id="二数据分析">二、数据分析</h2>
<p>绘制微博内容</p>
<ol>
<li>平均长度随时间变化</li>
<li>平均情感分值随时间变化</li>
</ol>
<h3 id="21-平均长度随时间变化">2.1 平均长度随时间变化</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>

<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>



<span class="c1"># 统计人均字符长度变化</span>
<span class="c1">#df[&#39;text_length&#39;] = weibo_df[&#39;text&#39;].apply(lambda x: len(x))</span>
<span class="n">weibo_df</span><span class="p">[</span><span class="s1">&#39;text_length&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">weibo_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
<span class="n">df_avg_length</span> <span class="o">=</span> <span class="n">weibo_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;created_at&#39;</span><span class="p">)[</span><span class="s1">&#39;text_length&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

<span class="c1"># 绘制人均字符长度变化图</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_avg_length</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">],</span> <span class="n">df_avg_length</span><span class="p">[</span><span class="s1">&#39;text_length&#39;</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;日期&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;微博内容平均长度&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;微博内容平均长度随时间变化&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_8_0.svg" alt="svg"  />
</p>
<br>
<h3 id="22-平均情感分值随时间变化">2.2 平均情感分值随时间变化</h3>
<ol>
<li>设计情感计算函数senti_score</li>
<li>测试一条文本的情感计算实验</li>
<li>推广到所有weibo内容的情感计算</li>
<li>参考「平均长度随时间变化」，绘制「平均情感分值随时间变化」</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="c1">#pip3 install cntext==1.9.2</span>
<span class="kn">import</span> <span class="nn">cntext</span> <span class="k">as</span> <span class="nn">ct</span>
<span class="kn">import</span> <span class="nn">jieba</span>

<span class="c1">#使用知网Hownet情感词典</span>
<span class="n">pos_words</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_pkl_dict</span><span class="p">(</span><span class="s1">&#39;HOWNET.pkl&#39;</span><span class="p">)[</span><span class="s1">&#39;HOWNET&#39;</span><span class="p">][</span><span class="s1">&#39;pos&#39;</span><span class="p">]</span>
<span class="n">neg_words</span> <span class="o">=</span> <span class="n">ct</span><span class="o">.</span><span class="n">load_pkl_dict</span><span class="p">(</span><span class="s1">&#39;HOWNET.pkl&#39;</span><span class="p">)[</span><span class="s1">&#39;HOWNET&#39;</span><span class="p">][</span><span class="s1">&#39;neg&#39;</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">senti_score</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">pos</span><span class="p">,</span><span class="n">neg</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">pos_words</span><span class="p">:</span>
            <span class="n">pos</span> <span class="o">=</span> <span class="n">pos</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">if</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">neg_words</span><span class="p">:</span>
            <span class="n">neg</span> <span class="o">=</span> <span class="n">neg</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="c1">#(pos-neg)/(pos+neg)即可，为防止分母为0，特加1</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">pos</span><span class="o">-</span><span class="n">neg</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">pos</span><span class="o">+</span><span class="n">neg</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    
    
<span class="nb">print</span><span class="p">(</span><span class="n">senti_score</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s1">&#39;我很开心！&#39;</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">senti_score</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s1">&#39;我很难过！&#39;</span><span class="p">))</span>
</code></pre></div><pre><code>0.5
-0.5
</code></pre>
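<p>上面 senti_score 的计分公式是 (pos - neg) / (pos + neg + 1)，分母加 1 是为了防止除零。可以脱离词典单独验证这一计分逻辑（仅为演示，不依赖 cntext 与 jieba）：</p>

```python
# 与正文 senti_score 相同的计分公式，pos / neg 为命中的正、负情感词个数
def score(pos: int, neg: int) -> float:
    return (pos - neg) / (pos + neg + 1)

print(score(1, 0))   # 0.5，对应「我很开心！」命中 1 个正向词
print(score(0, 1))   # -0.5，对应「我很难过！」命中 1 个负向词
print(score(0, 0))   # 0.0，不含情感词的文本视为中性
```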
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">weibo_df</span><span class="p">)</span>
</code></pre></div><pre><code>560840
</code></pre>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib_inline</span>
<span class="n">matplotlib_inline</span><span class="o">.</span><span class="n">backend_inline</span><span class="o">.</span><span class="n">set_matplotlib_formats</span><span class="p">(</span><span class="s1">&#39;png&#39;</span><span class="p">,</span> <span class="s1">&#39;svg&#39;</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scienceplots</span>
<span class="kn">import</span> <span class="nn">platform</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1"># 一共有 560840 条微博，这部分代码运算量较大；文中的情感变化图是按 1% 随机抽样绘制的，如需复现可取消下面两行注释。</span>
<span class="c1">#np.random.seed(666)</span>
<span class="c1">#weibo_df = weibo_df.sample(frac=0.01)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">([</span><span class="s1">&#39;science&#39;</span><span class="p">,</span> <span class="s1">&#39;no-latex&#39;</span><span class="p">,</span> <span class="s1">&#39;cjk-sc-font&#39;</span><span class="p">])</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">platform</span><span class="o">.</span><span class="n">system</span><span class="p">()</span>  <span class="c1"># 获取操作系统类型</span>

<span class="k">if</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Windows&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;SimHei&#39;</span><span class="p">}</span>
<span class="k">elif</span> <span class="n">system</span> <span class="o">==</span> <span class="s1">&#39;Darwin&#39;</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;Arial Unicode MS&#39;</span><span class="p">}</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">font</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;family&#39;</span><span class="p">:</span> <span class="s1">&#39;sans-serif&#39;</span><span class="p">}</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rc</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">,</span> <span class="o">**</span><span class="n">font</span><span class="p">)</span>  <span class="c1"># 设置全局字体</span>


<span class="c1"># 统计平均情感分值</span>
<span class="n">weibo_df</span><span class="p">[</span><span class="s1">&#39;senti&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">weibo_df</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">senti_score</span><span class="p">)</span>
<span class="n">df_senti_avg_length</span> <span class="o">=</span> <span class="n">weibo_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;created_at&#39;</span><span class="p">)[</span><span class="s1">&#39;senti&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

<span class="c1"># 绘制平均情感分值随时间变化</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_senti_avg_length</span><span class="p">[</span><span class="s1">&#39;created_at&#39;</span><span class="p">],</span> <span class="n">df_senti_avg_length</span><span class="p">[</span><span class="s1">&#39;senti&#39;</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;日期&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;微博内容平均情感分值&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;微博内容平均情感分值随时间变化&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/output_12_0.svg" alt="svg"  />

</p>
<p><br><br></p>
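上面「按日期统计平均情感分值」的流程，可以用一个自包含的小例子复现（其中 demo_df 为虚构数据，senti 列直接给定，代替 senti_score 的打分结果）：

```python
import pandas as pd

# 虚构的小数据集：senti 列直接给定，代替 senti_score 的打分结果
demo_df = pd.DataFrame({
    'created_at': pd.to_datetime(['2022-01-01'] * 3 + ['2022-01-02'] * 3),
    'senti': [0.2, 0.4, 0.6, 0.1, 0.3, 0.5],
})

# 数据量大时可先随机抽样降低计算量（正文对 56 万条微博取 frac=0.01）
sample_df = demo_df.sample(frac=1.0, random_state=666)

# 按日期分组计算平均情感分值
daily = sample_df.groupby('created_at')['senti'].mean().reset_index()
print(daily)
```

得到的 daily 即可直接用于正文中 plt.plot 的 x、y 两列。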
]]></content:encoded>
    </item>
    
    <item>
      <title>实验数据 | 194城市楼市政策梳理(2010-2022)</title>
      <link>https://textdata.cn/blog/2023-05-17-china-200-city-real-estate-policy/</link>
      <pubDate>Wed, 17 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-17-china-200-city-real-estate-policy/</guid>
      <description>&lt;br&gt;
&lt;h2 id=&#34;楼市数据集&#34;&gt;楼市数据集&lt;/h2&gt;
&lt;p&gt;有热心粉丝分享了她整理的楼市政策文本，含三个文件&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;- 194城市楼市政策梳理2010-2022.xlsx
- 2023年楼市政策(截止2.24).xlsx
- 2022年1-10月楼市政策梳理.xlsx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2011-2022.png&#34; alt=&#34;&#34;  /&gt;

&lt;img loading=&#34;lazy&#34; src=&#34;img/2023%e5%85%a8%e5%9b%bd%e6%94%bf%e7%ad%96.png&#34; alt=&#34;&#34;  /&gt;

&lt;img loading=&#34;lazy&#34; src=&#34;img/2023%e5%9c%b0%e6%96%b9%e6%94%bf%e7%ad%96.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h2 id=&#34;用途&#34;&gt;用途&lt;/h2&gt;
&lt;p&gt;该实验数据集，可用于练习&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;词频统计&lt;/li&gt;
&lt;li&gt;词云图&lt;/li&gt;
&lt;li&gt;相似度计算等。&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;导入数据&#34;&gt;导入数据&lt;/h2&gt;
&lt;p&gt;每个 xlsx 文件含有多个 sheet，可以根据 sheet 名读取对应工作表的数据。&lt;/p&gt;
&lt;p&gt;以 &lt;code&gt;194城市楼市政策梳理2010-2022.xlsx&lt;/code&gt; 为例， 导入表名为&lt;code&gt;宝鸡、保定、北海、北京、常州&lt;/code&gt;的数据。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/2011-2022.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;194城市房产政策梳理2010-2022.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sheet_name&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;宝鸡、保定、北海、北京、常州&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border=&#34;1&#34; class=&#34;dataframe&#34;&gt;
  &lt;thead&gt;
    &lt;tr style=&#34;text-align: right;&#34;&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Unnamed: 0&lt;/th&gt;
      &lt;th&gt;城市名称&lt;/th&gt;
      &lt;th&gt;时间&lt;/th&gt;
      &lt;th&gt;标题&lt;/th&gt;
      &lt;th&gt;政策内容&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;宝鸡&lt;/td&gt;
      &lt;td&gt;2020.11.30&lt;/td&gt;
      &lt;td&gt;限贷政策&lt;/td&gt;
      &lt;td&gt;贷款年限30年，首套≤144㎡，贷款比例＞75%；首套＞144㎡，贷款比例70%；二套≤14...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;宝鸡&lt;/td&gt;
      &lt;td&gt;2022.5.18&lt;/td&gt;
      &lt;td&gt;关于印发推进陕西自由贸易试验区贸易投资便利化改革创新若干措施的通知（土地政策）&lt;/td&gt;
      &lt;td&gt;优先保障自贸试验区合理用地需求，按照土地要素跟着项目走的原则，施行对产业链环节等多宗土地整体...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;保定&lt;/td&gt;
      &lt;td&gt;2015.1.20&lt;/td&gt;
      &lt;td&gt;人才政策&lt;/td&gt;
      &lt;td&gt;户籍制度改革实施意见的提及放开人才落户限制。规定具有初级及以上专业技术职称、高级工（国家职业...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;保定&lt;/td&gt;
      &lt;td&gt;2016.4.17&lt;/td&gt;
      &lt;td&gt;土地政策&lt;/td&gt;
      &lt;td&gt;供地计划对土地供应总量、用途结构、空间布局、土地供应导向等做了详细规定，其中土地供应导向中强...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;保定&lt;/td&gt;
      &lt;td&gt;2016.4.20&lt;/td&gt;
      &lt;td&gt;土地政策&lt;/td&gt;
      &lt;td&gt;严格掌控土地供应，中心城区内经营性用地全部纳入政府储备。&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;城市名称&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;unique&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    array([&amp;#39;宝鸡&amp;#39;, &amp;#39;保定&amp;#39;, &amp;#39;北海&amp;#39;, &amp;#39;北京&amp;#39;, &amp;#39;常州&amp;#39;, &amp;#39;成都&amp;#39;], dtype=object)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;

&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;时间&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;    2010-04-30 00:00:00
    2022-08-04 00:00:00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;数据集获取&#34;&gt;数据集获取&lt;/h2&gt;
&lt;p&gt;链接: &lt;a href=&#34;https://pan.baidu.com/s/13neTAQzuY3wkJzmc1FjwFg&#34;&gt;https://pan.baidu.com/s/13neTAQzuY3wkJzmc1FjwFg&lt;/a&gt; 提取码: w2ra&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<br>
<h2 id="楼市数据集">楼市数据集</h2>
<p>有热心粉丝分享了她整理的楼市政策文本，含三个文件</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">- 194城市楼市政策梳理2010-2022.xlsx
- 2023年楼市政策(截止2.24).xlsx
- 2022年1-10月楼市政策梳理.xlsx
</code></pre></div><p><img loading="lazy" src="img/2011-2022.png" alt=""  />

<img loading="lazy" src="img/2023%e5%85%a8%e5%9b%bd%e6%94%bf%e7%ad%96.png" alt=""  />

<img loading="lazy" src="img/2023%e5%9c%b0%e6%96%b9%e6%94%bf%e7%ad%96.png" alt=""  />
</p>
<br>
<br>
<h2 id="用途">用途</h2>
<p>该实验数据集，可用于练习</p>
<ul>
<li>词频统计</li>
<li>词云图</li>
<li>相似度计算等。</li>
</ul>
<p><br><br></p>
<h2 id="导入数据">导入数据</h2>
<p>每个 xlsx 文件含有多个 sheet，可以根据 sheet 名读取对应工作表的数据。</p>
<p>以 <code>194城市楼市政策梳理2010-2022.xlsx</code> 为例， 导入表名为<code>宝鸡、保定、北海、北京、常州</code>的数据。</p>
<p><img loading="lazy" src="img/2011-2022.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;194城市楼市政策梳理2010-2022.xlsx&#39;</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="s1">&#39;宝鸡、保定、北海、北京、常州&#39;</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Unnamed: 0</th>
      <th>城市名称</th>
      <th>时间</th>
      <th>标题</th>
      <th>政策内容</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>NaN</td>
      <td>宝鸡</td>
      <td>2020.11.30</td>
      <td>限贷政策</td>
      <td>贷款年限30年，首套≤144㎡，贷款比例＞75%；首套＞144㎡，贷款比例70%；二套≤14...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>NaN</td>
      <td>宝鸡</td>
      <td>2022.5.18</td>
      <td>关于印发推进陕西自由贸易试验区贸易投资便利化改革创新若干措施的通知（土地政策）</td>
      <td>优先保障自贸试验区合理用地需求，按照土地要素跟着项目走的原则，施行对产业链环节等多宗土地整体...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>NaN</td>
      <td>保定</td>
      <td>2015.1.20</td>
      <td>人才政策</td>
      <td>户籍制度改革实施意见的提及放开人才落户限制。规定具有初级及以上专业技术职称、高级工（国家职业...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>NaN</td>
      <td>保定</td>
      <td>2016.4.17</td>
      <td>土地政策</td>
      <td>供地计划对土地供应总量、用途结构、空间布局、土地供应导向等做了详细规定，其中土地供应导向中强...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>NaN</td>
      <td>保定</td>
      <td>2016.4.20</td>
      <td>土地政策</td>
      <td>严格掌控土地供应，中心城区内经营性用地全部纳入政府储备。</td>
    </tr>
  </tbody>
</table>
</div>
<p><br><br></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;城市名称&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    array([&#39;宝鸡&#39;, &#39;保定&#39;, &#39;北海&#39;, &#39;北京&#39;, &#39;常州&#39;, &#39;成都&#39;], dtype=object)
</code></pre></div><br>
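若需要处理全部城市，可用 pd.read_excel(路径, sheet_name=None) 一次读入所有 sheet，返回 {表名: DataFrame} 的字典。下面用虚构的小数据代替读取结果，演示如何把多个 sheet 合并为一个表：

```python
import pandas as pd

# sheets 用虚构数据代替 pd.read_excel(path, sheet_name=None) 的返回值
sheets = {
    '宝鸡、保定': pd.DataFrame({'城市名称': ['宝鸡', '保定'], '标题': ['限贷政策', '人才政策']}),
    '北海、北京': pd.DataFrame({'城市名称': ['北海', '北京'], '标题': ['土地政策', '限购政策']}),
}

# 纵向拼接所有 sheet，便于整体统计
df_all = pd.concat(sheets.values(), ignore_index=True)
print(df_all['城市名称'].tolist())
```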
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s1">&#39;时间&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;时间&#39;</span><span class="p">])</span>

<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;时间&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
</code></pre></div><p>Run</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">    2010-04-30 00:00:00
    2022-08-04 00:00:00
</code></pre></div><p><br><br></p>
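转成 datetime 之后，还可以按年份统计政策条数（下面的 demo 数据为虚构，仅演示写法）：

```python
import pandas as pd

# 虚构几条政策记录
demo = pd.DataFrame({
    '城市名称': ['宝鸡', '保定', '保定', '北京'],
    '时间': pd.to_datetime(['2020-11-30', '2015-01-20', '2016-04-17', '2016-04-20']),
})

# 提取年份并统计每年的政策条数
per_year = demo['时间'].dt.year.value_counts().sort_index()
print(per_year)
```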
<h2 id="数据集获取">数据集获取</h2>
<p>链接: <a href="https://pan.baidu.com/s/13neTAQzuY3wkJzmc1FjwFg">https://pan.baidu.com/s/13neTAQzuY3wkJzmc1FjwFg</a> 提取码: w2ra</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>数据集 | 上市公司高管违规数据(2008-2022)</title>
      <link>https://textdata.cn/blog/2023-05-17-top-manager-violation/</link>
      <pubDate>Wed, 17 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-17-top-manager-violation/</guid>
      <description>&lt;h2 id=&#34;一数据概况&#34;&gt;一、数据概况&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;数据集名: 上市公司高管违规-原始数据.xlsx
记录条数: 25365
覆盖日期: 1997-01-16 ~ 2022-12-28
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df1.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;为得到截图所示的&lt;code&gt;高管违规次数.xlsx&lt;/code&gt;，实现步骤:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;使用pd.read_excel()函数读取&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;高管违规数据集 &lt;code&gt;上市公司高管违规-原始数据.xlsx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;股票代码列表 &lt;code&gt;行业代码.xlsx&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;然后我们使用pd.merge()函数将两个数据集按照股票代码和年度进行合并，使用全连接（how=&amp;lsquo;outer&amp;rsquo;）&lt;strong&gt;确保即使某些股票代码未出现在高管违规数据集中，也能保留在结果中&lt;/strong&gt;。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;接下来，我们使用groupby()函数按股票代码和年度进行分组，然后使用count()函数统计每个组的违规次数。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;检查结果， 无误后导出xlsx。字段包括股票代码、年度和违规次数。&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;二实现过程&#34;&gt;二、实现过程&lt;/h2&gt;
&lt;h3 id=&#34;21-导入数据&#34;&gt;2.1 导入数据&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;行业代码.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;converters&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df2.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;上市公司高管违规-原始数据.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;converters&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_datetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;25365&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df3.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;22-合并&#34;&gt;2.2 合并&lt;/h3&gt;
&lt;p&gt;然后我们使用pd.merge()函数将两个数据集按照股票代码和年度进行合并，使用全连接（how=&amp;lsquo;outer&amp;rsquo;）确保即使某些股票代码未出现在高管违规数据集中，也能保留在结果中。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;merge&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;how&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;outer&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;on&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df4.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;23-分组groupby&#34;&gt;2.3 分组Groupby&lt;/h3&gt;
&lt;p&gt;接下来，&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;使用groupby()函数按 &lt;code&gt;股票代码&lt;/code&gt; 和 &lt;code&gt;会计年度&lt;/code&gt; 进行分组&lt;/li&gt;
&lt;li&gt;然后使用count()函数统计每组次数&lt;/li&gt;
&lt;li&gt;并将计数列命名为 &lt;code&gt;违规次数&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;result_df&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groupby&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;会计年度&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;违规行为&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_index&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;违规次数&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;result_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df5.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;h3 id=&#34;24-检查保存&#34;&gt;2.4 检查&amp;amp;保存&lt;/h3&gt;
&lt;p&gt;检查结果， 无误后导出xlsx。字段包括股票代码、年度和违规次数。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/check.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;871753&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2022&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df6.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;股票代码&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;873527&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;公告日期&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;2018&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/df7.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;这里仅随机抽查了两条记录(实际工作中应多抽查几条)，结果与 result_df 一致，现在保存结果供后续实证分析。&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;n&#34;&gt;result_df&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to_excel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;高管违规次数.xlsx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;index&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2 id=&#34;三获取数据&#34;&gt;三、获取数据&lt;/h2&gt;
&lt;p&gt;链接: &lt;a href=&#34;https://pan.baidu.com/s/1Ff2G8jRZaTtJH7VcfQGX5Q?pwd=npyf&#34;&gt;https://pan.baidu.com/s/1Ff2G8jRZaTtJH7VcfQGX5Q?pwd=npyf&lt;/a&gt; 提取码: npyf&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
</description>
      <content:encoded><![CDATA[<h2 id="一数据概况">一、数据概况</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback">数据集名: 上市公司高管违规-原始数据.xlsx
记录条数: 25365
覆盖日期: 1997-01-16 ~ 2022-12-28
</code></pre></div><p><br><br></p>
<p><img loading="lazy" src="img/df1.png" alt=""  />
</p>
<p>为得到截图所示的<code>高管违规次数.xlsx</code>，实现步骤:</p>
<ol>
<li>
<p>使用pd.read_excel()函数读取</p>
<ul>
<li>高管违规数据集 <code>上市公司高管违规-原始数据.xlsx</code></li>
<li>股票代码列表 <code>行业代码.xlsx</code></li>
</ul>
</li>
<li>
<p>然后我们使用pd.merge()函数将两个数据集按照股票代码和年度进行合并，使用全连接（how=&lsquo;outer&rsquo;）<strong>确保即使某些股票代码未出现在高管违规数据集中，也能保留在结果中</strong>。</p>
</li>
<li>
<p>接下来，我们使用groupby()函数按股票代码和年度进行分组，然后使用count()函数统计每个组的违规次数。</p>
</li>
<li>
<p>检查结果， 无误后导出xlsx。字段包括股票代码、年度和违规次数。</p>
</li>
</ol>
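上述 1~3 步可以用一个自包含的小例子完整演示（df1、df2 为虚构数据，字段名与正文一致）：

```python
import pandas as pd

# 虚构数据：df1 为股票代码-年度面板，df2 为违规记录
df1 = pd.DataFrame({'股票代码': ['000001', '000001', '000002'],
                    '会计年度': [2021, 2022, 2022]})
df2 = pd.DataFrame({'股票代码': ['000001', '000001'],
                    '会计年度': [2022, 2022],
                    '违规行为': ['信息披露虚假', '违规担保']})

# 全连接保证未违规的公司-年度也保留在结果中
df = pd.merge(df1, df2, how='outer', on=['股票代码', '会计年度'])

# count() 只统计非缺失值，未违规的组计数自然为 0
result_df = df.groupby(['股票代码', '会计年度'])['违规行为'].count().reset_index(name='违规次数')
print(result_df)
```

注意这里用 count() 而非 size()：size() 会把全连接补出的缺失行也计为 1 次。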
<p><br><br></p>
<h2 id="二实现过程">二、实现过程</h2>
<h3 id="21-导入数据">2.1 导入数据</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="n">df1</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;行业代码.xlsx&#39;</span><span class="p">,</span>  <span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;股票代码&#39;</span><span class="p">:</span> <span class="nb">str</span><span class="p">})</span>
<span class="n">df1</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df2.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;上市公司高管违规-原始数据.xlsx&#39;</span><span class="p">,</span> <span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;股票代码&#39;</span><span class="p">:</span> <span class="nb">str</span><span class="p">})</span>
<span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">])</span>
<span class="n">df2</span><span class="p">[</span><span class="s1">&#39;会计年度&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df2</span><span class="p">))</span>
<span class="n">df2</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p>25365</p>
<p><img loading="lazy" src="img/df3.png" alt=""  />
</p>
<br>
<h3 id="22-合并">2.2 Merge</h3>
<p>Next we use the pd.merge() function to combine the two datasets on stock code and fiscal year, using a full outer join (how=&lsquo;outer&rsquo;) so that stock codes absent from the executive-violation dataset are still retained in the result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span> <span class="n">df2</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;outer&#39;</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">])</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div><p><img loading="lazy" src="img/df4.png" alt=""  />
</p>
<br>
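<p>A minimal sketch of the outer-merge behavior, using made-up toy rows rather than the actual datasets (the <code>indicator</code> flag is an extra pandas option, added here only to show where each row came from):</p>

```python
import pandas as pd

# toy stand-ins for the industry-code table and the violation table
left = pd.DataFrame({'股票代码': ['000001', '000002'], '会计年度': [2022, 2022]})
right = pd.DataFrame({'股票代码': ['000001'], '会计年度': [2022], '违规行为': ['虚假陈述']})

# how='outer' keeps '000002' even though it has no violation record
merged = pd.merge(left, right, how='outer', on=['股票代码', '会计年度'], indicator=True)
print(merged)
```

The unmatched firm-year survives the merge with NaN in 违规行为, which is exactly what the counting step below relies on.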
<h3 id="23-分组groupby">2.3 Groupby</h3>
<p>Next, we:</p>
<ol>
<li>group by <code>股票代码</code> (stock code) and <code>会计年度</code> (fiscal year) with groupby()</li>
<li>count the records in each group with count()</li>
<li>name the resulting column <code>违规次数</code> (violation count)</li>
</ol>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">result_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;股票代码&#39;</span><span class="p">,</span> <span class="s1">&#39;会计年度&#39;</span><span class="p">])[</span><span class="s1">&#39;违规行为&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;违规次数&#39;</span><span class="p">)</span>
<span class="n">result_df</span>
</code></pre></div><p><img loading="lazy" src="img/df5.png" alt=""  />
</p>
<br>
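<p>One subtlety worth noting: count() ignores NaN, so firm-years that entered only through the outer merge (i.e. with no violation record) are counted as 0 rather than 1. A small sketch with made-up rows:</p>

```python
import pandas as pd

# one firm-year with two violations, one firm-year with only the NaN
# left over from the outer merge
df = pd.DataFrame({
    '股票代码': ['000001', '000001', '000002'],
    '会计年度': [2022, 2022, 2022],
    '违规行为': ['虚假陈述', '虚假陈述', None],
})

# count() skips NaN, so '000002' ends up with 违规次数 == 0
result = df.groupby(['股票代码', '会计年度'])['违规行为'].count().reset_index(name='违规次数')
print(result)
```

This is why the outer join in step 2.2 matters: with an inner join, non-violating firm-years would be missing from the panel instead of appearing with a zero.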
<h3 id="24-检查保存">2.4 Check &amp; Save</h3>
<p>Spot-check the result; once verified, export it to xlsx. The fields include stock code, fiscal year, and violation count.</p>
<p><img loading="lazy" src="img/check.png" alt=""  />
</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="p">[(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;871753&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">==</span><span class="mi">2022</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/df6.png" alt=""  />
</p>
<br>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">df2</span><span class="p">[(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;股票代码&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;873527&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">df2</span><span class="p">[</span><span class="s1">&#39;公告日期&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">year</span><span class="o">==</span><span class="mi">2018</span><span class="p">)]</span>
</code></pre></div><p><img loading="lazy" src="img/df7.png" alt=""  />
</p>
<br>
<p>Here we randomly spot-checked only two records (in practice you should check more); they are consistent with result_df. Now save the result for subsequent empirical analysis.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="n">result_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;高管违规次数.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div><p><br><br></p>
<h2 id="三获取数据">3. Get the Data</h2>
<p>Link: <a href="https://pan.baidu.com/s/1Ff2G8jRZaTtJH7VcfQGX5Q?pwd=npyf">https://pan.baidu.com/s/1Ff2G8jRZaTtJH7VcfQGX5Q?pwd=npyf</a> extraction code: npyf</p>
<p><br><br></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Web Crawler | Batch-Collecting Answers to the Zhihu Topic 「如何评价淄博烧烤？」</title>
      <link>https://textdata.cn/blog/2023-05-12-welcome-to-zibo-barbecue/</link>
      <pubDate>Fri, 12 May 2023 00:00:00 +0000</pubDate>
      
      <guid>/blog/2023-05-12-welcome-to-zibo-barbecue/</guid>
      <description>&lt;h2 id=&#34;一采集数据&#34;&gt;一、采集数据&lt;/h2&gt;
&lt;p&gt;最近淄博烧烤一直很火，岁数大了，莫名其妙就被有人情味的短视频感动。 我从小在山东长大，一直25岁离开山东。 我记得09年升大学前后，我记得山东宣传的口号「&lt;strong&gt;好客山东，欢迎您&lt;/strong&gt;」，但是一直在生活在山东， 对好客的理解不够深刻。后来走过的地方多了，对比之下才知道「好客山东」 不仅仅是口号，更是山东人民好客的真实写照。 淄博烧烤的出圈，不是口味，也不是价格，是淄博乃至山东仁义价值观的成功。&lt;/p&gt;
&lt;p&gt;知乎话题「&lt;strong&gt;如何评价淄博的烧烤？&lt;/strong&gt;」数据采集于2023年5月12日。 之前分享过付费代码， &lt;a href=&#34;https://textdata.cn/blog/2023-04-23-data-collector-for-douban-group-parent-child-relationship/&#34;&gt;网络爬虫(付费)  |  知乎热门话题「全职儿女」&lt;/a&gt;  ，现在免费公开数据采集部分代码。公众号内的代码复制容易出问题， 建议textdata.cn中找本文对应的博文，准确复制代码。&lt;/p&gt;
&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;img/01-screen.png&#34; alt=&#34;&#34;  /&gt;
&lt;/p&gt;
&lt;p&gt;From the screenshot, obtain the initial parameters for the crawler&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;#话题ID 
question_id = 510779192

# number of answers
reply_num = 1178
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;a href=&#34;https://www.zhihu.com/question/510779192&#34;&gt;https://www.zhihu.com/question/510779192&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;csv&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;requests&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;time&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#知乎话题ID&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;question_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;510779192&amp;#39;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;#当前回答数&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;reply_num&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1178&lt;/span&gt;


&lt;span class=&#34;c1&#34;&gt;#存储csv,文件名为话题ID&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question_id&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;.csv&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encoding&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;newline&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;#字段&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;userID&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;gender&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;headline&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
                  &lt;span class=&#34;s1&#34;&gt;&amp;#39;is_advertiser&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;is_org&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;utype&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;can_comment&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
                 &lt;span class=&#34;s1&#34;&gt;&amp;#39;follower_count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;updated_time&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;voteup_count&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;csv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DictWriter&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;csvf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fieldnames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;writer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;writeheader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
    
    &lt;span class=&#34;c1&#34;&gt;#网址规律&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;https://www.zhihu.com/api/v4/questions/&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question_id&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;/feeds&amp;#39;&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;next_url&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#39;&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;data&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;s1&#34;&gt;&amp;#39;include&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,attachment,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,is_labeled,paid_info,paid_info_content,reaction_instruction,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_recognized;data[*].mark_infos[*].url;data[*].author.follower_count,vip_info,badge[*].topics;data[*].settings.table_of_content.enabled&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s1&#34;&gt;&amp;#39;offset&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s1&#34;&gt;&amp;#39;limit&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;s1&#34;&gt;&amp;#39;order&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;updated&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;user-agent&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

    &lt;span class=&#34;c1&#34;&gt;#循环抓取&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;max_page&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reply_num&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;page&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_page&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;time&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sleep&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;c1&#34;&gt;#发起访问&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;next_url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requests&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;next_url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
            &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requests&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;params&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;answers&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34