Blogs

EDGAR | Pretrained Word-Embedding Models Built on 25 Years of Data

EDGAR (Electronic Data Gathering, Analysis, and Retrieval) is the electronic filing system of the U.S. Securities and Exchange Commission (SEC). It gives the public internet access to the documents companies file with the SEC, such as registration statements, annual reports, and other disclosures. These filings contain financial, business, and other key information about the companies, which makes them valuable to investors and researchers. If you work in finance or a related field and want to use **word embedding** techniques in your research, this open-source dataset is worth considering...

Dataset | Collected Transcripts of 睡前消息 (Bedtime News) from 马前卒工作室

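I have long been in the habit of watching 睡前消息 from 马前卒工作室. Its content strikes me as rational, with a Marxist-Leninist, scientific-socialist flavor. Two of its topics went viral across the internet: the Dushan County debt problem and the Yiling Pharmaceutical Lianhua Qingwen capsule affair. **The data can be used to practice word-frequency counting, word clouds, sentiment analysis, and LDA topic modeling. It has been compiled into a CSV file for anyone who needs it.**...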

Visualization | Using Word-Embedding Models to Measure Stereotypes and Related Information in the Social Sciences (with Code)

The words of a language reflect the structure of human thought and let us pass ideas from one person to another, and word-embedding models trained on large corpora capture this kind of information. Tutorials on applying English word embeddings in the social sciences are plentiful and easy to find with a Google search, so my main aim here is to add tutorials that work with Chinese data...
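
One common way to read such information out of an embedding is to project words onto a semantic axis built from two groups of pole words. The sketch below illustrates that general idea only; it is not necessarily the method used in the post. The vectors in `toy_embeddings` are invented 3-dimensional values, and the function names (`semantic_axis`, `project`) are hypothetical. In practice the vectors would come from a pretrained model.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_axis(emb, pole_a, pole_b):
    """Axis pointing from the mean of pole_b words to the mean of pole_a words."""
    a = mean_vector([emb[w] for w in pole_a])
    b = mean_vector([emb[w] for w in pole_b])
    return [x - y for x, y in zip(a, b)]


def project(emb, word, axis):
    """Positive values lean toward pole_a, negative values toward pole_b."""
    return cosine(emb[word], axis)


# Invented toy vectors for illustration only; real embeddings have
# hundreds of dimensions and are learned from a corpus.
toy_embeddings = {
    "he": [1.0, 0.1, 0.0],
    "man": [0.9, 0.2, 0.1],
    "she": [-1.0, 0.1, 0.0],
    "woman": [-0.9, 0.2, 0.1],
    "engineer": [0.4, 0.8, 0.2],
    "nurse": [-0.5, 0.7, 0.3],
}

gender_axis = semantic_axis(toy_embeddings, ["he", "man"], ["she", "woman"])
scores = {w: project(toy_embeddings, w, gender_axis) for w in ["engineer", "nurse"]}

for word, score in scores.items():
    print(f"{word}: {score:+.3f}")
```

With real embeddings, the sign and size of each score show how strongly the training corpus associates a word with one pole or the other. That association is a property of the corpus, not a fact about the world.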

TechWeekly-19 Interesting and Useful Tech Picks of the Week

simpleT5 Library | Generating Titles from English Abstracts

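T5 (Text-to-Text Transfer Transformer) is a Transformer-based natural language processing model developed by the Google Brain team. It uses an encoder-decoder architecture: the encoder maps the input text to a sequence of vector representations, and the decoder generates the target text from them. What sets T5 apart from other NLP models is that it treats every task as a conversion from input text to output text. With this unified "text-to-text" format, the same model parameters and pretrained checkpoint can be shared across tasks such as **text classification, named entity recognition, text summarization, and question answering**, which makes training more efficient...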
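
To make the "text-to-text" idea concrete, the sketch below casts a few different tasks into plain (input text, output text) pairs, so that one sequence-to-sequence model could be trained on all of them. It uses only the standard library and does not call T5 or simpleT5. The helper `to_text_to_text`, the example records, and the `generate title:` prefix are hypothetical. The `source_text` and `target_text` keys follow the column names that simpleT5 is understood to expect for training data, which is an assumption to check against the library's README.

```python
def to_text_to_text(task, **fields):
    """Cast one labeled example into a (source_text, target_text) pair.

    The task prefix tells the model which task to perform. The prefixes
    here are illustrative choices, not a fixed standard.
    """
    if task == "summarize":
        return "summarize: " + fields["document"], fields["summary"]
    if task == "title":
        # Hypothetical prefix for abstract-to-title generation.
        return "generate title: " + fields["abstract"], fields["title"]
    if task == "qa":
        source = "question: " + fields["question"] + " context: " + fields["context"]
        return source, fields["answer"]
    raise ValueError(f"unknown task: {task}")


# Invented example records, one per task.
examples = [
    ("summarize", {"document": "The company reported higher revenue and lower costs.",
                   "summary": "Revenue rose while costs fell."}),
    ("title", {"abstract": "We study word embeddings trained on financial filings.",
               "title": "Word Embeddings for Financial Filings"}),
    ("qa", {"question": "Who runs EDGAR?",
            "context": "EDGAR is operated by the SEC.",
            "answer": "the SEC"}),
]

pairs = []
for task, fields in examples:
    source_text, target_text = to_text_to_text(task, **fields)
    pairs.append({"source_text": source_text, "target_text": target_text})

for pair in pairs:
    print(pair["source_text"], "->", pair["target_text"])
```

Because every task ends up in the same two-column shape, adding a new task only means adding a new prefix and formatter. The model and training loop stay the same.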