Blogs

Dataset | 660 Million Google Maps Reviews from the United States (through September 2021)

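This dataset contains review information from Google Maps in the United States up to September 2021 (ratings, text, images, etc.), business metadata (address, geographic information, descriptions, category information, price, opening hours, and other information), and links to related businesses....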

Dataset | arXiv: Metadata for 2.69 Million Academic Papers (2007 ~ 2025)

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more....

Dataset | Glassdoor: 9.9 Million UK Company (Job) Reviews (2008 ~ July 2023)

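Glassdoor was founded in 2007 and is headquartered in Mill Valley, California, USA. It lets employees post anonymous reviews of their companies, working environments, salaries, and more, and it also offers job search, company ratings, and shared interview experiences as a reference for job seekers and current employees. Although Glassdoor originated in the United States, it has expanded to many countries and regions, including the United Kingdom, and serves users worldwide. Users can therefore look up company information and job openings from around the world, including but not limited to: company reviews and ratings, salary reports, interview questions and experiences, and job postings. So while Glassdoor is available in the UK and is very useful to UK professionals, it is not a website limited to the UK or operated from the UK. It is a multinational platform that aims to give users worldwide transparent information about workplaces and the hiring process....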

Dataset | Layline US Stock Insider Trading Dataset

This dataset captures insider trading activity at publicly traded companies. The Securities and Exchange Commission has made these insider trading reports available on its website in a structured format since mid-2003. However, most academic papers use proprietary commercial databases instead of regulatory filings directly, which makes replication challenging because the data manipulation and aggregation steps in commercial databases are opaque and historical records could be altered by the data provider over time. To overcome these limitations, the presented dataset is created from the original regulatory filings; it is updated daily and includes all information reported by insiders without alteration....