1. Introduction to ZhihuRec
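The ZhihuRec dataset was built jointly by the Information Retrieval Group at Tsinghua University (THUIR) and Zhihu, and is intended for research use only. It was collected from Zhihu, a knowledge-sharing platform, and consists of about 100 million (100M) interactions gathered over 10 days, covering 798K users, 165K questions, 554K answers, 240K authors, about 70K topics, and query logs from more than 501K users. Anonymized descriptions of users, answers, questions, authors, and topics are also included. To the best of the authors' knowledge, it is the largest real-world interaction dataset for personalized recommendation. Because it contains about 100M user-answer impression logs, it is also called ZhihuRec-100M. Two smaller datasets, ZhihuRec-20M and ZhihuRec-1M, were randomly sampled from ZhihuRec-100M to suit different application needs; they contain about 20M and 1M user-answer impression logs and can be treated as a medium-sized and a relatively small dataset, respectively.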
ZhihuRec project and download page
2. Dataset Details
2.1 Files in the Dataset
| Filename | Size | Description |
| --- | --- | --- |
| inter_impression.csv | 2.6GB | user clicks and impressions |
| inter_query.csv | 111MB | user queries |
| info_user.csv | 135MB | the features of the users that occur in the dataset |
| info_answer.csv | 917MB | the features of the answers that occur in the dataset |
| info_question.csv | 14MB | the features of the questions that occur in the dataset |
| info_author.csv | 3.1MB | the features of the authors that occur in the dataset |
| info_topic.csv | 413KB | the IDs of the topics that occur in the dataset |
| info_token.csv | 409MB | the features of the tokens that occur in the dataset |
2.2 Dataset Statistics
| Dataset | ZhihuRec-100M | ZhihuRec-20M | ZhihuRec-1M |
| --- | --- | --- | --- |
| #impressions | 99,978,523 | 19,999,857 | 999,970 |
| #clicks | 26,981,583 | 5,402,345 | 268,656 |
| #clicks : #non-clicks | 1 : 2.71 | 1 : 2.70 | 1 : 2.72 |
| #queries | 3,899,553 | 776,201 | 38,422 |
| #users | 798,086 | 159,642 | 7,974 |
| avg #impressions per user | 125.27 | 125.28 | 125.40 |
| avg #clicks per user | 33.81 | 33.84 | 33.69 |
| #users with queries | 501,893 | 100,271 | 5,047 |
| avg #queries per user | 7.77 | 7.74 | 7.61 |
| #answers | 554,976 | 343,103 | 81,563 |
| #questions | 165,012 | 104,130 | 29,340 |
| #authors | 240,956 | 167,796 | 47,888 |
| #topics | 72,318 | 54,785 | 22,897 |
| #tokens | 556,546 | 428,334 | 249,586 |
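The derived rows in the statistics follow directly from the raw counts. As a quick sanity check, the sketch below recomputes them for ZhihuRec-100M; the variable names are illustrative and not part of the dataset.

```python
# Raw counts for ZhihuRec-100M, copied from the statistics table.
impressions = 99_978_523
clicks = 26_981_583
queries = 3_899_553
users = 798_086
users_with_queries = 501_893

# Every impression is either clicked or not clicked.
non_clicks = impressions - clicks

# The "1 : x" value in the "#clicks : #non-clicks" row.
non_click_ratio = round(non_clicks / clicks, 2)

avg_impressions_per_user = round(impressions / users, 2)
avg_clicks_per_user = round(clicks / users, 2)

# The table value is reproduced when dividing by the number of users
# who issued at least one query, not by the number of all users.
avg_queries_per_user = round(queries / users_with_queries, 2)

print(f"#clicks : #non-clicks = 1 : {non_click_ratio}")
print(f"avg #impressions per user = {avg_impressions_per_user}")
print(f"avg #clicks per user = {avg_clicks_per_user}")
print(f"avg #queries per user = {avg_queries_per_user}")
```

Roughly 27% of impressions are clicks, so models trained on these logs see about 2.7 negatives per positive before any resampling.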
2.3 Dataset Fields
Some fields in the dataset may be null; null values are represented by empty strings in the files.
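Because null values arrive as empty strings, a loader has to decide how to represent them. The helpers below are one possible sketch: `empty_to_none` and `split_ids` are hypothetical names, and IDs are kept as strings since the description does not state their type.

```python
from typing import List, Optional


def empty_to_none(field: str) -> Optional[str]:
    """Map the empty string (a null field) to None; keep other values."""
    return field if field != "" else None


def split_ids(field: str) -> List[str]:
    """Split a space-separated ID field; a null field yields an empty list."""
    return field.split()


print(empty_to_none(""))      # a null scalar field
print(split_ids("3 5 8"))     # a populated list field
print(split_ids(""))          # a null list field
```

Returning an empty list for a null list field keeps downstream code free of `None` checks, at the cost of not distinguishing "null" from "no IDs".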
inter_impression.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | user ID |
| 1 | | answer ID |
| 2 | | impression timestamp |
| 3 | | click timestamp (0 for non-click) |
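Since a click timestamp of 0 marks a non-click, a binary click label can be derived from column 3. The following is a minimal sketch that assumes the file is comma-separated with no header row and that timestamps are integers; the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Impression:
    user_id: str
    answer_id: str
    impression_ts: int
    click_ts: int  # 0 means the answer was shown but not clicked

    @property
    def clicked(self) -> bool:
        return self.click_ts != 0


def read_impressions(lines: Iterable[str]) -> Iterator[Impression]:
    """Yield one Impression per row, using the column order of the table."""
    for row in csv.reader(lines):
        yield Impression(row[0], row[1], int(row[2]), int(row[3]))


# Invented sample rows; with the real file, pass an open file object instead.
sample = io.StringIO(
    "u1,a10,1525000000,0\n"
    "u1,a11,1525000005,1525000042\n"
)
records = list(read_impressions(sample))
labels = [int(r.clicked) for r in records]
print(labels)  # [0, 1]
```

Reading lazily with a generator matters here because the full file is 2.6GB.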
inter_query.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | user ID |
| 1 | | token IDs in the query (separated by spaces) |
| 2 | | query timestamp |
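Query logs are typically consumed as a per-user history ordered by time. The sketch below groups rows by user ID and sorts each history by timestamp; it assumes comma-separated rows without a header, and the function name `load_queries` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

QueryHistory = List[Tuple[int, List[str]]]  # (timestamp, token IDs)


def load_queries(lines: Iterable[str]) -> Dict[str, QueryHistory]:
    """Group queries by user and sort each user's queries by timestamp."""
    by_user: Dict[str, QueryHistory] = defaultdict(list)
    for user_id, tokens, timestamp in csv.reader(lines):
        by_user[user_id].append((int(timestamp), tokens.split()))
    for history in by_user.values():
        history.sort(key=lambda item: item[0])
    return dict(by_user)


# Invented sample rows, deliberately out of chronological order.
sample = io.StringIO(
    "u1,12 7 33,1525000300\n"
    "u2,5,1525000100\n"
    "u1,7,1525000200\n"
)
histories = load_queries(sample)
print(histories["u1"])
```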
info_user.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | user ID |
| 1 | | register timestamp |
| 2 | | gender |
| 3 | | login frequency |
| 4 | | #followers |
| 5 | | #topics followed by this user |
| 6 | | #questions followed by this user |
| 7 | | #answers |
| 8 | | #questions |
| 9 | | #comments |
| 10 | | #thanks received by this user |
| 11 | | #comments received by this user |
| 12 | | #likes received by this user |
| 13 | | #dislikes received by this user |
| 14 | | register type |
| 15 | | register platform |
| 16 | | from android or not |
| 17 | | from iphone or not |
| 18 | | from ipad or not |
| 19 | | from pc or not |
| 20 | | from mobile web or not |
| 21 | | device model |
| 22 | | device brand |
| 23 | | platform |
| 24 | | province |
| 25 | | city |
| 26 | ✓ | topic IDs followed by this user (separated by spaces) |
info_answer.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | answer ID |
| 1 | ✓ | question ID |
| 2 | | anonymous or not |
| 3 | ✓ | author ID (null for anonymous) |
| 4 | | labeled high-value answer or not |
| 5 | | recommended by the editor or not |
| 6 | | create timestamp |
| 7 | | contain pictures or not |
| 8 | | contain videos or not |
| 9 | | #thanks |
| 10 | | #likes |
| 11 | | #comments |
| 12 | | #collections |
| 13 | | #dislikes |
| 14 | | #reports |
| 15 | | #helpless |
| 16 | ✓ | token IDs in the answer (separated by spaces) |
| 17 | ✓ | topic IDs of the answer (separated by spaces) |
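The answer file has four nullable columns (1, 3, 16, and 17), so it is a natural place to make nulls explicit in the type. The sketch below parses only the ID-related columns into a dataclass; the class and function names are hypothetical, the sample row is invented, and the encoding of the yes/no columns is deliberately left uninterpreted because the description does not specify it.

```python
import csv
from dataclasses import dataclass
from typing import List, Optional

EXPECTED_COLUMNS = 18  # indices 0..17 in the field table


@dataclass
class Answer:
    answer_id: str
    question_id: Optional[str]  # nullable (column 1)
    author_id: Optional[str]    # nullable, null for anonymous (column 3)
    token_ids: List[str]        # nullable (column 16), [] when null
    topic_ids: List[str]        # nullable (column 17), [] when null


def parse_answer(row: List[str]) -> Answer:
    """Build an Answer from one CSV row, mapping empty strings to nulls."""
    if len(row) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
    return Answer(
        answer_id=row[0],
        question_id=row[1] or None,
        author_id=row[3] or None,
        token_ids=row[16].split(),
        topic_ids=row[17].split(),
    )


# Invented sample row: an anonymous answer with tokens but no topics.
fields = ["a10", "q3", "1", "", "0", "0", "1525000000", "0", "0",
          "2", "17", "4", "1", "0", "0", "0", "4 8 15", ""]
row = next(csv.reader([",".join(fields)]))
answer = parse_answer(row)
print(answer)
```

Checking the column count up front surfaces malformed rows early instead of producing an `IndexError` deep inside later processing.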
info_question.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | question ID |
| 1 | | create timestamp |
| 2 | | #answers |
| 3 | | #followers |
| 4 | | #invitations |
| 5 | | #comments |
| 6 | ✓ | token IDs in the question (separated by spaces) |
| 7 | ✓ | topic IDs of the question (separated by spaces) |
info_author.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | author ID |
| 1 | | is excellent author or not |
| 2 | | #followers |
| 3 | | is excellent answerer or not |
info_topic.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | topic ID |
info_token.csv

| Index | Nullable | Description |
| --- | --- | --- |
| 0 | | token ID |
| 1 | | word vector trained by word2vec (64 dimensions, separated by spaces) |
For privacy reasons, ZhihuRec does not provide the text corresponding to the tokens. Researchers can use the word vectors in the dataset or train word vectors from scratch.
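One common way to use the provided word vectors is to average the vectors of the tokens in a query, question, or answer to obtain a fixed-size text representation. The sketch below shows this mean pooling in plain Python; it is an illustration rather than a method prescribed by the dataset, it assumes comma-separated rows without a header, and the function names and the two toy vectors are invented.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional


def load_token_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Map each token ID to its space-separated word vector."""
    vectors: Dict[str, List[float]] = {}
    for token_id, vector in csv.reader(lines):
        vectors[token_id] = [float(x) for x in vector.split()]
    return vectors


def mean_pool(token_ids: List[str],
              vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of known tokens; None if no token is known."""
    found = [vectors[t] for t in token_ids if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Two toy 64-dimensional vectors standing in for rows of info_token.csv.
sample = io.StringIO(
    "7," + " ".join(["0.5"] * 64) + "\n"
    "12," + " ".join(["1.5"] * 64) + "\n"
)
vectors = load_token_vectors(sample)
query_embedding = mean_pool(["12", "7", "999"], vectors)  # "999" is unknown
print(len(query_embedding), query_embedding[0])
```

For the full 409MB file, a NumPy array indexed by token position would be far more memory-efficient than nested Python lists; the list-based version is kept here only to stay dependency-free.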
Citation
The ZhihuRec dataset can be downloaded from the project page. It accompanies the following paper:
Bin Hao, Min Zhang, Weizhi Ma, Shaoyun Shi, Xinxing Yu, Houzhi Shan, Yiqun Liu and Shaoping Ma, 2021, A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing. arXiv preprint arXiv:2106.06467.
Please cite the paper if you use this dataset:
@misc{hao2021largescale,
title={A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing},
author={Bin Hao and Min Zhang and Weizhi Ma and Shaoyun Shi and Xinxing Yu and Houzhi Shan and Yiqun Liu and Shaoping Ma},
year={2021},
eprint={2106.06467},
archivePrefix={arXiv},
primaryClass={cs.IR}
}