Microsoft Learning to Rank Datasets

Established: June 10, 2010

We released two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of 10,000 queries drawn from MSLR-WEB30K.

 

Dataset Descriptions

The datasets are machine learning data in which queries and urls are represented by IDs. Each dataset consists of feature vectors extracted from query-url pairs, along with relevance judgment labels:

(1) The relevance judgments were obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values, from 0 (irrelevant) to 4 (perfectly relevant).

(2) The features were extracted by us and are those widely used in the research community.

In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the following columns are features. The larger the relevance label, the more relevant the query-url pair. Each query-url pair is represented by a 136-dimensional feature vector.

Below are two rows from the MSLR-WEB10K dataset:

==============================================

0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0

2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0

==============================================
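These rows follow the common SVMLight-style text format (label, qid, then feature:value pairs), so off-the-shelf loaders can read them directly. Below is a minimal loading sketch in Python; the path "Fold1/train.txt" is only an assumed example name for an extracted data file, and scikit-learn is an external dependency not mentioned elsewhere on this page.

==============================================

# Minimal loading sketch (assumed local path "Fold1/train.txt").
from sklearn.datasets import load_svmlight_file

# query_id=True also returns the qid column, so rows can be grouped by query.
X, y, qid = load_svmlight_file("Fold1/train.txt", query_id=True)

print(X.shape)   # (number of query-url pairs, 136), sparse feature matrix
print(y[:5])     # relevance labels in {0, 1, 2, 3, 4}
print(qid[:5])   # query ids, one per row

==============================================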

Dataset Partition

We have partitioned each dataset into five parts with about the same number of queries, denoted S1, S2, S3, S4, and S5, for five-fold cross validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table; the rotation is also spelled out in the short sketch after the table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

Folds   Training Set   Validation Set   Test Set
Fold1   {S1,S2,S3}     S4               S5
Fold2   {S2,S3,S4}     S5               S1
Fold3   {S3,S4,S5}     S1               S2
Fold4   {S4,S5,S1}     S2               S3
Fold5   {S5,S1,S2}     S3               S4
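The rotation in the table can be written down in a few lines. The sketch below simply enumerates the fold assignments; it is an illustration, not part of the official release.

==============================================

# Enumerate the 5-fold rotation from the table above.
parts = ["S1", "S2", "S3", "S4", "S5"]

for i in range(5):
    train = [parts[(i + j) % 5] for j in range(3)]  # three parts for training
    valid = parts[(i + 3) % 5]                      # one part for validation
    test  = parts[(i + 4) % 5]                      # remaining part for test
    print(f"Fold{i + 1}: train={train}, validation={valid}, test={test}")

==============================================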


Datasets

The datasets were released on June 16, 2010.

To use the datasets, you must read and accept the online agreement. By using the datasets, you agree to be bound by the terms of their license.

Datasets       Size       MD5
MSLR-WEB10K    ~1.2 GB    97c5d4e7c171e475c91d7031e4fd8e79
MSLR-WEB30K    ~3.7 GB    4beae4bee0cd244fc9b2aff355a61555
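After downloading, the MD5 values above can be used to check that the files arrived intact. A minimal verification sketch follows; the local file name "MSLR-WEB10K.zip" is only an assumed example and may differ from the actual download.

==============================================

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "97c5d4e7c171e475c91d7031e4fd8e79"   # MSLR-WEB10K, from the table above
actual = md5_of_file("MSLR-WEB10K.zip")         # assumed local file name
print("OK" if actual == expected else "MD5 mismatch: " + actual)

==============================================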

 

Evaluation Tools

The evaluation script was updated on Jan. 13, 2011. Thanks to Yasser Ganjisaffar for pointing out the bug.
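The official evaluation script is not reproduced here. For orientation only, the sketch below shows NDCG@k, a metric commonly reported on these datasets with graded labels 0-4; the official script may use different gain or discount conventions, so treat this as an illustration rather than the reference implementation.

==============================================

import math

def dcg_at_k(labels, k):
    """DCG with the common (2^rel - 1) gain and log2(rank + 1) discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k for one query: DCG of the model's ranking over the ideal DCG."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

# Labels of the documents in the order a model ranked them for one query.
print(ndcg_at_k([2, 0, 4, 1, 0], k=5))

==============================================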

 

Feature List

Each query-url pair is represented by a 136-dimensional vector.

Feature List of Microsoft Learning to Rank Datasets
For features 1-125, each description below covers five consecutive feature ids, one per stream, in the order: body, anchor, title, url, whole document. Features 126-136 are single features, one id each. An illustrative sketch of one feature group appears after the table.

feature ids   feature description                                   comments
1-5           covered query term number
6-10          covered query term ratio
11-15         stream length
16-20         IDF (inverse document frequency)
21-25         sum of term frequency
26-30         min of term frequency
31-35         max of term frequency
36-40         mean of term frequency
41-45         variance of term frequency
46-50         sum of stream length normalized term frequency
51-55         min of stream length normalized term frequency
56-60         max of stream length normalized term frequency
61-65         mean of stream length normalized term frequency
66-70         variance of stream length normalized term frequency
71-75         sum of tf*idf
76-80         min of tf*idf
81-85         max of tf*idf
86-90         mean of tf*idf
91-95         variance of tf*idf
96-100        boolean model
101-105       vector space model
106-110       BM25
111-115       LMIR.ABS                                              language model approach for information retrieval (IR) with absolute discounting smoothing
116-120       LMIR.DIR                                              language model approach for IR with Bayesian smoothing using Dirichlet priors
121-125       LMIR.JM                                               language model approach for IR with Jelinek-Mercer smoothing
126           number of slashes in URL
127           length of URL
128           inlink number
129           outlink number
130           PageRank
131           SiteRank                                              site-level PageRank
132           QualityScore                                          quality score of a web page, output by a web page quality classifier
133           QualityScore2                                         quality score of a web page, output by a web page quality classifier that measures the badness of the page
134           query-url click count                                 click count of a query-url pair at a search engine within a period
135           url click count                                       click count of a url aggregated from user browsing data within a period
136           url dwell time                                        average dwell time of a url aggregated from user browsing data within a period
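To make the feature groups above concrete, the sketch below computes the term-frequency statistics corresponding to feature ids 21-45 for a single stream, reading the descriptions literally: count each query term's occurrences in the stream, then take the sum, min, max, mean, and variance over the query terms. This is an illustrative reconstruction, not the code used to build the datasets.

==============================================

# Illustrative sketch of the term-frequency statistic features (ids 21-45)
# for one stream (body, anchor, title, url, or whole document).
from collections import Counter
from statistics import mean, pvariance

def tf_statistics(query_terms, stream_text):
    counts = Counter(stream_text.lower().split())
    tfs = [counts[term.lower()] for term in query_terms]
    return {
        "sum": sum(tfs),
        "min": min(tfs),
        "max": max(tfs),
        "mean": mean(tfs),
        "variance": pvariance(tfs),
    }

print(tf_statistics(["learning", "rank"], "learning to rank with learning algorithms"))

==============================================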

Reference

You can cite the datasets as follows.

@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and
               Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Release Notes

  • The following people contributed to the construction of the data: Tao Qin, Tie-Yan Liu, Wenkui Ding, Jun Xu, and Hang Li.
  • We would like to thank the Bing team for its support in creating the datasets. We would also like to thank Nick Craswell for his help with the dataset release.
  • If you have any questions or suggestions, please let us know.
  • Related links: the LETOR 3.0 and LETOR 4.0 datasets.

People


Tao Qin

Partner Research Manager


Tie-Yan Liu

Distinguished Scientist, Microsoft Research AI for Science