Abstract
This paper describes an approach for learning to generate web-search queries that collect documents matching a minority concept. As a case study, we use the concept of text documents written in Slovenian, a minority natural language on the Web. Each retrieved document is automatically labeled as relevant or non-relevant by a language filter, and this feedback is used to learn which query lengths and which inclusion/exclusion term-selection methods help find previously unseen documents in the target language. Our system, CorpusBuilder, learns to select “good” query terms using a variety of term-scoring methods. We present empirical results for learning methods that vary the time horizon used when learning from the results of past queries. Our approaches generalize well across several languages, regardless of the initial conditions.
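The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not CorpusBuilder's actual implementation: it assumes raw term frequency as the only term-scoring method, a `+term`/`-term` query syntax, and a sliding window of recent results as the "time horizon". All names (`top_terms`, `build_query`, `LengthSelector`) are hypothetical.

```python
from collections import Counter, deque


def top_terms(docs, k, skip=()):
    """Return the k most frequent whitespace tokens in docs, ignoring `skip`."""
    counts = Counter(
        token
        for doc in docs
        for token in doc.lower().split()
        if token not in skip
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def build_query(relevant, non_relevant, n_include, n_exclude):
    """Build a query from inclusion and exclusion terms.

    Assumption: term frequency is the scoring method; the abstract only
    says that "a variety" of scoring methods is used.
    """
    include = top_terms(relevant, n_include)
    exclude = top_terms(non_relevant, n_exclude, skip=set(include))
    return " ".join([f"+{t}" for t in include] + [f"-{t}" for t in exclude])


class LengthSelector:
    """Track how well each (n_include, n_exclude) query length has performed.

    Assumption: the time horizon is modeled as a fixed-size window holding
    the most recent outcomes for each query length.
    """

    def __init__(self, lengths, horizon):
        self.history = {length: deque(maxlen=horizon) for length in lengths}

    def record(self, length, fraction_relevant):
        """Store the fraction of retrieved documents the filter accepted."""
        self.history[length].append(fraction_relevant)

    def best(self):
        """Return the query length with the highest mean within the window."""
        def mean(values):
            return sum(values) / len(values) if values else 0.0
        return max(self.history, key=lambda length: mean(self.history[length]))
```

In use, each round would call `best()` to choose a query length, `build_query(...)` to form the query from the documents labeled so far, and `record(...)` with the fraction of returned documents that the language filter accepted. A shorter `horizon` makes the selector react faster to recent results, while a longer one averages over more past queries.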
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ghani, R., Jones, R., Mladenic, D. (2001). Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web. In: Zhong, N., Yao, Y., Liu, J., Ohsuga, S. (eds) Web Intelligence: Research and Development. WI 2001. Lecture Notes in Computer Science, vol 2198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45490-X_65
DOI: https://doi.org/10.1007/3-540-45490-X_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42730-8
Online ISBN: 978-3-540-45490-8
eBook Packages: Springer Book Archive