WebCrawlingChapter Chapter 8
8: Web Crawling
By Filippo Menczer
Indiana University School of Informatics
Crawler: basic idea
• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break redirection loops
– Soft fail for timeout, server not responding, file not found, and other errors
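The fetching rules above can be sketched as follows. This is a minimal illustration, not the crawler described in the chapter: `http_get` is a hypothetical transport callable returning `(status, location, body)`, and the limits `MAX_BYTES` and `MAX_REDIRECTS` are assumed values chosen for the example.

```python
MAX_BYTES = 100 * 1024   # assumed cap, upper end of the 10-100 KB range
MAX_REDIRECTS = 5        # assumed limit on redirect hops

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch(url, http_get, max_bytes=MAX_BYTES, max_redirects=MAX_REDIRECTS):
    """Fetch a page, returning (final_url, truncated_body) or None on soft failure.

    http_get is a hypothetical callable: http_get(url) -> (status, location, body).
    It may raise OSError (which includes timeouts and refused connections).
    """
    seen = set()
    for _ in range(max_redirects + 1):
        if url in seen:
            return None          # redirection loop detected: give up quietly
        seen.add(url)
        try:
            status, location, body = http_get(url)
        except OSError:
            return None          # timeout or server not responding: soft fail
        if status in REDIRECT_CODES and location:
            url = location       # follow the redirect on the next iteration
            continue
        if status != 200:
            return None          # file not found and other errors: soft fail
        return url, body[:max_bytes]   # keep only the first max_bytes
    return None                  # too many redirect hops
```

Returning `None` instead of raising keeps a single bad page from stopping the crawl loop; the caller simply moves on to the next URL in the frontier.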
• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?
– What do Google and other search engines do?
http://informatics.indiana.edu → http://informatics.indiana.edu/
http://informatics.indiana.edu/index.html#fragment → http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/ → http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/%7Efil/ → http://informatics.indiana.edu/~fil/
http://INFORMATICS.INDIANA.EDU/fil/ → http://informatics.indiana.edu/fil/
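The URL pairs above illustrate canonicalization: adding a missing root slash, dropping the fragment, resolving `.` and `..` segments, decoding percent-escapes of unreserved characters, and lowercasing the host. A minimal sketch using only the Python standard library is given below; the function name `canonicalize` is an assumption for illustration, and real crawlers typically apply further heuristics not shown here.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Characters that never need percent-encoding in a URL path (RFC 3986 "unreserved").
_UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def _decode_unreserved(path):
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)


def canonicalize(url):
    """Return a canonical form of url (sketch; covers only the cases shown above)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()              # host names are case-insensitive
    path = _decode_unreserved(parts.path) or "/"   # empty path becomes "/"
    trailing_slash = path.endswith("/")
    path = posixpath.normpath(path)            # resolves "." and ".." segments
    if trailing_slash and not path.endswith("/"):
        path += "/"                            # normpath strips it; restore it
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # drop fragment
```

Canonical forms let the crawler recognize that two syntactically different URLs name the same page, so it is stored and fetched only once.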
High-level architecture of a scalable universal crawler
Optimize use of network bandwidth
Crawl size
Context graph
Slides © 2007 Filippo Menczer, Indiana University School of Informatics
Bing Liu: Web Data Mining. Springer, 2007
Ch. 8 Web Crawling by Filippo Menczer
Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
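One simple way to prioritize unvisited links is a best-first frontier: each link inherits, as its predicted relevance, the similarity between the topic and the page on which the link was found. The sketch below is a naive best-first crawler under that assumption, not the method of the cited papers; `fetch` is a hypothetical callable returning `(text, links)` or `None`, and cosine similarity over raw term counts is an illustrative choice.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two strings using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def best_first_crawl(topic, seeds, fetch, max_pages):
    """Visit up to max_pages pages, always expanding the best-scored link first.

    fetch is a hypothetical callable: fetch(url) -> (text, links) or None.
    """
    # heapq is a min-heap, so scores are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue                      # soft fail: skip this URL
        text, links = page
        visited.append(url)
        score = cosine(topic, text)       # relevance of the parent page
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited
```

With an in-memory dictionary standing in for the Web, `best_first_crawl("web crawler", ["seed"], pages.get, 4)` follows links found on on-topic pages before links found on off-topic ones.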
Link-cluster conjecture
• Preservation of semantics (meaning) across links
• 1000 times more likely to be on topic if near an on-topic page!
The “link-content” conjecture
• Correlation of lexical (content) and linkage topology
• L(): average link distance
\[ L(q,\delta) \equiv \frac{\sum_{\{p \,:\, path(q,p) \le \delta\}} path(q,p)}{|\{p : path(q,p) \le \delta\}|} \]
• S(): average content similarity to start (topic) page from pages up to distance δ
\[ S(q,\delta) \equiv \frac{\sum_{\{p \,:\, path(q,p) \le \delta\}} sim(q,p)}{|\{p : path(q,p) \le \delta\}|} \]
• Correlation (L,S) = –0.76
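To make the two averages concrete, the sketch below computes L(q, δ) and S(q, δ) on a small in-memory link graph. Several choices here are assumptions for illustration: path(q, p) is taken as the shortest link distance found by breadth-first search, sim(q, p) as cosine similarity over raw term counts, and the start page q itself is left out of the averages.

```python
import math
from collections import Counter, deque


def cosine(a, b):
    """Cosine similarity between two strings using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def distances_within(graph, q, delta):
    """Shortest link distance from q to every page within delta hops (BFS)."""
    dist = {q: 0}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue                      # do not expand beyond delta
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def link_and_content(graph, text, q, delta):
    """Return (L, S): average link distance and average similarity to q."""
    dist = distances_within(graph, q, delta)
    pages = [p for p in dist if p != q]   # assumption: q itself is excluded
    if not pages:
        return 0.0, 0.0
    avg_dist = sum(dist[p] for p in pages) / len(pages)
    avg_sim = sum(cosine(text[q], text[p]) for p in pages) / len(pages)
    return avg_dist, avg_sim
```

Evaluating `link_and_content` for increasing δ from many start pages gives the (L, S) pairs whose correlation the slide reports.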
\[ S = c + (1 - c)\, e^{a L^{b}} \]
[Figure: fitted curves by top-level domain: edu, net, gov, org, com]
• .com has more drift
• Significant difference in a only (α = 0.05)
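The fit S = c + (1 − c) e^{a L^b} can be evaluated directly to see its shape: S equals 1 at L = 0 and, for negative a, decays toward the floor c as link distance grows. The parameter values in the usage note are hypothetical and chosen only to show the curve; they are not the fitted values from the slides.

```python
import math


def similarity_decay(L, a, b, c):
    """Evaluate S = c + (1 - c) * exp(a * L**b) for link distance L >= 0."""
    return c + (1.0 - c) * math.exp(a * L ** b)
```

For example, with the hypothetical values a = -0.5, b = 1.0, c = 0.1, calling `similarity_decay` at L = 0, 1, 2, 3 yields a strictly decreasing sequence that starts at 1 and stays above 0.1.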
Experiment based on 159 ODP topics (Pant & Menczer 2003)
Split ODP URLs between seeds and targets
Add 10 best hubs to seeds for 94 topics
Crawling agent
\[ \lambda_l = \mathrm{nnet}(in_1, \ldots, in_N) \]
\[ in_k = \sum_{w \in D} \frac{\delta(k, w)}{dist(w, l)} \]
[Figure: agent's keyword vector (k_1, ..., k_i, ..., k_n) and neural net; for link l, the input for k_i is the sum of matches of instances of k_i with inverse-distance weighting; the net outputs λ_l; offspring]
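The inputs in_k can be sketched as follows: for each keyword k, every matching word w in the document contributes 1/dist(w, l), so matches close to link l count more. The sketch makes several assumptions: the document is a list of tokens, dist(w, l) is the token offset between w and the link position, and the single tanh unit `link_score` is a hypothetical stand-in for the agent's neural net, whose actual architecture is not given here.

```python
import math


def keyword_inputs(tokens, link_pos, keywords):
    """Compute in_k for each keyword k: sum over matching words of 1/dist(w, l).

    tokens   -- the document D as a list of words
    link_pos -- index in tokens where link l occurs (assumed representation)
    """
    inputs = []
    for k in keywords:
        total = 0.0
        for pos, w in enumerate(tokens):
            # delta(k, w) is 1 when the word matches the keyword, else 0
            if pos != link_pos and w.lower() == k.lower():
                total += 1.0 / abs(pos - link_pos)
        inputs.append(total)
    return inputs


def link_score(inputs, weights, bias=0.0):
    """Hypothetical stand-in for nnet: one tanh unit over the keyword inputs."""
    return math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))
```

For the token list `["web", "crawler", "x", "LINK", "web"]` with the link at index 3, the keyword "web" receives 1/3 + 1 and "crawler" receives 1/2.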
Multithreaded InfoSpiders
(MySpiders)
• Different ways to compute the cost of visiting a document:
– Constant: cost_const = E_0 p_0 / T_max
– Proportional to download time: cost_time = f(cost_const · t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better quality pages!
[Figure residue: TH 0.31 0.99; SYSTEM 0.22; GOVERN 0.19]
• Automate evaluation using edited directories
• Different sources of relevance assessments
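[Figure: crawled set S_c^t and target sets T_d at target depth d = 0, 1, 2, 3]
Target pages:
“recall”: \( |S_c^t \cap T_d| \,/\, |T_d| \)
“precision”: \( |S_c^t \cap T_d| \,/\, |S_c^t| \)
Target descriptions:
“recall”: \( \sum_{p \in S_c^t} \sigma_c(p, D_d) \)
“precision”: \( \sum_{p \in S_c^t} \sigma_c(p, D_d) \,/\, |S_c^t| \)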
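The measures on this slide can be sketched in a few lines, reading S_c^t as the set of pages crawler c has visited by time t, T_d as the target pages at depth d, and σ_c(p, D_d) as the similarity of page p to the target descriptions. That reading, the function names, and the `sim_to_descriptions` callable are assumptions for illustration.

```python
def target_recall(crawled, targets):
    """Fraction of target pages T_d that appear in the crawled set S_c^t."""
    return len(crawled & targets) / len(targets) if targets else 0.0


def target_precision(crawled, targets):
    """Fraction of crawled pages S_c^t that are target pages T_d."""
    return len(crawled & targets) / len(crawled) if crawled else 0.0


def description_recall(crawled, sim_to_descriptions):
    """Sum over crawled pages of their similarity to the target descriptions.

    sim_to_descriptions is a hypothetical callable p -> sigma_c(p, D_d).
    """
    return sum(sim_to_descriptions(p) for p in crawled)


def description_precision(crawled, sim_to_descriptions):
    """Average similarity of crawled pages to the target descriptions."""
    if not crawled:
        return 0.0
    return description_recall(crawled, sim_to_descriptions) / len(crawled)
```

The page-based measures need only URL sets, while the description-based ones give partial credit to pages that are on topic without being listed targets.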
Crawling evaluation framework
• Keywords
• Seed URLs
[Figure: framework architecture; labels: Main; Crawler 1 Logic ... Crawler N Logic; Private Data Structures (limited resource); URL; Common Data Structures; Concurrent Fetch/Parse/Stem Modules; HTTP; Web]
[Plot: average target page recall vs. pages crawled]
Efficiency & scalability
Performance/cost
BreadthFirst + + + +
BFS-1 + + +
BFS-256 + + +
InfoSpiders + + + +
User-agent: *
Disallow:
All crawlers can go anywhere!

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*
Google crawler is allowed everywhere except these paths

User-agent: slurp
Disallow:
Crawl-delay: 2
User-agent: MSNBot
Disallow:
Crawl-delay: 2
Yahoo and MSN/Windows Live are allowed everywhere but should slow down

User-agent: scooter
Disallow:
AltaVista has no limits

# all others
User-agent: *
Disallow: /
Everyone else keep off!
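A crawler can check rules like these before fetching with Python's standard `urllib.robotparser`. The sketch below parses a copy of the example rules from a string instead of downloading them. Two assumptions apply: the host `example.com` is hypothetical, and the trailing `*` of each Googlebot path is dropped because the standard-library parser matches plain path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Copy of the example rules; trailing "*" removed from the Googlebot paths
# because RobotFileParser does prefix matching on paths.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /chl/
Disallow: /uk/
Disallow: /italy/
Disallow: /france/

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /
"""


def load_rules(text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Googlebot is blocked only under the listed paths (example.com is hypothetical).
googlebot_uk = rules.can_fetch("Googlebot", "http://example.com/uk/news.html")
googlebot_home = rules.can_fetch("Googlebot", "http://example.com/about.html")

# slurp may go anywhere but is asked to wait between requests.
slurp_delay = rules.crawl_delay("slurp")

# Any crawler not named above falls under "User-agent: *" and is kept out.
other_allowed = rules.can_fetch("SomeOtherBot", "http://example.com/")
```

A polite crawler would call `can_fetch` before every request and sleep for `crawl_delay` seconds between requests to the same host when a delay is set.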
[Figure: peers A, B, and C with local storage exchanging queries and hits]
Data mining & referral opportunities
Emerging communities
Basic idea: Learn based on prior query/response interactions
More sophisticated learning algorithms do better
The more interactions, the better
1-click configuration of personal crawler and setup of search engine
Search via Firefox browser extension