The Basics of Computer Science: Dr. Manish Kumar Kamboj, Assistant Professor, CSE
Computer Science
Who is the father of Harnaaz Sandhu?
Onomatopoeia
• “The total number of words spoken by the entire human race so far!” written as .txt ~ size of the Web.
– It’s always on
– It is “free”
– It’s (almost) never noticeably congested (though individual sites or access points might be)
– You can get messages to anywhere in the world instantaneously
– You can communicate for free, including voice and video conferencing
– You can stream music and movies
– It is uncensored (in most places) (of course, this can be viewed as good or bad)
Slide 5
Search Engine
• A search engine is a software program that helps locate information stored on a computer system, typically on the WWW.
✓ 1. Crawler based
❖ Create their listings automatically, e.g. Google, Yahoo.
❖ Crawl (or "spider") the web to build a directory of information.
❖ Changes made to a page are picked up automatically.
✓ 2. Human powered
❖ Depend on users to create listings, e.g. keyword submission as on dmoz.org.
❖ The user submits a description of the web page along with keywords.
❖ When searching, only the submitted descriptions are looked at.
• Hybrid search engines combine these two approaches, e.g. looksmart, submitexpress.
Slide 6
Components of Crawler based Search Engine
• 1. Crawler or Spider
✓ Crawls from one web page to another by following hyperlinks, based on some criteria.
✓ Visits sites regularly to look for changes.
• 2. Index or Catalog
✓ Like a huge book containing a copy of every web page that the crawler finds.
✓ Pages become searchable only after they are indexed.
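The two components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB` is a made-up in-memory "web" (no network access), the crawl order is breadth-first, and the index is an inverted index (word to pages), which is one common way to make pages searchable and not necessarily how any particular search engine stores its catalog.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
TOY_WEB = {
    "a.example/home": ("search engines crawl the web",
                       ["a.example/about", "b.example/news"]),
    "a.example/about": ("a crawler follows hyperlinks", ["a.example/home"]),
    "b.example/news": ("the index makes pages searchable", []),
    "c.example/orphan": ("no page links here", []),
}

def crawl(seed, web):
    """Breadth-first crawl: follow hyperlinks outward from the seed page."""
    seen = {seed}
    queue = deque([seed])
    repository = {}  # URL -> stored copy of the page text
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to store
        text, links = web[url]
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository

def build_index(repository):
    """Inverted index: word -> set of URLs whose text contains that word."""
    index = {}
    for url, text in repository.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Return the URLs containing the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))

repository = crawl("a.example/home", TOY_WEB)
index = build_index(repository)
```

Note that `c.example/orphan` is never reached because no crawled page links to it, so its words are not searchable: a page only shows up in results after it has been crawled and indexed.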
[Diagram: user, search queries, index, repository, crawler, WWW]
Slide 7
Challenges faced in Web Crawling
Slide 8
Web Crawler Policies
• Politeness Policy
✓ Do not hamper (overload) sites.
✓ Only crawl allowed pages.
✓ Respect robots.txt (more on this in next slide)
• Robustness Policy
✓ Be immune to spider traps and other malicious behaviour from web servers.
• Parallelization Policy
✓ If the crawler is multithreaded, different threads should not visit the same site.
• Revisit Policy
✓ Decide when to check pages for changes.
✓ If we cover too much, the index will get stale.
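As a rough sketch of how the politeness and parallelization policies might look in code, the example below assumes two hypothetical helpers that are not from the slides: `PolitenessGate`, which enforces a minimum delay between requests to the same host, and `worker_for`, which assigns every URL of a host to the same worker so that two threads never hit one site at once. Times are passed in explicitly so the logic can be checked without sleeping.

```python
import zlib
from urllib.parse import urlparse

def host_of(url):
    """Extract the host part of a URL, e.g. 'example.com'."""
    return urlparse(url).netloc

def worker_for(url, num_workers):
    """Parallelization: all URLs of one host map to the same worker index."""
    return zlib.crc32(host_of(url).encode("utf-8")) % num_workers

class PolitenessGate:
    """Politeness: keep a minimum delay between requests to one host."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # host -> time of the most recent fetch

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before fetching this URL."""
        host = host_of(url)
        if host not in self.last_fetch:
            return 0.0
        return max(0.0, self.last_fetch[host] + self.min_delay - now)

    def record_fetch(self, url, now):
        """Remember when this host was last contacted."""
        self.last_fetch[host_of(url)] = now

gate = PolitenessGate(min_delay_seconds=2.0)
first_wait = gate.wait_time("http://example.com/a", now=10.0)
gate.record_fetch("http://example.com/a", now=10.0)
second_wait = gate.wait_time("http://example.com/b", now=10.5)
```

In a real crawler `now` would come from a clock such as `time.monotonic()`, and the delay would usually be tuned per site; both choices are left open here.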
Slide 9
Robots.txt
User-agent: *
Disallow:
All crawlers…
…can go anywhere!
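A crawler can check these rules with Python's standard `urllib.robotparser` module. The sketch below parses the robots.txt shown above from a string (no network access) and compares it with a second, made-up file that blocks `/private/`; the crawler name `MyCrawler` and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# The file from the slide: an empty Disallow means nothing is off limits.
allow_all = build_parser("User-agent: *\nDisallow:\n")

# Hypothetical variant that keeps all crawlers out of /private/.
block_private = build_parser("User-agent: *\nDisallow: /private/\n")

anywhere_ok = allow_all.can_fetch("MyCrawler", "http://example.com/any/page.html")
private_ok = block_private.can_fetch("MyCrawler", "http://example.com/private/data.html")
public_ok = block_private.can_fetch("MyCrawler", "http://example.com/public/index.html")
```

A polite crawler would call `can_fetch` before every request and skip any URL for which it returns `False`.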
Slide 10
www.microsoft.com/robots.txt
Slide 12
Most Valuable Asset in Today's World
Data
Slide 13
What is knowledge?
RIGHT!
• Artificial Intelligence:
– speech recognition
– Some reasoning; computer beats man at chess
– Privacy and security problems
– Computers can be a pain in the butt
WRONG!
Slide 15
Predicting the future
Slide 16
Information Science and Data Generation Trends
Slide 17
How Much Data Is on the Internet?
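• Soon everything will be recorded and indexed.
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies.
[Figure: scale of data sizes. Prefix labels, largest to smallest: Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Examples placed along the scale, largest to smallest: Everything recorded!, All books (multimedia), All books (words), A movie, A photo, A book.]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli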
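To get a feel for the prefixes on this scale, here is a small helper that expresses a byte count with the largest fitting prefix. It is a sketch that assumes decimal steps of 1000 (kilo = 10^3, mega = 10^6, and so on); the function name `describe_bytes` is made up for this example.

```python
# Decimal prefixes in increasing order; each step is a factor of 1000.
PREFIXES = ["", "kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def describe_bytes(num_bytes):
    """Return (value, prefix) so that value * 1000**step equals num_bytes."""
    value = float(num_bytes)
    step = 0
    while value >= 1000 and step < len(PREFIXES) - 1:
        value /= 1000
        step += 1
    return value, PREFIXES[step]

value, prefix = describe_bytes(4_000_000)
label = f"{value:g} {prefix}bytes"
```

For example, four million bytes comes out as 4 megabytes, and 10^21 bytes lands on the zetta rung of the scale.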
Slide 19
Slide 20
Slide 21
Slide 22
Moore's Law
Slide 23
First Disk 1956
•4 MB
•50x24” disks
•1200 rpm
•100 ms access
•$35k/year rent
1.6 meters
30 MB
Slide 25