
1 Web Search Environments: Web Crawling Metadata using RDF and Dublin Core
Dave Beckett http://purl.org/net/dajobe/
Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/

2 Introduction
Overview of SGs and web crawling
Why WSE, what's new?
Novel results
Future work (or stuff we didn't do) and conclusions

3 Overview
Digital Library community: in the UK, subject-specific gateways (SGs)
Want to improve: scope (more), timeliness (fresh), cost (less)
Stay professional – the "Quality" word
Compete with web search engines – the "Google Test"

4 Human Cataloguing of the Web
Pros: high quality, domain-knowledge selection, subject-specialised, cataloguing done to well-known and developed standards
Cons: expensive, slow; descriptions need to be reviewed regularly to keep them relevant

5 Software Running Web Crawls
Pros: vastly comprehensive (con: too much), can be very up to date
Cons: cannot distinguish "this page sucks" from "this page rocks", indiscriminate, subject to spamming, very general (but…)

6 Combining Web Crawling and High-Quality Description
A solution:
Seed the web crawl from high-quality records
Crawl to other (presumably) good-quality pages
Track the provenance of the crawled pages
Provenance can be used for querying and result ranking
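The seeded crawl above can be sketched as a breadth-first traversal that carries each seed forward as provenance. This is a minimal illustration, not the project's crawler: the link graph, URLs, and depth limit are all made up, and a real crawler would fetch and parse live pages rather than read a dict.

```python
from collections import deque

def crawl_with_provenance(links, seeds, max_depth=2):
    """Breadth-first crawl over a pre-fetched link graph, tagging each
    reached page with the high-quality seed record(s) it came from."""
    provenance = {}  # page URL -> set of seed URLs it was reached from
    queue = deque((url, url, 0) for url in seeds)
    while queue:
        page, seed, depth = queue.popleft()
        seen = provenance.setdefault(page, set())
        if seed in seen:
            continue  # already reached this page from this seed
        seen.add(seed)
        if depth < max_depth:
            for target in links.get(page, []):
                queue.append((target, seed, depth + 1))
    return provenance

# Illustrative link graph: a catalogued page linking onward to other pages.
links = {
    "http://sg.example/catalogued": ["http://example.org/a"],
    "http://example.org/a": ["http://example.org/b"],
}
prov = crawl_with_provenance(links, ["http://sg.example/catalogued"])
```

Because every page carries the set of seeds it was reached from, the result ranking stage can later prefer pages reachable from many high-quality records.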

7 Web Search Environments (WSE) Project
Research by ILRT and later the Resource Discovery Network (RDN)
RDN funds UK SGs (ILRT also had DutchESS)

8 WSE Technologies
Simple Dublin Core (DC) records extracted from SGs
OAI protocol used to collect these records in one place (not required)
Combine web crawler
RDF framework to connect the resource descriptions together
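As a sketch of the OAI-based collection step: harvesting starts with a ListRecords request. The `verb` and `metadataPrefix=oai_dc` parameters are standard OAI-PMH; the base URL below is hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (construction only,
    no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective-harvesting set
    return base_url + "?" + urlencode(params)

url = oai_list_records_url("http://sg.example/oai")
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token until the full record set is collected.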

9 Simple DC Records
Really simple:
Title
Description
Identifier (URI of resource)
Source (URI of record)
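A sketch of one such record expressed as RDF triples. The DC elements namespace is the standard one; the record values and URIs are invented for illustration.

```python
DC = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core namespace

def dc_record_to_triples(record):
    """Turn a simple DC record (a dict) into (subject, predicate, object)
    triples, using the resource's own URI as the subject."""
    subject = record["identifier"]  # URI of the described resource
    return [
        (subject, DC + "title", record["title"]),
        (subject, DC + "description", record["description"]),
        (subject, DC + "source", record["source"]),  # URI of the SG record
    ]

record = {
    "title": "An Example Resource",
    "description": "A short description from the gateway.",
    "identifier": "http://example.org/resource",
    "source": "http://sg.example/record/42",
}
triples = dc_record_to_triples(record)
```

Keeping both the resource URI (identifier) and the record URI (source) is what later lets the crawled pages be joined back to the catalogue.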

10 Information model 1
DC records describe all the resources
Web crawler reads these and returns crawled web pages
These generate a new web-crawled resource

11 Information model 2
Link back to original record(s), plus web page properties
RDF model lets these be connected via page and record URIs
Giving one large RDF graph of the total information
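The merge described above can be sketched with plain triple sets: statements about the same URI combine into one graph simply because the URI is the join key. The `derivedFrom` predicate name below is hypothetical (the project's actual vocabulary is not given in the slides).

```python
def merge_graphs(*triple_sets):
    """Union of (s, p, o) triple sets: statements about the same URI
    merge automatically because the URI is the join key."""
    graph = set()
    for triples in triple_sets:
        graph.update(triples)
    return graph

DERIVED_FROM = "http://wse.example/derivedFrom"  # hypothetical predicate

def records_for_page(graph, page_uri):
    """Follow the provenance link from a crawled page back to the
    catalogue record(s) whose crawl reached it."""
    return {o for s, p, o in graph if s == page_uri and p == DERIVED_FROM}

# Triples from a gateway record and from the crawl of a page it led to.
record_triples = {
    ("http://sg.example/record/42",
     "http://purl.org/dc/elements/1.1/title", "An Example Resource"),
}
crawl_triples = {
    ("http://example.org/page", DERIVED_FROM, "http://sg.example/record/42"),
    ("http://example.org/page",
     "http://purl.org/dc/elements/1.1/title", "A Crawled Page"),
}
graph = merge_graphs(record_triples, crawl_triples)
```

Walking `derivedFrom` links in the merged graph is exactly the provenance navigation the later slides describe.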

12 WSE graph

13 Novel Outcomes?
It is obvious that:
Metadata gathering is not new (Harvest)
Web crawling is not new (Lycos)
Cataloguing is not new (1000s of years)
So what is new?

14 WSE – Areas Not Focused On
I digress…
Gathering data together – not crucial; Combine is a distributed harvester
Full-text indexing – not optimised
Web crawling algorithm – the routes through the web were not selected in a sophisticated way

15 WSE – General Benefits
Connecting separate systems (one less place needed to go)
RDF graph allows more data mixing (not fragile)
Leverages existing systems (Combine, Zebra) and standards (RDF, DC)

16 WSE – Novel Searching
"game theory napster" – zero hits
Cross-subject searching in one system – "gmo"
Can navigate resulting provenance

17 WSE – Gains
Web crawling gains from high-quality human description
SGs gain from an increase in relevant pages
Fresher content than a human-catalogued resource
More focused than a general search engine

18 WSE as a new tool
For subject experts, which includes cataloguers
Gives fast, relevant search (no formal precision/recall analysis)

19 WSE – new areas
Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs
Searching emerging topics is possible ahead of additions to catalogue standards
Helps indicate where new SGs and thesauri are needed

20 WSE – deploying
ILRT WSE
RDN WSE – investigating for the main search system

21 WSE for SGs
Individual SGs – enhancing subject-specific searches: deep/full web crawling of high-quality sites
Granularity of cataloguing and cost: it is better for humans to describe entire sites (or large parts) and let the software do the detailed work on individual pages

22 Future
Improve and target the crawling
Use the SG information with result ranking
Add other relevant data to the graph, such as RSS news
A Semantic Web application

23 Questions? Thank You
Slides: http://ilrt.org/people/cmdjb/talks/tnc/2002/
Project: http://wse.search.ac.uk/

24 References
Combine Web Crawler: http://www.lub.lu.se/combine/
Dublin Core: http://dublincore.org/
ILRT: http://ilrt.org/
RDF: http://www.w3.org/RDF/
Semantic Web: http://www.w3.org/2001/sw/

