Home » Eclipse Projects » SeMantic Information Logistics Architecture (SMILA) » WebCrawler and URL-Parser
WebCrawler and URL-Parser [message #565149] Tue, 24 August 2010 08:31
Andrej Rosenheinrich
Hi,

once again I have some questions about the WebCrawler.

First, how are seeds containing a "#" parsed? It seems to me that everything after a "#" is ignored. That would be a problem, because some sites use this character in the GET options (you can consider this bad style, at least I do, but it works and it is used out there). Ignoring that part of the URL would therefore lead to a completely different page. Can the parser be configured or modified to parse such URLs?
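To illustrate what I mean: with plain java.net.URI, everything after "#" is classified as a fragment, so any crawler that runs seeds through standard URI parsing will silently drop it (the URL below is just a made-up example, not one of my real seeds):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FragmentCheck {
    // Returns the part of the URL after "#", or null if there is none.
    // Standard URI parsing classifies this as a "fragment", which crawlers
    // usually strip before fetching - exactly the information loss I mean.
    static String fragmentOf(String url) throws URISyntaxException {
        return new URI(url).getFragment();
    }

    public static void main(String[] args) throws URISyntaxException {
        // The site treats "#tab=2" as part of its GET options,
        // but the parser sees only a fragment.
        System.out.println(fragmentOf("http://example.com/page?id=1#tab=2"));
    }
}
```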

Second, could you give a bit more information about the crawling models, i.e. how the crawler behaves under each model? For instance with MaxDepth: when I provide several seeds, will the first seed be crawled until the depth limit is reached before the second seed is looked at, or will all seeds be crawled before going deeper?
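For what it's worth, here is roughly the behaviour I would expect from a breadth-first reading of MaxDepth (a sketch with made-up link data, not SMILA code): all seeds sit at depth 0, so every seed is visited before any page at depth 1, and so on up to the limit.

```java
import java.util.*;

public class BfsCrawlSketch {
    // Breadth-first expansion with a depth limit. The "links" map stands in
    // for the links a real crawler would extract from each fetched page.
    static List<String> crawl(Map<String, List<String>> links,
                              List<String> seeds, int maxDepth) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>(seeds);
        Deque<String[]> queue = new ArrayDeque<>(); // entries: {url, depth}
        for (String s : seeds) queue.add(new String[]{s, "0"});
        while (!queue.isEmpty()) {
            String[] cur = queue.poll();
            int depth = Integer.parseInt(cur[1]);
            visitOrder.add(cur[0]);
            if (depth >= maxDepth) continue; // depth limit reached, don't follow links
            for (String next : links.getOrDefault(cur[0], List.of())) {
                if (seen.add(next)) {
                    queue.add(new String[]{next, String.valueOf(depth + 1)});
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links =
                Map.of("a", List.of("a1"), "b", List.of("b1"));
        // Both seeds come first, then their depth-1 links.
        System.out.println(crawl(links, List.of("a", "b"), 1));
    }
}
```

Is this what MaxDepth does, or does it go depth-first per seed instead?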

Third, where can I find more information about the filter format? Without a description it's a bit tricky ;)

Thanks in advance!
Andrej