Eclipse Community Forums: SeMantic Information Logistics Architecture (SMILA)

WebCrawler and URL-Parser [message #554790]

Tue, 24 August 2010 12:31

Andrej Rosenheinrich


Hi,

once again I have some questions about the webcrawler.

First, how are seeds containing a "#" parsed? It seems to me that everything after the "#" is ignored. That would be a problem, because some sites use this character in their GET parameters (you can consider this bad style, at least I do, but it works and it's used out there). Ignoring that information would lead to a completely different page. Can the parser be configured or modified to handle such URLs?

Second, could you give a little more information about the crawling models, i.e. how the crawler behaves with each model? For instance with MaxDepth, when I provide several seeds: will the first seed be crawled until the depth limit is reached before the second seed is looked at, or will all seeds be crawled before going deeper?

Third, where can I find more information about the filter format? Without a description it's a bit tricky ;)

Thanks in advance!
Andrej

Re: WebCrawler and URL-Parser [message #564521 is a reply to message #554790]

Tue, 21 September 2010 16:31

Sebastian Voigt


Hi Andrej,

The # character is treated as the separator between the URL of a page and the anchor (fragment) pointing to a part of that page. So if you use # in a seed URL, the webcrawler will use only the part before the # as the URL.
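To illustrate the point about #, here is a minimal sketch using the standard java.net.URI class. It is not the webcrawler's own parsing code, only a demonstration of general URL syntax; the class name, the helper stripFragment and the example.com seed are made up for this example.

```java
import java.net.URI;

// Hypothetical demo class, not part of SMILA.
class SeedFragmentDemo {

    // Returns the seed without its fragment, i.e. everything before the first '#'.
    static String stripFragment(String seed) {
        int hash = seed.indexOf('#');
        return hash < 0 ? seed : seed.substring(0, hash);
    }

    public static void main(String[] args) {
        // Example seed where a site has put its "options" behind a '#'.
        String seed = "http://www.example.com/index.php#view=list&page=2";
        URI uri = URI.create(seed);

        // Standard URL parsing treats everything after '#' as the fragment,
        // so there is no query component in this seed at all.
        System.out.println("query:    " + uri.getQuery());     // null
        System.out.println("fragment: " + uri.getFragment());  // view=list&page=2

        // The part that remains once the fragment is dropped.
        System.out.println("without fragment: " + stripFragment(seed));
    }
}
```

With a seed like this, the parameters after the # are part of the fragment rather than the query string, which is why they are lost when only the part before the # is used.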

Regarding the CrawlingModels and the filters, please have a look at the page http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler and feel free to ask questions about the documentation and specific details of the configuration.
You can also have a look at the XSD for the DataSourceConnectionConfig of the WebCrawler:

configuration\org.eclipse.smila.connectivity.framework.crawler.web\schemas\WebDataSourceConnectionConfigSchema.xsd

You can also activate logging for the webcrawler (org.eclipse.smila.connectivity.framework.crawler.web); this should help to understand and test the different configuration possibilities.
Please add the following line to log4j.properties:
log4j.logger.org.eclipse.smila.connectivity.framework.crawler.web=DEBUG, file, stdout

Sebastian
