RE: [smila-user] Crawler and link analysis
Hi,
Filters are used in connection with the CrawlScope tag and also with the <Seeds FollowLinks> tag.
In connection with the CrawlScope:
If a link matches the configured CrawlScope, only the Unselect filters are checked.
If a link doesn't match, the Select filters are checked.
In your case the CrawlScope Broad matches every link, so the Select filters are never used.
In connection with the Seeds FollowLinks attribute:
Follow -> If an Unselect filter matches, the link is only analyzed (that is, it will be spidered but not stored in the index). Select filters are not used.
NoFollow -> If an Unselect filter matches, the link will not be spidered!
FollowLinksWithCorrespondingSelectFilter -> Pages that match both "Select" and "Unselect" filters will be indexed, and everything else that matches will be analyzed.
What this means for your case:
If A and B are on the same domain/host, you should use the CrawlScope Domain, Host, or Path.
Google links should not be spidered in this case.
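A minimal sketch of that variant, reusing the element names from your configuration. Type="Host" is an assumption taken from the scope names listed above, and A is still a placeholder for your seed URL:

```xml
<!-- Sketch only: Type="Host" is assumed from the Domain/Host/Path scope names -->
<CrawlScope Type="Host"></CrawlScope>

<Seeds FollowLinks="Follow">
  <!-- A is a placeholder for the seed URL -->
  <Seed> A </Seed>
</Seeds>
```

With a scope like this, google.com falls outside the CrawlScope, so only the Select filters would decide whether it is followed.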
You can also use the FollowLinks="NoFollow" mode and explicitly forbid Google with an Unselect filter.
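As a sketch, that alternative could look like the following. The regular expression .*google.* is only an illustrative pattern, not something taken from the crawler documentation, and A is again a placeholder:

```xml
<CrawlScope Type="Broad"></CrawlScope>

<!-- NoFollow: links matching an Unselect filter are not spidered -->
<Seeds FollowLinks="NoFollow">
  <Seed> A </Seed>
</Seeds>

<Filters>
  <!-- Illustrative pattern; adjust it to the links you want to exclude -->
  <Filter Type="RegExp" Value=".*google.*" WorkType="Unselect"/>
</Filters>
```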
Also, the following line in log4j.properties should produce more logging information about the web crawler:
log4j.logger.org.eclipse.smila.connectivity.framework.crawler.web=DEBUG
Hope this helps.
Sebastian
> -----Original Message-----
> From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Patrick Pekczynski
> Sent: Saturday, August 14, 2010 9:51 AM
> To: smila-user@xxxxxxxxxxx
> Subject: [smila-user] Crawler and link analysis
>
> Dear all,
>
> I played around a bit with the SMILA crawling facilities, especially with the web crawling component.
>
> If I want to crawl a site A, where A has links to B and to google.com:
>
>   A -> B
>   A -> google.com
>
> and I set up a web crawler as follows:
>
> <CrawlScope Type="Broad"></CrawlScope>
>
> <Seeds FollowLinks="Follow">
> <Seed> A </Seed>
> </Seeds>
> <Filters>
> <Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
> </Filters>
>
> I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
>
> I also tried WorkType="Unselect" instead, which, though a bit counterintuitive, is recommended in the crawler documentation.
> But although the crawler should only follow "some matching Unselect filters", it crawls not only B but also google.com.
>
> My question now is whether someone can show me what I am doing wrong, or how to set up such a scenario correctly (starting at A
> and ONLY following links matching some pattern B).
>
> Thanks for your help
>
> Kind regards,
>
> Patrick
>
>
>
>
> --
> Patrick Pekczynski
> Lilienweg 11
> D - 66773 Schwalbach-Elm
> eMail: pekczynski@xxxxxxxxxxxxxx