Eclipse Community Forums
WebCrawler issue using with proxy SMILA 1.2 [message #1057727] Wed, 08 May 2013 09:25
Abhi Mahajan
Hi,

I am using SMILA 1.2 on a 32-bit Windows 7 machine. I was able to follow the steps of the 5 minute tutorial on the Eclipse site up to 'Start indexing job run', but an exception occurs at the step 'Start the crawler'.

My machine is behind a firewall, and I have added the proxy settings to the file \SMILA\configuration\org.eclipse.smila.importing.crawler.web\webcrawler.properties.

When I open "localhost:8080/smila/jobmanager/jobs/" in a browser, the status of the job 'crawlSmilaWiki' is shown as 'FAILED'.

The log file contains the following exception:

org.eclipse.smila.importing.crawler.web.WebCrawlerException: org.apache.http.NoHttpResponseException: The target server failed to respond
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.handleCrawlException(WebCrawlerWorker.java:285)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.crawlLinkRecord(WebCrawlerWorker.java:270)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.initiateCrawling(WebCrawlerWorker.java:186)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.perform(WebCrawlerWorker.java:167)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:55)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


I have attached the log file for further analysis. I would appreciate any help with this issue.

Thanks & Regards,
Abhi

  • Attachment: SMILA.log
    (Size: 157.20KB, Downloaded 49 times)
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1057746 is a reply to message #1057727] Wed, 08 May 2013 10:21
Marco Strack
Hello Abhi,

I had a look at the provided log file, which seemed fine so far.

Can you post the relevant section of the config file here (with the hostname obfuscated)?

Other guesses would be:
* SMILA has not been restarted after the config file update
* the proxy may be of type SOCKS instead of HTTP

regards

Marco
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1057825 is a reply to message #1057746] Thu, 09 May 2013 01:52
Abhi Mahajan
Hi Marco,

Please find attached the webcrawler config file in which I changed the proxy settings. The proxy is of type HTTP, not SOCKS.
I restarted SMILA while testing, and the same exception occurred.

Please let me know if any other changes are needed.

Regards,
Abhi
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1057989 is a reply to message #1057825] Fri, 10 May 2013 06:54
Andreas Schank
Hi Abhi,

I tested SMILA 1.2 with a Squid instance that I installed as a proxy, and I could see in the proxy's log that robots.txt was accessed through the proxy.

Please try another way. Uncomment the proxy settings in your webcrawler.properties file and add the following lines to your SMILA.ini file:
-Dhttp.proxyHost=xxx.xxx.xxx.xxx
-Dhttp.proxyPort=3128
-Dhttps.proxyHost=xxx.xxx.xxx.xxx
-Dhttps.proxyPort=3128

Please replace xxx.xxx.xxx.xxx with the IP address or host name of your proxy host and, if your proxy uses a different port than my Squid does, change the proxyPort values to match your proxy's port number.

Then restart SMILA.
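
To confirm that -D options like the ones above actually reach a JVM, a small standalone check can print them. This is only a sketch and not part of SMILA; the class name ProxyPropertyCheck and the method describe are made up for illustration. It reads nothing but the standard Java system properties named in the lines above.

```java
public class ProxyPropertyCheck {

    // Builds a one-line summary of the JVM proxy properties for a scheme,
    // e.g. "http" reads http.proxyHost and http.proxyPort.
    static String describe(String scheme) {
        String host = System.getProperty(scheme + ".proxyHost");
        String port = System.getProperty(scheme + ".proxyPort");
        if (host == null || host.isEmpty()) {
            return scheme + ": no proxy configured";
        }
        // Report a missing port explicitly instead of guessing a value.
        return scheme + ": " + host + ":" + (port == null ? "(port not set)" : port);
    }

    public static void main(String[] args) {
        System.out.println(describe("http"));
        System.out.println(describe("https"));
    }
}
```

Run it with the same options you put into SMILA.ini, for example: java -Dhttp.proxyHost=xxx.xxx.xxx.xxx -Dhttp.proxyPort=3128 ProxyPropertyCheck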

But I found something else. It seems that I cannot crawl the SMILA wiki via a proxy (no matter which way it is configured) because SMILA claims that robots.txt forbids it. That is not true, since I can crawl it without a proxy. There seems to be a bug in the robots.txt handling, and I will create a bug report, but this is not the same problem as the one you reported above.

Bye,
Andreas
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1058026 is a reply to message #1057825] Fri, 10 May 2013 09:54
Andreas Schank
Added and fixed bug 407732 (https://bugs.eclipse.org/bugs/show_bug.cgi?id=407732)

It seems that you must switch to the trunk in order to crawl properly behind a proxy. Sorry, there seems to be no quick workaround. I tried configuring a port explicitly, but that did not work for me.

Bye
Andreas
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1058210 is a reply to message #1057989] Mon, 13 May 2013 04:45
Venkatesh Channal
Hi,

I think the issue is related, since it is resolved by changing the value returned by getHostAndPort(final URL url) so that the port defaults to 80.

I faced a similar issue, and my machine is behind a proxy as well. Here is what I did.

After downloading the SMILA 1.2 code, I changed webcrawler.properties within the bundle to include proxyHost and proxyPort, but observed that the changes were not reflected in SMILA.application's webcrawler.properties.

The proxy changes had to be made in SMILA.application's webcrawler.properties for the proxy settings to take effect.

The port value is set to -1 in the current build, and this causes the target server not to respond. After changing UriHelper to use a default of 80, the issue was resolved and I am able to crawl http://wiki.eclipse.org/SMILA
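
For anyone wondering where the -1 comes from: java.net.URL.getPort() returns -1 when the URL has no explicit port, and getDefaultPort() returns the protocol default (80 for http). Below is a minimal sketch of that kind of fallback. It is not SMILA's actual UriHelper code; the class PortFallback and its methods are hypothetical names used only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class PortFallback {

    // Parses a URL string, converting the checked exception for brevity.
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(spec, e);
        }
    }

    // Hypothetical helper: uses the explicit port if present,
    // otherwise falls back to the protocol default.
    static int effectivePort(URL url) {
        int port = url.getPort(); // -1 when no port is given in the URL
        return port != -1 ? port : url.getDefaultPort();
    }

    static String hostAndPort(String spec) {
        URL url = parse(spec);
        return url.getHost() + ":" + effectivePort(url);
    }

    public static void main(String[] args) {
        System.out.println(parse("http://wiki.eclipse.org/SMILA").getPort()); // prints -1
        System.out.println(hostAndPort("http://wiki.eclipse.org/SMILA"));     // prints wiki.eclipse.org:80
    }
}
```

Falling back to getDefaultPort() instead of a hard-coded 80 also covers https URLs, which default to 443.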
Re: WebCrawler issue using with proxy SMILA 1.2 [message #1058476 is a reply to message #1058210] Tue, 14 May 2013 06:02
Andreas Weber
Hi,

> On downloading the SMILA 1.2 code, making changes to
> webcrawler.properties to include proxyHost and proxyPort within the
> bundle observed that the change are not reflected in SMILA.application's
> webcrawler.properties.
>
> The proxy changes had to be inside SMILA.application's
> webcrawler.properties for proxy setting.

General hint: Configuration changes always have to be made in SMILA's
"configuration" folder, not in the bundles directly.
When running SMILA in the Eclipse IDE, the configuration folder can be found
in the "SMILA.application" project.
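
One way to see which copy of webcrawler.properties really carries the proxy settings is to load each candidate file and look for the keys. The sketch below is not part of SMILA; the class name CrawlerProxyConfigCheck is made up, and it assumes the key names proxyHost and proxyPort mentioned in the quoted post.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class CrawlerProxyConfigCheck {

    // Parses properties text; lines starting with '#' are comments and are ignored.
    static Properties parse(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // True only if both assumed keys are present and non-blank.
    static boolean hasProxy(Properties props) {
        String host = props.getProperty("proxyHost", "").trim();
        String port = props.getProperty("proxyPort", "").trim();
        return !host.isEmpty() && !port.isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the webcrawler.properties file to inspect
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            System.out.println("proxy configured: " + hasProxy(parse(in)));
        }
    }
}
```

Pointing it at the file in the "configuration" folder and at the copy inside the bundle shows which of the two has active (not commented-out) settings.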

>
> The port value is getting set to -1 in the current build and this is
> causing the target server not responding. On changing the value in
> UriHelper to have default of 80 the issue got resolved and am able to
> crawl http://wiki.eclipse.org/SMILA

After Andreas' fix, crawling with a proxy should now be possible without
code changes, using the nightly build download (or the current trunk when
running in the Eclipse IDE).

Best regards,
Andreas