Eclipse Community Forums
Missing HTTP Proxy setting causing web crawler to fail [message #830251] Tue, 27 March 2012 11:43
Venkatesh Channal
Hi,

The "crawlSmilaWiki" crawler described at http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success fails to complete its run. It seems that the crawler fails in environments where an HTTP proxy is required. Could you please let me know in which file to set the HTTP proxy information?

Regards,
Venky
Re: Missing HTTP Proxy setting causing web crawler to fail [message #830292 is a reply to message #830251] Tue, 27 March 2012 12:49
Juergen Schumacher
Hi,

On 27.03.2012 at 13:44, Venkatesh Channal <forums-noreply@xxxxxxxx> wrote:
> Hi, The "crawlSmilaWiki" crawler mentioned in
> http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success link
> is failing to complete the execution. It seems that in the scenario
> where http proxy setting is required, the crawler is failing. Could you
> please let me know in which file to set the http proxy information?

SMILA does not have its own proxy settings yet, but maybe the standard Java settings will work. I found this description:

http://www.rgagnon.com/javadetails/java-0085.html

You can set these properties in the SMILA.ini file: add two separate lines after the "-vmargs" line, e.g.

....
-vmargs
-Xms40m
-Xmx512m
-XX:MaxPermSize=256m
-Dhttp.proxyHost=myproxyserver.com
-Dhttp.proxyPort=80
-Declipse.ignoreApp=true
....

Please let me know if this works, I cannot check it myself. If not, we will probably have to implement something. It would be nice if you could create an issue in http://bugs.eclipse.org/ in this case, thanks!

Regards,
Juergen
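
As a rough illustration of what the standard Java settings above control: the sketch below sets "http.proxyHost" and "http.proxyPort" programmatically (the same effect as the -D lines in SMILA.ini) and asks the JVM's default java.net.ProxySelector which proxy it would choose. The class name ProxyProbe and the target URL are assumptions made for illustration only. These properties are only guaranteed to affect code that goes through java.net's own proxy selection (for example HttpURLConnection); a library that opens its sockets itself may not consult them at all.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

/** Illustrative helper (not part of SMILA): shows which proxy the JVM would select. */
public class ProxyProbe {

  /** Returns the first proxy the default ProxySelector proposes for the given URI. */
  public static Proxy selectedProxy(String uri) {
    List<Proxy> proxies = ProxySelector.getDefault().select(URI.create(uri));
    return proxies.get(0);
  }

  public static void main(String[] args) {
    // Same effect as -Dhttp.proxyHost=... and -Dhttp.proxyPort=... after -vmargs.
    System.setProperty("http.proxyHost", "myproxyserver.com");
    System.setProperty("http.proxyPort", "80");

    Proxy proxy = selectedProxy("http://wiki.eclipse.org/SMILA");
    System.out.println("type: " + proxy.type());
    if (proxy.address() instanceof InetSocketAddress) {
      InetSocketAddress address = (InetSocketAddress) proxy.address();
      System.out.println("proxy: " + address.getHostString() + ":" + address.getPort());
    }
  }
}
```

If the printed proxy matches the configured host and port, the JVM has picked up the settings; whether a given HTTP library then actually uses them is a separate question.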
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832607 is a reply to message #830292] Fri, 30 March 2012 10:45
Venkatesh Channal
Hi,

Setting the proxy information in the SMILA.ini file did not solve the problem. I have raised a bug as suggested. Bug ID: 375428.

Further update: I tried the things mentioned below, but the problem persists.

While searching around, I found www.eclipse.org/smila/documentation/0.9/SMILA/Documentation/Web_Crawler.html, which contains information about ProxyServer. I have now configured web.xml to include a ProxyServer element.

Sample:

<Process>
  <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="myReferer">
    <UserAgent Name="Crawler" Version="1.0" Description="teddy crawler" Url="www.teddy.com" Email="crawler@teddy.com"/>
    <CrawlingModel Type="MaxDepth" Value="1000"/>
    <CrawlScope Type="Path" />
    <CrawlLimits>
      <!-- Warning: The amount of files returned is limited to 1000 -->
      <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="1000" MaxTimeSec="3600" MaxLengthBytes="100000"/>
      <TimeoutLimits Timeout="100000"/>
      <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
    </CrawlLimits>
    <Proxy>
      <ProxyServer Host="115.112.231.106" Port="80" Login="" Password=""/>
    </Proxy>
    <Seeds FollowLinks="NoFollow">
      <Seed>wiki.eclipse.org/SMILA</Seed>
    </Seeds>
    <!--Filters>
      <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
      <Filter Type="RegExp" Value="^((?!/SMILA).)*$" WorkType="Unselect"/>
    </Filters-->
    <MetaTagFilters>
      <MetaTagFilter Type="Name" Name="robots" Content="noindex,nofollow" WorkType="Unselect"/>
    </MetaTagFilters>
  </WebSite>
</Process>

It works for SMILA 0.8 but not SMILA 1.0.

Other things tried are:

Changing the crawlSmilaWiki Job in jobs.json as:

{
  "name":"crawlSmilaWiki",
  "workflow":"webCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"web",
    "startUrl":"wiki.eclipse.org/SMILA",
    "filter":{
      "urlPrefix":"wiki.eclipse.org/SMILA"
    },
    "proxyHost":"115.112.231.106",
    "proxyPort":"80",
    "Proxy":{
      "ProxyServer":{
        "Host":"115.112.231.106",
        "Port":"80",
        "Login":"",
        "Password":""
      }
    },
    "jobToPushTo":"indexUpdate"
  }
}

I first tried adding a "Proxy" element, and then separate "proxyHost" and "proxyPort" elements.

In both cases the HTTP connection still fails.

The error log in SMILA.log is:


2012-03-30 15:55:01,171 INFO [Component Resolve Thread (Bundle 5) ] internal.HttpServiceImpl - HTTP server started successfully on port 8080.
2012-03-30 15:56:19,222 INFO [qtp30101162-50 ] internal.JobRunEngineImpl - start called for job 'indexUpdate', jobRunMode 'null'
2012-03-30 15:56:19,971 INFO [qtp30101162-50 ] zk.RunStorageZk - Changing job state for job run '20120330-155619222712' for job 'indexUpdate' to state RUNNING while expecting state PREPARING returned result: true
2012-03-30 15:56:19,971 INFO [qtp30101162-50 ] internal.JobRunEngineImpl - started job run '20120330-155619222712' for job 'indexUpdate'
2012-03-30 15:56:29,729 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - start called for job 'crawlSmilaWiki', jobRunMode 'null'
2012-03-30 15:56:30,284 INFO [qtp30101162-51 ] zk.RunStorageZk - Changing job state for job run '20120330-155629729559' for job 'crawlSmilaWiki' to state RUNNING while expecting state PREPARING returned result: true
2012-03-30 15:56:30,416 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - finish called for job 'crawlSmilaWiki', run '20120330-155629729559'
2012-03-30 15:56:30,424 INFO [qtp30101162-51 ] helper.BulkbuilderTaskProvider - Could not find task to be finished for job 'crawlSmilaWiki'.
2012-03-30 15:56:30,626 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - started job run '20120330-155629729559' for job 'crawlSmilaWiki'
2012-03-30 15:56:51,855 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:56:51,855 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:12,863 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:57:12,863 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:33,865 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:57:33,865 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:54,849 WARN [pool-6-thread-1 ] taskworker.DefaultTaskLogFactory - Task adb136b2-6fc2-4c3a-a94f-b584629cb681: Error while executing task adb136b2-6fc2-4c3a-a94f-b584629cb681 in worker org.eclipse.smila.importing.crawler.web.WebCrawlerWorker@1ec4a0c: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect
org.eclipse.smila.importing.crawler.web.WebCrawlerException: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.crawl(SimpleFetcher.java:92)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.invokeFetcherTimed(WebCrawlerWorker.java:281)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.crawlLinkRecord(WebCrawlerWorker.java:234)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.initiateCrawling(WebCrawlerWorker.java:172)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.perform(WebCrawlerWorker.java:156)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:55)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.getResource(SimpleFetcher.java:120)
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.crawl(SimpleFetcher.java:85)
... 14 more
2012-03-30 15:57:55,197 WARN [pool-6-thread-1 ] internal.JobTaskProcessorImpl - A recoverable error 'TaskWorker'('Error while executing task adb136b2-6fc2-4c3a-a94f-b584629cb681 in worker org.eclipse.smila.importing.crawler.web.WebCrawlerWorker@1ec4a0c: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect') occurred in processing of task 'adb136b2-6fc2-4c3a-a94f-b584629cb681' for worker 'webCrawler'
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832632 is a reply to message #832607] Fri, 30 March 2012 11:27
Juergen Schumacher
On 30.03.2012 at 12:45, Venkatesh Channal <forums-noreply@xxxxxxxx> wrote:

> Hi, The problem is not solved by setting the proxy information in the
> SMILA.ini file. I have raised a bug as suggested. Bug ID: id=375428.
>
> Further update - tried things mentioned below, but problem exists.
>
> On searching around, found the following into
> www.eclipse.org/smila/documentation/0.9/SMILA/Documentation/Web_Crawler.html
> that contains information about ProxyServer. Have configured the web.xml
> to now have a ProxyServer element. Sample:
> [...]
> It works for SMILA 0.8 but not SMILA 1.0.

Ok. These are two different crawler implementations. The new one is currently in a "proof-of-concept" state and not yet ready for production. We will improve it soon (hopefully ... sorry, cannot make any promises about the timeline), and we should put proxy support on our todo list for that.

However, the old implementation is still available (see http://wiki.eclipse.org/SMILA/Documentation#Deprecated_Components) and an older "5 minutes" description should still work, e.g.
http://www.eclipse.org/smila/documentation/0.9/SMILA/Documentation_for_5_Minutes_to_Success.html#Configure_and_run_the_Web_crawler
So maybe you can get along with this for now.

> Other things tried are:

No, this will not work. We have to implement proxy support explicitly first, obviously. Or you can extend it yourself, the HTTP connection code of the new crawler is all in org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher. It uses Apache HttpClient 3.1 (yes, we should update to 4.x). Seems to me you would have to set a proper org.apache.commons.httpclient.HostConfiguration object in the used HttpClient instance.

Regards,
Juergen.
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832690 is a reply to message #832632] Fri, 30 March 2012 13:12
Venkatesh Channal
Thank you for your guidance.

I changed the SimpleFetcher constructor as follows:

/** initialize HttpClient with disabled redirects. */
public SimpleFetcher() {
  final HttpClientParams params = new HttpClientParams();
  params.setIntParameter(HttpClientParams.MAX_REDIRECTS, 0);
  // params.setVirtualHost("115.112.233.76");
  System.setProperty("http.proxyHost", "115.112.233.76");
  System.setProperty("http.proxyPort", "80");

  params.setParameter("http.proxyHost", "115.112.233.76");
  params.setParameter("http.proxyPort", "80");
  final MultiThreadedHttpConnectionManager connectionManager = new MultiThreadedHttpConnectionManager();
  connectionManager.getParams().setDefaultMaxConnectionsPerHost(DEFAULT_MAX_CONNECTIONS_PER_HOST);
  connectionManager.getParams().setMaxTotalConnections(DEFAULT_MAX_TOTAL_CONNECTIONS);

  _httpClient = new HttpClient(params, connectionManager);
}


I am still getting the same exception. I don't understand how to set the proxy. Could you show some sample code for configuring it?

Thanks and regards,
Venky
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832735 is a reply to message #832690] Fri, 30 March 2012 14:11
Juergen Schumacher
> Still getting the same exception. I am not understanding how to set the
> proxy. Could you show a sample code on how to do proxy setting?

I *suppose* it should be

public SimpleFetcher() {
  final HttpClientParams params = new HttpClientParams();
  params.setIntParameter(HttpClientParams.MAX_REDIRECTS, 0);

  final MultiThreadedHttpConnectionManager connectionManager = new MultiThreadedHttpConnectionManager();
  connectionManager.getParams().setDefaultMaxConnectionsPerHost(DEFAULT_MAX_CONNECTIONS_PER_HOST);
  connectionManager.getParams().setMaxTotalConnections(DEFAULT_MAX_TOTAL_CONNECTIONS);

  _httpClient = new HttpClient(params, connectionManager);

  HostConfiguration hc = _httpClient.getHostConfiguration();
  hc.setProxy("115.112.233.76", 80);
  _httpClient.setHostConfiguration(hc);
}

But I cannot test it here. However, something similar is done in org.eclipse.smila.connectivity.framework.crawler.web.http.Http.configureClient(); Http is the class that configures the HttpClient in the old crawler implementation.

Regards,
Jürgen.
Re: Missing HTTP Proxy setting causing web crawler to fail [message #834780 is a reply to message #832735] Mon, 02 April 2012 11:36
Venkatesh Channal
Hi,

The changes you suggested enabled the web crawl. Thank you.

As a request, it would be good if the proxy could be set via a configuration file in the future instead of having to change the code.

Regards,
Venky
Re: Missing HTTP Proxy setting causing web crawler to fail [message #834854 is a reply to message #834780] Mon, 02 April 2012 13:32
Juergen Schumacher
On 02.04.2012 at 13:36, Venkatesh Channal <forums-noreply@xxxxxxxx> wrote:
> The changes mentioned by you, enabled the web crawl. Thank you.
>
> As a request, it would be good if the same can be set via a
> configuration file in the future instead of having to change the code.

Yes, of course. It's on our todo list.

Thanks for the feedback,
Juergen.
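
A minimal sketch of what reading the proxy from configuration, rather than hard-coding it, might look like. Everything here is hypothetical: the class name ProxySettings, the property keys "proxyHost" and "proxyPort", and the default port of 80 are assumptions for illustration and not SMILA's actual configuration format. The resulting values would be handed to HostConfiguration.setProxy(host, port) in the SimpleFetcher constructor discussed earlier in the thread; that call appears only as a comment so the sketch compiles without the Commons HttpClient jar.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical holder for proxy settings read from a properties source. */
public final class ProxySettings {
  private final String host;
  private final int port;

  private ProxySettings(String host, int port) {
    this.host = host;
    this.port = port;
  }

  /** Returns null when no proxy host is configured, meaning: connect directly. */
  public static ProxySettings fromProperties(Properties props) {
    String host = props.getProperty("proxyHost", "").trim();
    if (host.isEmpty()) {
      return null;
    }
    // Assumed default: port 80 when "proxyPort" is absent.
    int port = Integer.parseInt(props.getProperty("proxyPort", "80").trim());
    return new ProxySettings(host, port);
  }

  public String getHost() {
    return host;
  }

  public int getPort() {
    return port;
  }

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.load(new StringReader("proxyHost=115.112.233.76\nproxyPort=80\n"));
    ProxySettings proxy = fromProperties(props);
    if (proxy != null) {
      // In SimpleFetcher this is where the proxy would be applied, e.g.:
      // _httpClient.getHostConfiguration().setProxy(proxy.getHost(), proxy.getPort());
      System.out.println(proxy.getHost() + ":" + proxy.getPort());
    }
  }
}
```

Returning null for a missing host keeps the direct-connection behaviour as the default, so installations without a proxy would not need any extra configuration.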