Missing HTTP Proxy setting causing web crawler to fail
Re: Missing HTTP Proxy setting causing web crawler to fail [message #830292 is a reply to message #830251]
Tue, 27 March 2012 12:49
Juergen Schumacher (Member, Messages: 35, Registered: July 2009)
Hi,
Am 27.03.2012, 13:44 Uhr, schrieb Venkatesh Channal
<forums-noreply@xxxxxxxx>:
> Hi, The "crawlSmilaWiki" crawler mentioned in the
> http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success link
> is failing to complete its execution. It seems that the crawler fails in
> the scenario where an HTTP proxy setting is required. Could you please
> let me know in which file to set the HTTP proxy information?
SMILA does not have its own proxy settings yet, but maybe the standard Java
settings will work. I found this description:
http://www.rgagnon.com/javadetails/java-0085.html
You can set these properties in the SMILA.ini file: add two separate lines
after the "-vmargs" line, e.g.
....
-vmargs
-Xms40m
-Xmx512m
-XX:MaxPermSize=256m
-Dhttp.proxyHost=myproxyserver.com
-Dhttp.proxyPort=80
-Declipse.ignoreApp=true
....
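To check whether the JVM actually picked up those flags, the default ProxySelector can be queried: it consults the http.proxyHost/http.proxyPort properties when selecting a proxy for an HTTP URL. A minimal stdlib-only sketch (the host and port are just the example values from the snippet above, set programmatically here to simulate the -D flags):

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxyProbe {
    public static void main(String[] args) throws Exception {
        // Simulates what -Dhttp.proxyHost/-Dhttp.proxyPort in SMILA.ini would set:
        System.setProperty("http.proxyHost", "myproxyserver.com");
        System.setProperty("http.proxyPort", "80");

        // The default ProxySelector reads these properties per request:
        List<Proxy> proxies = ProxySelector.getDefault()
                .select(new URI("http://wiki.eclipse.org/SMILA"));
        InetSocketAddress addr = (InetSocketAddress) proxies.get(0).address();
        System.out.println(proxies.get(0).type() + " proxy at "
                + addr.getHostName() + ":" + addr.getPort());
    }
}
```

Note that this only verifies the JVM-level properties; whether the crawler's HTTP client actually honors them is a separate question.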
Please let me know if this works, as I cannot check it myself. If not, we
will probably have to implement something. In that case it would be nice if
you could create an issue in http://bugs.eclipse.org/, thanks!
Regards,
Juergen
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832607 is a reply to message #830292]
Fri, 30 March 2012 10:45
Venkatesh Channal (Junior Member, Messages: 5, Registered: March 2012)
Hi,
The problem is not solved by setting the proxy information in the SMILA.ini file. I have raised a bug as suggested (Bug ID: 375428).
Further update: I tried the things mentioned below, but the problem persists.
While searching around, I found www.eclipse.org/smila/documentation/0.9/SMILA/Documentation/Web_Crawler.html, which contains information about a ProxyServer element. I have configured web.xml to include a ProxyServer element.
Sample:
<Process>
  <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="myReferer">
    <UserAgent Name="Crawler" Version="1.0" Description="teddy crawler" Url="www.teddy.com" Email="crawler@teddy.com"/>
    <CrawlingModel Type="MaxDepth" Value="1000"/>
    <CrawlScope Type="Path"/>
    <CrawlLimits>
      <!-- Warning: The amount of files returned is limited to 1000 -->
      <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="1000" MaxTimeSec="3600" MaxLengthBytes="100000"/>
      <TimeoutLimits Timeout="100000"/>
      <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
    </CrawlLimits>
    <Proxy>
      <ProxyServer Host="115.112.231.106" Port="80" Login="" Password=""/>
    </Proxy>
    <Seeds FollowLinks="NoFollow">
      <Seed>wiki.eclipse.org/SMILA</Seed>
    </Seeds>
    <!--Filters>
      <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
      <Filter Type="RegExp" Value="^((?!/SMILA).)*$" WorkType="Unselect"/>
    </Filters-->
    <MetaTagFilters>
      <MetaTagFilter Type="Name" Name="robots" Content="noindex,nofollow" WorkType="Unselect"/>
    </MetaTagFilters>
  </WebSite>
</Process>
It works for SMILA 0.8 but not SMILA 1.0.
Other things I tried:
Changing the crawlSmilaWiki job in jobs.json as follows:
{
  "name": "crawlSmilaWiki",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "startUrl": "wiki.eclipse.org/SMILA",
    "filter": {
      "urlPrefix": "wiki.eclipse.org/SMILA"
    },
    "proxyHost": "115.112.231.106",
    "proxyPort": "80",
    "Proxy": {
      "ProxyServer": {
        "Host": "115.112.231.106",
        "Port": "80",
        "Login": "",
        "Password": ""
      }
    },
    "jobToPushTo": "indexUpdate"
  }
}
I tried this first by creating a "Proxy" element and then with separate "proxyHost" and "proxyPort" parameters; the HTTP connection still fails.
The error log in SMILA.log is:
2012-03-30 15:55:01,171 INFO [Component Resolve Thread (Bundle 5) ] internal.HttpServiceImpl - HTTP server started successfully on port 8080.
2012-03-30 15:56:19,222 INFO [qtp30101162-50 ] internal.JobRunEngineImpl - start called for job 'indexUpdate', jobRunMode 'null'
2012-03-30 15:56:19,971 INFO [qtp30101162-50 ] zk.RunStorageZk - Changing job state for job run '20120330-155619222712' for job 'indexUpdate' to state RUNNING while expecting state PREPARING returned result: true
2012-03-30 15:56:19,971 INFO [qtp30101162-50 ] internal.JobRunEngineImpl - started job run '20120330-155619222712' for job 'indexUpdate'
2012-03-30 15:56:29,729 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - start called for job 'crawlSmilaWiki', jobRunMode 'null'
2012-03-30 15:56:30,284 INFO [qtp30101162-51 ] zk.RunStorageZk - Changing job state for job run '20120330-155629729559' for job 'crawlSmilaWiki' to state RUNNING while expecting state PREPARING returned result: true
2012-03-30 15:56:30,416 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - finish called for job 'crawlSmilaWiki', run '20120330-155629729559'
2012-03-30 15:56:30,424 INFO [qtp30101162-51 ] helper.BulkbuilderTaskProvider - Could not find task to be finished for job 'crawlSmilaWiki'.
2012-03-30 15:56:30,626 INFO [qtp30101162-51 ] internal.JobRunEngineImpl - started job run '20120330-155629729559' for job 'crawlSmilaWiki'
2012-03-30 15:56:51,855 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:56:51,855 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:12,863 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:57:12,863 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:33,865 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
2012-03-30 15:57:33,865 INFO [pool-6-thread-1 ] httpclient.HttpMethodDirector - Retrying request
2012-03-30 15:57:54,849 WARN [pool-6-thread-1 ] taskworker.DefaultTaskLogFactory - Task adb136b2-6fc2-4c3a-a94f-b584629cb681: Error while executing task adb136b2-6fc2-4c3a-a94f-b584629cb681 in worker org.eclipse.smila.importing.crawler.web.WebCrawlerWorker@1ec4a0c: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect
org.eclipse.smila.importing.crawler.web.WebCrawlerException: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.crawl(SimpleFetcher.java:92)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.invokeFetcherTimed(WebCrawlerWorker.java:281)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.crawlLinkRecord(WebCrawlerWorker.java:234)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.initiateCrawling(WebCrawlerWorker.java:172)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.perform(WebCrawlerWorker.java:156)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:55)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.getResource(SimpleFetcher.java:120)
at org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher.crawl(SimpleFetcher.java:85)
... 14 more
2012-03-30 15:57:55,197 WARN [pool-6-thread-1 ] internal.JobTaskProcessorImpl - A recoverable error 'TaskWorker'('Error while executing task adb136b2-6fc2-4c3a-a94f-b584629cb681 in worker org.eclipse.smila.importing.crawler.web.WebCrawlerWorker@1ec4a0c: IO error while getting web resource wiki.eclipse.org/SMILA: Connection timed out: connect') occurred in processing of task 'adb136b2-6fc2-4c3a-a94f-b584629cb681' for worker 'webCrawler'
Re: Missing HTTP Proxy setting causing web crawler to fail [message #832632 is a reply to message #832607]
Fri, 30 March 2012 11:27
Juergen Schumacher (Member, Messages: 35, Registered: July 2009)
Am 30.03.2012, 12:45 Uhr, schrieb Venkatesh Channal
<forums-noreply@xxxxxxxx>:
> Hi, The problem is not solved by setting the proxy information in the
> SMILA.ini file. I have raised a bug as suggested. Bug ID: id=375428.
>
> Further update - tried things mentioned below, but problem exists.
>
> On searching around, found the following into
> www.eclipse.org/smila/documentation/0.9/SMILA/Documentation/Web_Crawler.html
> that contains information about ProxyServer. Have configured the web.xml
> to now have a ProxyServer element. Sample:
> [...]
> It works for SMILA 0.8 but not SMILA 1.0.
Ok. These are two different crawler implementations. The new one is currently in a "proof-of-concept" state and not yet ready for production. We will improve it soon (hopefully ... sorry, I cannot make any promises about the timeline), and we should put proxy support on our todo list for that.
However, the old implementation is still available (see http://wiki.eclipse.org/SMILA/Documentation#Deprecated_Components) and an older "5 minutes" description should still work, e.g.
http://www.eclipse.org/smila/documentation/0.9/SMILA/Documentation_for_5_Minutes_to_Success.html#Configure_and_run_the_Web_crawler
So maybe you can get along with this for now.
> Other things tried are:
No, this will not work. We obviously have to implement proxy support explicitly first. Or you can extend it yourself: the HTTP connection code of the new crawler is all in org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher. It uses Apache HttpClient 3.1 (yes, we should update to 4.x). It seems to me you would have to set a proper org.apache.commons.httpclient.HostConfiguration object on the HttpClient instance used.
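As a rough sketch of such an extension (a hypothetical helper, not actual SMILA code): read the standard Java proxy properties and, if a proxy is configured, apply it to the HttpClient. The HttpClient 3.1 wiring itself is shown in a comment so the sketch stays self-contained without the commons-httpclient jar:

```java
// Hypothetical helper one could add near SimpleFetcher: reads the standard
// http.proxyHost/http.proxyPort system properties. If a host is set, the
// HttpClient 3.1 call would then be:
//   client.getHostConfiguration().setProxy(settings.host, settings.port);
public class ProxySettings {
    final String host;
    final int port;

    ProxySettings(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Returns null when no proxy is configured; the port defaults to 80. */
    static ProxySettings fromSystemProperties() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return null;
        }
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        return new ProxySettings(host, port);
    }

    public static void main(String[] args) {
        System.setProperty("http.proxyHost", "myproxyserver.com");
        System.setProperty("http.proxyPort", "80");
        ProxySettings settings = fromSystemProperties();
        System.out.println("proxy: " + settings.host + ":" + settings.port);
    }
}
```

The names ProxySettings and fromSystemProperties are made up for illustration; only HostConfiguration.setProxy(String, int) is part of the actual HttpClient 3.1 API.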
Regards,
Juergen.