NullPointerException while crawling [message #654553]
Wed, 16 February 2011 09:48
SMILANewBee (Member, Messages: 42, Registered: August 2010)
Hello,
many NullPointerExceptions are thrown while crawling. The exception is as follows:
org.eclipse.smila.connectivity.framework.CrawlerException: java.lang.NullPointerException
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:361)
at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:100)
at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:352)
at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:235)
Caused by: java.lang.NullPointerException
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getRecord(WebCrawler.java:571)
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:359)
... 3 more
It seems that _dataReferenceRecords returns null at this point in time. I am providing the code (of the WebCrawler class) because we have changed minor things in it; however, we are sure these changes do not cause the exceptions, since they were already thrown before we made our changes.
private Record getRecord(final Id id) throws InvalidTypeException, IOException, CrawlerException {
  if (_records.containsKey(id)) {
    return _records.get(id);
  } else {
    final Record record = _dataReferenceRecords.get(id);
    final MObject metadata = record.getMetadata();
    IndexDocument indexDocument = null;
    for (final Attribute attribute : _attributes) {
      if (!(attribute.isHashAttribute() || attribute.isKeyAttribute())) {
        if (indexDocument == null) {
          final String url = metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getStringValue();
          indexDocument = deserializeIndexDocument(DigestUtils.md5Hex(url));
        }
        setAttribute(record, indexDocument, attribute);
      }
    }
    if (_log.isDebugEnabled()) {
      _log.debug("Created record for url: "
        + metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getValue());
    }
    _records.put(id, record);
    _dataReferenceRecords.remove(id);
    return record;
  }
}
The cause of the exception is the line:
final Record record = _dataReferenceRecords.get(id);
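For what it is worth, a guard like the following (just a sketch; assuming CrawlerException accepts a message string) would at least fail with a clearer error instead of the NullPointerException:

final Record record = _dataReferenceRecords.get(id);
if (record == null) {
  // the id was never added, or its entry was already removed from the map
  throw new CrawlerException("No record in _dataReferenceRecords for id " + id);
}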
Re: NullPointerException while crawling [message #654564 is a reply to message #654553]
Wed, 16 February 2011 10:29
UNI-HI Stud (Junior Member, Messages: 6, Registered: August 2010, Location: Germany)
I think nobody can answer your question with this little snippet :/ What kind of records are stored in _dataReferenceRecords, etc.?
One possibility:
final Record record = _dataReferenceRecords.get(id);
_dataReferenceRecords.remove(id);
Could it be that, in the case of parallel processing, another process has already deleted that record?
--> Log the deleted records and then compare their ids with the one that throws the exception.
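Something like this could show it (only a sketch; _log is the logger your getRecord() already uses):

if (_log.isDebugEnabled()) {
  // record every removal so the ids can be compared with the failing one
  _log.debug("Removing record for id " + id);
}
_dataReferenceRecords.remove(id);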
Re: NullPointerException while crawling [message #655120 is a reply to message #654827]
Fri, 18 February 2011 12:07
Eclipse User
Originally posted by: juergen.schumacher.attensity.com
Hi,
On 17.02.2011, 11:27, SMILANewBee <nils.thieme@unister.de> wrote:
> We have found the cause of the NullPointerException. It lies in the
> method getNext() of the WebCrawler class. If the method returns an array
> of DataReferences that contains two data references with the same id, a
> NullPointerException is thrown when the CrawlThread tries to load the
> record from the data reference (in the method [I]processDataReferences()[/I]).
> Because the two data references have the same id, the record is removed
> from the internal map of the WebCrawler after the first one is processed,
> so it no longer exists in this map. When the duplicate data reference
> accesses the map again, there is no record for the id anymore, so null is
> returned and a NullPointerException is thrown in further processing.
>
> To prevent this we have filtered out the duplicate data references in
> getNext() in the WebCrawler class so that it returns only unique entries
> (roughly as in the sketch below).
>
> We want to ask if this approach makes sense and if it doesn't cause
> trouble.
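> For reference, the filtering looks roughly like this (simplified sketch,
> assuming DataReference exposes getId(); the actual patch differs in details):
>
>   // in getNext(): keep only the first data reference per id
>   final Map<Id, DataReference> unique = new LinkedHashMap<Id, DataReference>();
>   for (final DataReference reference : references) {
>     if (!unique.containsKey(reference.getId())) {
>       unique.put(reference.getId(), reference);
>     }
>   }
>   return unique.values().toArray(new DataReference[unique.size()]);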
To be honest, I do not know the crawler part of SMILA very well, so I
cannot answer this right now. However, it sounds sensible to me.
It would be great if you could create a Bugzilla issue for this on
https://bugs.eclipse.org/bugs/ and attach your patched code. Then we could
review it and probably commit it to SVN. Thanks!
Regards,
Jürgen.