NullPointerException while crawling [message #654553] Wed, 16 February 2011 04:48
SMILANewBee
Messages: 42
Registered: August 2010
Member
Hello,

many NullPointerExceptions are thrown while crawling. The exception is as follows:

org.eclipse.smila.connectivity.framework.CrawlerException: java.lang.NullPointerException
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:361)
	at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:100)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:352)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:235)
Caused by: java.lang.NullPointerException
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getRecord(WebCrawler.java:571)
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:359)
	... 3 more


It seems that the reference _dataReferenceRecords is null at this point in time. I am providing the code (of the WebCrawler class) because we have changed minor things in it. We are sure that these changes do not cause the exceptions, because the exceptions were already thrown before we made the changes.

private Record getRecord(final Id id) throws InvalidTypeException, IOException, CrawlerException {
    if (_records.containsKey(id)) {
      return _records.get(id);
    } else {
      final Record record = _dataReferenceRecords.get(id);
      final MObject metadata = record.getMetadata();
      IndexDocument indexDocument = null;
      for (final Attribute attribute : _attributes) {
        if (!(attribute.isHashAttribute() || attribute.isKeyAttribute())) {
          if (indexDocument == null) {
            final String url = metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getStringValue();
            indexDocument = deserializeIndexDocument(DigestUtils.md5Hex(url));
          }
          setAttribute(record, indexDocument, attribute);
        }
      }

      if (_log.isDebugEnabled()) {
        _log.debug("Created record for url: "
          + metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getValue());
      }

      _records.put(id, record);
      _dataReferenceRecords.remove(id);
      return record;
    }
  }

The cause of the exception is the line:
final Record record = _dataReferenceRecords.get(id);

[Updated on: Wed, 16 February 2011 04:49]


Re: NullPointerException while crawling [message #654564 is a reply to message #654553] Wed, 16 February 2011 05:29
UNI-HI Stud
Messages: 6
Registered: August 2010
Location: Germany
Junior Member
I don't think anybody can answer your question from this little snippet alone :/ For example, which kinds of records are stored in _dataReferenceRecords?

One possibility:

final Record record = _dataReferenceRecords.get(id);
_dataReferenceRecords.remove(id);

Could it be that, with parallel processing, another process has already deleted that record?
--> Log the ids of the deleted records and then compare them with the id that triggers the exception.
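
A minimal sketch of that logging idea, assuming a plain map stands in for _dataReferenceRecords. The class RemovalLog and its methods are hypothetical and not part of SMILA; they only show how remembering removed ids lets you tell "already removed" apart from "never stored" when a lookup returns null.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not SMILA code): a map that remembers which ids were removed.
class RemovalLog<K, V> {
  private final Map<K, V> entries = new HashMap<K, V>();
  private final Set<K> removedIds = new HashSet<K>();

  synchronized void put(K id, V value) {
    entries.put(id, value);
  }

  synchronized V get(K id) {
    return entries.get(id);
  }

  // Removes the entry and records the id if an entry was actually present.
  synchronized V remove(K id) {
    if (entries.containsKey(id)) {
      removedIds.add(id);
    }
    return entries.remove(id);
  }

  // True if an entry with this id was removed at some earlier point.
  synchronized boolean wasRemoved(K id) {
    return removedIds.contains(id);
  }

  public static void main(String[] args) {
    RemovalLog<String, String> log = new RemovalLog<String, String>();
    log.put("url-1", "record for url-1");
    log.remove("url-1");
    if (log.get("url-1") == null && log.wasRemoved("url-1")) {
      System.out.println("url-1 was already removed before this lookup");
    }
  }
}
```

If the id that triggers the exception shows up as already removed, that points to a second access to the same id rather than to a record that was never stored.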



[Updated on: Wed, 16 February 2011 05:31]


Re: NullPointerException while crawling [message #654614 is a reply to message #654564] Wed, 16 February 2011 08:42
SMILANewBee
Hello,

we already suspected that this exception is thrown because of thread concurrency, but we do not understand this area very well.

There is one detail I don't understand. We assumed that only one type of record exists, but you imply that there are other types of records. Which other types of records exist?
Re: NullPointerException while crawling [message #654827 is a reply to message #654553] Thu, 17 February 2011 05:27
SMILANewBee
We have found the cause of the NullPointerException. It lies in the getNext() method of the WebCrawler class. If the method returns an array of DataReferences that contains two data references with the same id, a NullPointerException is thrown when the CrawlThread tries to load the record from the data reference (in the method processDataReferences()). Because both data references have the same id, loading the first one removes the record from the internal map of the WebCrawler, so the record no longer exists in this map. When the duplicate data reference accesses the map again, there is no record for the id any more, so null is returned and a NullPointerException is thrown during further processing.
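
The mechanism described here can be reproduced in isolation. This is a simplified sketch, not the SMILA code: DuplicateIdRepro, loadAll and the string ids are made up for illustration, and a HashMap stands in for the crawler's internal map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reproduction (not SMILA code) of the get-then-remove pattern.
class DuplicateIdRepro {

  // Looks up each id and then removes it, as getRecord() does with its map.
  static List<String> loadAll(Map<String, String> pending, List<String> ids) {
    List<String> loaded = new ArrayList<String>();
    for (String id : ids) {
      String record = pending.get(id); // null if the id was already removed
      pending.remove(id);
      loaded.add(record);
    }
    return loaded;
  }

  public static void main(String[] args) {
    Map<String, String> pending = new HashMap<String, String>();
    pending.put("url-1", "record for url-1");
    // The same id appears twice, like two data references with the same id.
    List<String> loaded = loadAll(pending, Arrays.asList("url-1", "url-1"));
    System.out.println(loaded);
  }
}
```

The second lookup of "url-1" yields null. In getRecord() the next statement dereferences that value (record.getMetadata()), which is where a NullPointerException would arise under this reading.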

To prevent this, we filter out the duplicate data references in getNext() of the WebCrawler class and return only unique entries.

We would like to ask whether this approach makes sense and whether it could cause any trouble.
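
For illustration, filtering by id could look roughly like the sketch below. It is not the patched SMILA code: DedupSketch and uniqueById are hypothetical names, and the id is obtained through a caller-supplied function because the actual DataReference API is not shown in this thread. The first occurrence of each id is kept and the original order is preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch (not SMILA code): keep only the first element per id.
class DedupSketch {

  static <T, K> List<T> uniqueById(List<T> refs, Function<? super T, ? extends K> idOf) {
    Set<K> seen = new HashSet<K>();
    List<T> unique = new ArrayList<T>();
    for (T ref : refs) {
      // Set.add returns false if the id was seen before, so duplicates are skipped.
      if (seen.add(idOf.apply(ref))) {
        unique.add(ref);
      }
    }
    return unique;
  }

  public static void main(String[] args) {
    // Strings of the form "<id>#<n>" stand in for data references.
    List<String> refs = Arrays.asList("a#1", "b#1", "a#2");
    List<String> unique = uniqueById(refs, s -> s.substring(0, s.indexOf('#')));
    System.out.println(unique);
  }
}
```

This assumes that two data references with the same id are interchangeable, so that dropping the later one loses nothing. If they can carry different content, the filter would hide that difference.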
Re: NullPointerException while crawling [message #655120 is a reply to message #654827] Fri, 18 February 2011 07:07
Eclipse User
Originally posted by: juergen.schumacher.attensity.com

Hi,

On 17.02.2011 at 11:27, SMILANewBee <nils.thieme@unister.de> wrote:
> We have found the cause for the NullPointerException. It lies in the
> method getNext() of the WebCrawler class. If the method returns an array
> with DataReferences that contains two data references with the same id a
> NullPointerException will be thrown after the CrawlThread wants to load
> the record from the data reference (in method processDataReferences()).
> This is because the data references have the same id and the data
> reference will be removed from the internal map of the WebCrawler so the
> data references doesn't exists any more in this map. The duplicate data
> reference wants to access the map again, but there exists no record
> anymore for the id. So null will be returned and in further processing a
> NullPointerException will be thrown.
>
> To prevent this we have filtered out the duplicate data references in
> the getNext() in the WebCrawler class and returns only unique entries.
>
> We want to ask if this approach makes sense and if this approach doesn't
> make trouble.

To be honest, I do not know the crawler part of SMILA very much, so I cannot answer this right now. However, it sounds sensible to me.

It would be great if you could create a Bugzilla issue for this on https://bugs.eclipse.org/bugs/ and attach your patched code. So we could review it and probably commit it to SVN then. Thanks!

Regards,
Jürgen.