Eclipse Community Forums: SeMantic Information Logistics Architecture (SMILA)


NullPointerException while crawling [message #654553]

Wed, 16 February 2011 04:48

Eclipse User

Hello,

many NullPointerExceptions are thrown while crawling. The exception is as follows:

org.eclipse.smila.connectivity.framework.CrawlerException: java.lang.NullPointerException
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:361)
	at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:100)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:352)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:235)
Caused by: java.lang.NullPointerException
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getRecord(WebCrawler.java:571)
	at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:359)
	... 3 more

It seems that the _dataReferenceRecords reference is null at this point in time. Below is the code of the affected method in the WebCrawler class. We have made minor changes to this class, but we are sure that they do not cause these exceptions, because the exceptions were already thrown before we made the changes.

private Record getRecord(final Id id) throws InvalidTypeException, IOException, CrawlerException {
    if (_records.containsKey(id)) {
      return _records.get(id);
    } else {
      final Record record = _dataReferenceRecords.get(id);
      final MObject metadata = record.getMetadata();
      IndexDocument indexDocument = null;
      for (final Attribute attribute : _attributes) {
        if (!(attribute.isHashAttribute() || attribute.isKeyAttribute())) {
          if (indexDocument == null) {
            final String url = metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getStringValue();
            indexDocument = deserializeIndexDocument(DigestUtils.md5Hex(url));
          }
          setAttribute(record, indexDocument, attribute);
        }
      }

      if (_log.isDebugEnabled()) {
        _log.debug("Created record for url: "
          + metadata.getAttribute(FieldAttributeType.URL.value()).getLiteral().getValue());
      }

      _records.put(id, record);
      _dataReferenceRecords.remove(id);
      return record;
    }
  }

The cause of the exception is the line:

final Record record = _dataReferenceRecords.get(id);

[Updated on: Wed, 16 February 2011 04:49] by Moderator

Re: NullPointerException while crawling [message #654564 is a reply to message #654553]

Wed, 16 February 2011 05:29

Eclipse User

I think nobody can answer your question with this little snippet :/ For example, which kinds of records are stored in _dataReferenceRecords?

One possibility:

final Record record = _dataReferenceRecords.get(id);
_dataReferenceRecords.remove(id);

Could it be that, with parallel processing, another process has already deleted that record?
--> Log the deleted records and then compare their ids with the one that throws the exception.
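
The logging idea above could be sketched roughly as follows. This is only an illustration, not SMILA code: RemovalTracker, describeMiss and the generic key/value types are hypothetical names standing in for the crawler's internal map and its Id/Record types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of SMILA): remembers which thread removed which id. */
public class RemovalTracker<K, V> {
  private final Map<K, V> entries = new ConcurrentHashMap<>();
  private final Map<K, String> removedBy = new ConcurrentHashMap<>();

  public void put(K id, V value) {
    entries.put(id, value);
  }

  /** Returns the stored value, or null if there is none for this id. */
  public V get(K id) {
    return entries.get(id);
  }

  /** Removes the entry and records the name of the removing thread. */
  public V remove(K id) {
    V value = entries.remove(id);
    if (value != null) {
      removedBy.put(id, Thread.currentThread().getName());
    }
    return value;
  }

  /** Explains a failed lookup: was the id removed earlier, and by which thread? */
  public String describeMiss(K id) {
    String thread = removedBy.get(id);
    return thread == null
        ? "id " + id + " was never removed (or never stored)"
        : "id " + id + " was already removed by thread " + thread;
  }
}
```

When get(id) returns null, writing describeMiss(id) to the log before failing would show whether the id had been removed earlier and by which thread.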

[Updated on: Wed, 16 February 2011 05:31] by Moderator

Re: NullPointerException while crawling [message #654614 is a reply to message #654564]

Wed, 16 February 2011 08:42

Eclipse User

Hello,

we already suspected that this exception is caused by thread concurrency, but we do not understand this area very well.

There is one detail I don't understand. We assumed that there is only one kind of record, but you imply that there are other kinds. Which other kinds of records exist?

Re: NullPointerException while crawling [message #654827 is a reply to message #654553]

Thu, 17 February 2011 05:27

Eclipse User

We have found the cause of the NullPointerException. It lies in the getNext() method of the WebCrawler class. If the method returns an array of DataReferences that contains two data references with the same id, a NullPointerException is thrown when the CrawlThread tries to load the record for the second one (in the method processDataReferences()). Loading the record for the first data reference removes the entry for that id from the internal map of the WebCrawler. When the duplicate data reference accesses the map again, there is no longer a record for the id, so null is returned and a NullPointerException is thrown during further processing.
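
The mechanism described here can be reproduced in isolation. The sketch below is a simplified stand-in, not the WebCrawler itself: DuplicateIdDemo, pending and load are made-up names, and plain strings replace the Id and Record types.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the crawler's internal map; not SMILA code. */
public class DuplicateIdDemo {
  private final Map<String, String> pending = new HashMap<>();

  public void add(String id, String record) {
    pending.put(id, record);
  }

  /** Mirrors the get-then-remove pattern of getRecord(): a second call for the same id returns null. */
  public String load(String id) {
    String record = pending.get(id);
    pending.remove(id);
    return record;
  }

  public static void main(String[] args) {
    DuplicateIdDemo demo = new DuplicateIdDemo();
    demo.add("id-1", "record for id-1");
    String[] batch = {"id-1", "id-1"}; // the same id appears twice in one batch
    for (String id : batch) {
      String record = demo.load(id);
      // Dereferencing the null from the second lookup is what would throw.
      System.out.println(id + " -> " + (record == null ? "null" : record));
    }
  }
}
```

The first lookup finds the record and removes it; the second lookup for the same id returns null, which matches the failure seen in getRecord() once record.getMetadata() is called.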

To prevent this, we now filter out the duplicate data references in getNext() in the WebCrawler class and return only unique entries.
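
A filter of this kind might look like the following sketch. It is not the actual patch: UniqueById and filter are hypothetical names, the element type is generic rather than DataReference, and it assumes the id type implements equals() and hashCode() consistently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper (not part of SMILA): drops elements whose id was already seen. */
public class UniqueById {
  /** Keeps the first element for each id and preserves the original order. */
  public static <T, K> List<T> filter(List<T> items, Function<T, K> idOf) {
    Set<K> seen = new HashSet<>();
    List<T> unique = new ArrayList<>();
    for (T item : items) {
      // Set.add returns false if the id is already present.
      if (seen.add(idOf.apply(item))) {
        unique.add(item);
      }
    }
    return unique;
  }
}
```

Note that a filter like this only removes duplicates within a single returned batch. If the same id could also appear in two successive getNext() results, that case would need separate handling.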

We would like to ask whether this approach makes sense and whether it could cause any trouble.

Re: NullPointerException while crawling [message #655120 is a reply to message #654827]

Fri, 18 February 2011 07:07

Eclipse User

Originally posted by: juergen.schumacher.attensity.com

Hi,

On 17.02.2011 at 11:27, SMILANewBee <nils.thieme@unister.de> wrote:
> We have found the cause of the NullPointerException. It lies in the getNext() method of the WebCrawler class. If the method returns an array of DataReferences that contains two data references with the same id, a NullPointerException is thrown when the CrawlThread tries to load the record for the second one (in the method processDataReferences()). Loading the record for the first data reference removes the entry for that id from the internal map of the WebCrawler. When the duplicate data reference accesses the map again, there is no longer a record for the id, so null is returned and a NullPointerException is thrown during further processing.
>
> To prevent this, we now filter out the duplicate data references in getNext() in the WebCrawler class and return only unique entries.
>
> We would like to ask whether this approach makes sense and whether it could cause any trouble.

To be honest, I do not know the crawler part of SMILA very well, so I cannot answer this right now. However, it sounds sensible to me.

It would be great if you could create a Bugzilla issue for this on https://bugs.eclipse.org/bugs/ and attach your patched code. Then we could review it and probably commit it to SVN. Thanks!

Regards,
Jürgen.
