Eclipse Community Forums
EMF and Large Scientific Datasets [message #995823] Mon, 31 December 2012 07:44
Gabe Colburn
I've used EMF for a variety of previous projects. Presently I am starting a project where I need to persist data models that contain large multidimensional arrays of scientific data (for example 3D datasets). In the past I've used either the default XML persistence, or CDO to persist to a database. XML persistence is not optimal for such large datasets. Furthermore, it's not obvious to me that there is an ideal way of modeling, mapping, and storing the datasets in a relational database using CDO. Since the arrays are very large, most often only subsets of filtered data will be accessed for processing and visualization instead of the entire dataset.

Are there any recommendations on how to model and persist large multidimensional arrays (without fully loading them except on request, and enable partial loading based on filters) with EMF and related technologies (CDO, Teneo, etc.) and be reasonably fast? I'm wondering how the EMF modeling/persistence gurus would approach this.

An alternate approach I've thought of is to use EMF to model all the data structures that do not involve multidimensional arrays, and create a field with a URI that points to a file that contains the large dataset. For example the datasets could be stored in Hierarchical Data Format 5 (HDF5) files, which would allow efficient access to the data and loading subsets of data. In this case I would persist the main data model using normal methods (XML, CDO, Teneo, etc.), and persist the large datasets separately. At run-time the application would selectively load data by reading the file at the stored URI in the data model (which would have to be accessible).
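As a rough illustration of the partial-loading idea (all class and method names here are hypothetical, and a raw little-endian binary file stands in for HDF5, which would need a proper library such as the HDF Java bindings):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical sketch: the EMF model stores only a URI/path to an external
 * file; slices are read on demand, so the full array is never loaded into
 * memory. A raw little-endian double file stands in for HDF5 here.
 */
public class ExternalGrid {
    private final Path file;      // resolved from the URI stored in the model
    private final int nx, ny, nz; // dataset dimensions

    public ExternalGrid(Path file, int nx, int ny, int nz) {
        this.file = file;
        this.nx = nx; this.ny = ny; this.nz = nz;
    }

    /** Load a single z-plane (nx*ny doubles) without touching the rest. */
    public double[] readPlane(int z) throws IOException {
        long bytesPerPlane = (long) nx * ny * Double.BYTES;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) bytesPerPlane)
                                       .order(ByteOrder.LITTLE_ENDIAN);
            ch.read(buf, z * bytesPerPlane); // seek directly to the slice
            buf.flip();
            double[] plane = new double[nx * ny];
            buf.asDoubleBuffer().get(plane);
            return plane;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny 2x2x2 test file (doubles 0..7), then read back plane z=1.
        Path tmp = Files.createTempFile("grid", ".bin");
        ByteBuffer out = ByteBuffer.allocate(8 * Double.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 8; i++) out.putDouble(i);
        Files.write(tmp, out.array());

        ExternalGrid grid = new ExternalGrid(tmp, 2, 2, 2);
        double[] plane1 = grid.readPlane(1); // doubles 4..7
        System.out.println(plane1[0] + " " + plane1[3]); // prints 4.0 7.0
    }
}
```

With HDF5 the seek-and-slice would be done by the library's hyperslab selection instead of manual offsets, but the shape of the code (model holds a reference, data loads on request) would be the same.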

The other idea I also had would be to design a custom EStore that persists to an HDF5 file. This would be a lot of work however to do correctly.

Thanks for any thoughts on the best approaches.

Re: EMF and Large Scientific Datasets [message #995964 is a reply to message #995823] Mon, 31 December 2012 16:59
Eike Stepper
On 31.12.2012 16:52, Gabe Colburn wrote:
> I've used EMF for a variety of previous projects. Presently I am starting a project where I need to persist data models
> that contain large multidimensional arrays of scientific data (for example 3D datasets). In the past I've used either
> the default XML persistence, or CDO to persist to a database. XML persistence is not optimal for such large datasets.
> Furthermore, it's not obvious to me that there is an ideal way of modeling, mapping, and storing the datasets in a
> relational database using CDO.
It's a (common) misconception that CDO implies O/R mapping (i.e. relational persistence in the end). CDO can integrate
with any type of physical backend storage by means of implementing the IStore interface. Examples are the
ObjectivityStore, the DB4OStore, the MongoDBStore, the LissomeStore and a number of stores that other CDO users have
implemented (and that are not shipped with CDO).

> Since the arrays are very large, most often only subsets of filtered data will be accessed for processing and
> visualization instead of the entire dataset.
>
> Are there any recommendations on how to model and persist large multidimensional arrays (without fully loading them
> except on request, and enable partial loading based on filters) with EMF and related technologies (CDO, Teneo, etc.)
> and be reasonably fast? I'm wondering how the EMF modeling/persistence gurus would approach this.
I don't think that CDO offers out-of-the-box solutions for efficient multidimensional arrays (or lists of lists). But,
depending on your other non-functional requirements, it might serve as a platform that's cheaper to adjust than just
starting from scratch.

> An alternate approach I've thought of is to use EMF to model all the data structures that do not involve
> multidimensional arrays, and create a field with a URI that points to a file that contains the large dataset. For
> example the datasets could be stored in Hierarchical Data Format 5 (HDF5) files, which would allow efficient access to
> the data and loading subsets of data. In this case I would persist the main data model using normal methods (XML, CDO,
> Teneo, etc.), and persist the large datasets separately. At run-time the application would selectively load data by
> reading the file at the stored URI in the data model (which would have to be accessible).
>
> The other idea I also had would be to design a custom EStore that persists to an HDF5 file. This would be a lot of
> work however to do correctly.
Yes, implementing EStore correctly is more complicated than immediately obvious, e.g. containment changes have to be
handled implicitly. CDO does all that already and it *might* be easier to implement CDO's IStore interface for physical
backends (many methods are optional, depending on the set of features you want to support):

http://git.eclipse.org/c/cdo/cdo.git/tree/plugins/org.eclipse.emf.cdo.server/src/org/eclipse/emf/cdo/server/IStore.java
http://git.eclipse.org/c/cdo/cdo.git/tree/plugins/org.eclipse.emf.cdo.server/src/org/eclipse/emf/cdo/server/IStoreAccessor.java

Granted, JavaDocs are less than optimal here but CDO ships with many implementations that can serve as examples.

Cheers
/Eike

----
http://www.esc-net.de
http://thegordian.blogspot.com
http://twitter.com/eikestepper
Re: EMF and Large Scientific Datasets [message #995981 is a reply to message #995964] Mon, 31 December 2012 17:52
Gabe Colburn
Thanks Eike. I'll look more at the IStore and IStoreAccessor interfaces and example implementations.
Re: EMF and Large Scientific Datasets [message #996516 is a reply to message #995823] Wed, 02 January 2013 09:14
Christophe Bouhier
Hi Gabe,

I work on an application that deals with large amounts of time-series data. Currently the data is stored in CDO with an underlying RDBMS. As the time series end up in a single RDBMS table, you can imagine how large it grows (the cdo_objects table also grows substantially). The challenge so far has been to quickly access a part of a time series. For easy entry into the graph we have defined groups of CDO folders and CDO resources, but retrieving a period has meant either iterating over a collection in Java (a very slow brute-force approach) or using CDO queries, which is much faster but depends on the underlying store. I am looking for an alternative, and the idea you propose (an EStructuralFeature of data type URI pointing to an HDF5 entry) sounds very interesting. I would like to cooperate on this and see what the actual implementation would look like. (I am not even sure this impacts CDO; I can imagine the application itself would deal with resolving the URIs. Alternatively, the URI could be treated as an EMF proxy, with a special resolver handling it.)
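For the period-lookup part specifically, iterating the whole collection can usually be avoided if the sample timestamps are kept (or indexed) in sorted order. A minimal sketch of that idea, independent of CDO and with hypothetical names (it assumes unique, sorted timestamps):

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: locate the index window of a time period with
 * binary search over sorted timestamps, so only that slice of the stored
 * series needs to be resolved/loaded. Assumes unique, ascending timestamps.
 */
public class PeriodIndex {
    private final long[] timestamps; // sorted, one per stored sample

    public PeriodIndex(long[] sortedTimestamps) {
        this.timestamps = sortedTimestamps;
    }

    /** Index of the first sample with timestamp >= t. */
    private int lowerBound(long t) {
        int i = Arrays.binarySearch(timestamps, t);
        return i >= 0 ? i : -i - 1; // not found: use the insertion point
    }

    /** Returns {startIndex, endIndexExclusive} of samples in [from, to]. */
    public int[] window(long from, long to) {
        return new int[] { lowerBound(from), lowerBound(to + 1) };
    }

    public static void main(String[] args) {
        PeriodIndex idx = new PeriodIndex(new long[] {10, 20, 30, 40, 50});
        int[] w = idx.window(15, 40); // samples 20, 30, 40
        System.out.println(w[0] + ".." + w[1]); // prints 1..4
    }
}
```

Whether the index lives in the application, in a CDO query, or next to the HDF5 file is a separate design question; the point is only that the period lookup itself can be O(log n) instead of a full scan.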

Please contact me offline if you wish to explore this further.
best regards,
Christophe Bouhier





