Re: [science-iwg] File format for numerical data storage

Hi Alain,

 

We (Diamond Light Source) use HDF5 heavily, as well as many legacy formats (combinations of TIFF and text files, some simple binary files).

 

I really like HDF5, but, sure, like everything, it has its issues.

 

Our main joy is that it is a pretty standard format: there are wrappers for it in many languages (although, as the blog mentions, they tend to use the HDF Group's C library), and it's easy for users to take the data home and open it.

 

We collect a lot of data (I think we are at around 10 petabytes), not all of it HDF5, but some of the larger data collections are. Lazy loading is great for investigating manageable sections of multi-GB datasets.
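To illustrate the lazy loading mentioned above, here is a minimal h5py sketch. The file name "scan.h5" and the dataset path "entry/data" are made up for the example, and the tiny array only stands in for a multi-GB dataset. Opening a dataset gives a handle; data is read from disk only for the region you slice.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for illustration.
lazy_path = os.path.join(tempfile.mkdtemp(), "scan.h5")

# Write a small stack of "frames" standing in for a large detector dataset.
with h5py.File(lazy_path, "w") as f:
    f.create_dataset("entry/data", data=np.arange(10 * 4 * 5).reshape(10, 4, 5))

with h5py.File(lazy_path, "r") as f:
    dset = f["entry/data"]       # a handle to the dataset; no array data read yet
    lazy_shape = dset.shape      # shape comes from metadata only
    frame = dset[3]              # reads only frame 3 into a NumPy array
    roi = dset[2:4, 1:3, 0:2]    # reads only this sub-block
```

Both `frame` and `roi` are ordinary NumPy arrays once sliced, so they remain usable after the file is closed.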

 

We have Python programs that use it with MPI on our cluster, and we are also using the new Single-Writer/Multiple-Reader (SWMR) and virtual dataset features in HDF5 1.10 from Java (SWMR seems a little better tested than the virtual datasets).
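For anyone who has not seen SWMR, a rough sketch of the h5py side of it follows (not our Java code). It assumes h5py built against HDF5 1.10 or later; the file name "swmr.h5" and dataset name "counts" are invented for the example. In real use the reader would be a separate process polling while the writer is still running; here the two steps run one after the other just to show the calls involved.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for illustration.
swmr_path = os.path.join(tempfile.mkdtemp(), "swmr.h5")

# Writer: SWMR needs the newer file format, hence libver="latest".
with h5py.File(swmr_path, "w", libver="latest") as writer:
    counts = writer.create_dataset(
        "counts", shape=(0,), maxshape=(None,), chunks=(100,), dtype="i8"
    )
    writer.swmr_mode = True          # from here on, readers may open the file
    for i in range(3):
        start = counts.shape[0]
        counts.resize((start + 2,))  # grow the dataset by two elements
        counts[start:start + 2] = [2 * i, 2 * i + 1]
        counts.flush()               # make the new data visible to readers

# Reader: normally another process, looping while the writer is active.
with h5py.File(swmr_path, "r", libver="latest", swmr=True) as reader:
    rdset = reader["counts"]
    rdset.refresh()                  # pick up the latest shape and data
    seen = rdset[:]
```

All groups and datasets have to be created before `swmr_mode` is switched on; after that the writer can only append to or modify existing datasets.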

 

As the blog says, there are downsides. Using the wrong chunking for how you want to access the data can kill performance. I haven't noticed many corrupted files; I think SWMR is supposed to prevent this, but not all software supports the library version needed for SWMR yet.
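To make the chunking point concrete, here is a small sketch with invented sizes and names (a 100 x 64 x 64 stack, datasets "by_frame" and "by_pixel", and a helper `chunks_touched` that is not part of h5py). HDF5 reads whole chunks, so the number of chunks a selection overlaps is a rough proxy for the I/O cost.

```python
import os
import tempfile

import h5py


def chunks_touched(selection, chunks):
    """Count chunks overlapped by a box selection given as (start, stop) pairs."""
    n = 1
    for (start, stop), size in zip(selection, chunks):
        n *= (stop - 1) // size - start // size + 1
    return n


# Hypothetical file name and dataset sizes, used only for illustration.
chunk_path = os.path.join(tempfile.mkdtemp(), "chunking.h5")

with h5py.File(chunk_path, "w") as f:
    # One chunk per frame: good for reading whole images.
    by_frame = f.create_dataset(
        "by_frame", shape=(100, 64, 64), dtype="f4", chunks=(1, 64, 64)
    )
    # Chunks spanning all frames over a small tile: good for per-pixel time series.
    by_pixel = f.create_dataset(
        "by_pixel", shape=(100, 64, 64), dtype="f4", chunks=(100, 8, 8)
    )
    frame_chunks = by_frame.chunks
    pixel_chunks = by_pixel.chunks

# Selection 1: the full time series of pixel (10, 10).
series = ((0, 100), (10, 11), (10, 11))
# Selection 2: the whole of frame 5.
one_frame = ((5, 6), (0, 64), (0, 64))

series_cost_by_frame = chunks_touched(series, frame_chunks)
series_cost_by_pixel = chunks_touched(series, pixel_chunks)
frame_cost_by_frame = chunks_touched(one_frame, frame_chunks)
frame_cost_by_pixel = chunks_touched(one_frame, pixel_chunks)
```

With these layouts the pixel time series overlaps 100 chunks in "by_frame" but only 1 in "by_pixel", while a single frame overlaps 1 chunk in "by_frame" and 64 in "by_pixel". Neither layout is wrong in itself; it depends on which access pattern dominates.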

 

We have code that loads HDF5 data as January LazyDatasets, which is nice for writing code against, especially if you are using both h5py and Java, since the LazyDatasets behave in a similar fashion to h5py datasets (just as January datasets are like NumPy arrays).

 

I am really just a general user, but there are others here at Diamond (like the imaging team, who use it with MPI) who could answer more specific questions.

 

Hope some of the above is useful.

 

Jake

 

Dr Jacob Filik

Senior Software Scientist

Tel: +441235 77 8690

 

Diamond Light Source Ltd.

Diamond House

Harwell Science & Innovation Campus

Didcot

Oxfordshire

OX11 0DE

 

 

 

 

 

From: science-iwg-bounces@xxxxxxxxxxx [mailto:science-iwg-bounces@xxxxxxxxxxx] On Behalf Of BERNARD, Alain
Sent: 01 December 2017 15:03
To: Science Industry Working Group <science-iwg@xxxxxxxxxxx>
Subject: [science-iwg] File format for numerical data storage

 

Hi Science !

 

In the frame of new data analytics projects for which we want to store numerical results of analyses, we would like to replace some of our old in-house data formats by a more standard one. We have some experience with HDF5, but also found articles saying that it creates some issues, especially in distributed environments (http://cyrille.rossant.net/moving-away-hdf5/).

As science people, do you have any feedback to provide to us regarding this format? Have you experienced other formats?

Also, we are primarily working in Java + Python so if you have experienced libraries for read/write, I would be happy to have some references -- obviously open-source :)

 

Cheers

Alain

 

 
 
 
 

 

 

