scipp / scippnexus

h5py-like utility for NeXus files with seamless scipp integration

Home Page: https://scipp.github.io/scippnexus/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
h5py hdf5 nexus python

scippnexus's Introduction

Contributor Covenant PyPI badge Anaconda-Server Badge License: BSD 3-Clause DOI

ScippNexus

About

ScippNexus is an h5py-like utility for NeXus files with seamless scipp integration. See the documentation for more details.

Installation

python -m pip install scippnexus

scippnexus's People

Contributors

dependabot[bot], jl-wynen, jokasimr, nvaytet, simonheybrock


scippnexus's Issues

Change `nexus.NXobject` interface to use exceptions rather than warnings

load_nexus relies on warnings when it cannot load certain parts of a file, so that users get something incomplete rather than nothing.

For the new lower-level interface around NXobject this does not seem like the right choice.

  • Refactor to move warning-generating code into functions used by load_nexus, use exceptions in low-level functions shared with NXobject subclasses.
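The split described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual scippnexus code; the error type and helper names are hypothetical, and plain dicts stand in for HDF5 groups:

```python
import warnings


class BadNexusStructureError(Exception):
    """Hypothetical error raised by low-level loaders on malformed files."""


def load_field(group: dict, name: str):
    # Low-level helper shared with NXobject subclasses: raise on problems.
    if name not in group:
        raise BadNexusStructureError(f"required field '{name}' not found")
    return group[name]


def load_nexus_field(group: dict, name: str):
    # High-level entry point: convert exceptions into warnings so users
    # still get a partial (incomplete) result instead of nothing.
    try:
        return load_field(group, name)
    except BadNexusStructureError as error:
        warnings.warn(str(error))
        return None
```

With this structure, NXobject subclasses call `load_field` and let exceptions propagate, while only the top-level `load_nexus` path downgrades them to warnings.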

Docs workflow

See #30 and recent actions.

This is almost ready for publishing or updating docs. One remaining subtlety:

  • If we build docs for an old branch, e.g., an old tag, we may want to use an old package. So we should extend this to take not just a publish flag but a version number? Then this can be used to:

    1. Get correct PyPI package (or use local one if dev is specified).
    2. Publish to correct folder (including version tag).

    This would also need to be able to handle extra folders, and support specifying 'latest' to just use the latest package (and publish into the root).
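The version-to-target mapping sketched above could look like this. Everything here is hypothetical (the helper name, the 'dev'/'latest' conventions, and the folder layout are assumptions, not the actual workflow):

```python
def docs_target(version: str) -> tuple[str, str]:
    """Map a requested docs version to (pip requirement, publish folder).

    Hypothetical helper: 'dev' installs the local checkout, 'latest' uses
    the newest release and publishes into the site root, and anything else
    pins the PyPI package and publishes into a versioned folder.
    """
    if version == 'dev':
        return ('-e .', 'dev')
    if version == 'latest':
        return ('scippnexus', '.')
    return (f'scippnexus=={version}', version)
```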

Improve loading of NXtransformations

Currently, when there is a depends_on in, e.g., an NXdetector, the corresponding transformation is loaded as a scipp affine_transform3. This is fine. However, if there is no depends_on, NXtransformations are just loaded as their raw datasets. The problem is that vital information for the transformations is stored in their attributes. This is cumbersome (hard to use) and currently not supported by scipp.Variable or scipp.DataGroup.

Instead, we should load the transformations in NXtransformations as scipp.Variable with the correct spatial dtype (e.g., rotation3 or translation3). All the code for this exists, since it is used for computing the depends_on transformation chain, but it is not called when loading a plain NXtransformations group.
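Choosing the spatial dtype comes down to inspecting the dataset's attributes. A minimal sketch of that decision, with a plain dict standing in for the HDF5 attribute mapping (the attribute names follow the NeXus NXtransformations definition; the function itself is illustrative, not scippnexus code):

```python
def spatial_dtype(attrs: dict) -> str:
    """Pick the scipp spatial dtype for one NXtransformations entry.

    `attrs` stands in for the dataset's HDF5 attributes; per the NeXus
    standard the relevant keys are 'transformation_type' ('rotation' or
    'translation'), 'vector', and optionally 'offset' and 'depends_on'.
    """
    ttype = attrs['transformation_type']
    if isinstance(ttype, bytes):  # h5py may return attribute strings as bytes
        ttype = ttype.decode()
    if ttype == 'rotation':
        return 'rotation3'
    if ttype == 'translation':
        return 'translation3'
    raise ValueError(f'unknown transformation_type: {ttype!r}')
```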

(Nested) Loading of groups

In scipp, we are considering adding a DataGroup container. This would be similar to Dataset, but without coords and without restricting the dims or shapes of the items. This is thus quite similar to a NeXus "group". We would therefore like to support loading groups in ScippNexus, returning a DataGroup. There are a number of things to consider:

  • #57 changed __getitem__ to return Python scalars instead of scipp.Variable if no shape or unit is given. This was for more convenient storage in a Python dict. For DataGroup, we are currently leaning towards requiring items to have dims and shape. Should we thus undo this change in ScippNexus? Or should DataGroup be more flexible?
  • How to handle errors while loading? In practice these are unfortunately quite frequent, so failing the entire load may not be very useful. One promising approach would be to fall back to a "plain" load as a (nested) DataGroup, since most errors come from "higher-level" logic, such as trying to interpret fields of an NXevent_data or NXdetector group. There are a number of subtleties here, especially implementation-wise, as the current design puts some hurdles in the way.
    • One subproblem is the handling of coords that fail to insert due to a DimensionError. Currently these are skipped with a warning. We could instead return the entire NXdata as DataGroup, but this would likely not be useful in many cases. But not doing that would be inconsistent.
  • Handling of links and references. For example, NXevent_data (or its fields) has two places that reference it in SNS files. I think any attempt to do something "smart" like returning a tree with sharing setup will just cause headaches. I feel we should probably make a clear statement to not support this (i.e., data will simply be loaded twice).
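The fallback idea from the second bullet can be sketched as a recursive load. Plain dicts stand in for both HDF5 groups and scipp.DataGroup, and the `_nx_class` key and inner `specialized_load` are illustrative assumptions only:

```python
def load_group(group: dict) -> dict:
    """Try the class-specific loader; on failure fall back to a plain
    (nested) load of the children, mimicking the proposed DataGroup path."""

    def specialized_load(g: dict) -> dict:
        # Stand-in for higher-level logic (e.g., NXevent_data assembly)
        # that may fail on files with bad structure.
        if g.get('_nx_class') == 'NXevent_data' and 'event_time_zero' not in g:
            raise KeyError('required field event_time_zero not found')
        return dict(g)

    try:
        return specialized_load(group)
    except Exception:
        # Fallback: load each child recursively as a plain group.
        return {key: load_group(child) if isinstance(child, dict) else child
                for key, child in group.items()}
```

Note that in this sketch the fallback silently succeeds; the real design question above is how (and whether) to surface the original failure.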

Avoid fallback when loading NXlog or NXdetector without data?

We currently get a ton of warnings when loading files without real data (see below). In particular after #172 we may be able to avoid some of those.

  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/T0_chopper/rotation_speed/value'; setting unit as 'dimensionless'
  warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/band_chopper/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/band_chopper/rotation_speed/value'; setting unit as 'dimensionless'
  warnings.warn(
CPU times: user 817 ms, sys: 1.93 s, total: 2.75 s
Wall time: 506 ms
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_bunker/monitor_bunker_events as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_bunker as NXmonitor: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_cave/monitor_cave_events as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_cave as NXmonitor: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/overlap_chopper/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/overlap_chopper/rotation_speed/value'; setting unit as 'dimensionless'
  warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/polarizer/rate as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/pulse_shaping_chopper1/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/pulse_shaping_chopper1/rotation_speed/value'; setting unit as 'dimensionless'
  warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/pulse_shaping_chopper2/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/pulse_shaping_chopper2/rotation_speed/value'; setting unit as 'dimensionless'
  warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/sans_detector/sans_event_data as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/sans_detector as NXdetector: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/source/current as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
  warnings.warn(msg)

More selective fallback loader

Currently, the fallback loader in NXobject catches (nearly) all exceptions from loaders for concrete classes and uses the fallback. This is intended to allow loading files with partially bad structure. But it also hides user errors like a bad index (wrong dim, bad slice, etc.).

We should distinguish between errors originating in the file structure and error originating from the user/caller. Only the former should trigger the fallback.
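The distinction can be encoded in the exception types: only a dedicated structure-error type triggers the fallback, while everything else (bad index, wrong dim) propagates to the caller. A minimal sketch, with hypothetical names:

```python
class NexusStructureError(Exception):
    """Raised when the file itself violates expectations (triggers fallback)."""


def load_with_fallback(load, fallback, index):
    # Only file-structure problems trigger the fallback; user errors such
    # as a bad index (IndexError, TypeError) propagate unchanged.
    try:
        return load(index)
    except NexusStructureError:
        return fallback(index)
```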

`FutureWarning` about elementwise comparison

/opt/anaconda3/envs/scippneutron/lib/python3.8/site-packages/scippneutron/file_loading/nxobject.py:166: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self.dims == [] and shape == [1]:
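The warning fires because the left-hand side of the comparison can be a NumPy array, turning `== []` into an elementwise comparison. Comparing lengths instead sidesteps it; a minimal sketch (the helper name is hypothetical, not the scippnexus fix):

```python
import numpy as np


def is_scalar_like(dims, shape) -> bool:
    # `dims == []` triggers NumPy's elementwise-comparison FutureWarning
    # when `dims` is an ndarray; a length check avoids the ambiguity.
    return len(dims) == 0 and list(shape) == [1]
```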

General benchmarks / profiling

In the current implementation, no major effort was put into optimization. The implementation is "naive" in most cases, which might, e.g., result in repeated or redundant calls to h5py.

We should profile ScippNexus for the "typical" cases, i.e., files with hundreds of groups and thousands of datasets. Attention should be paid not just to loading large datasets, but first and foremost to the "overhead" of dealing with many small file contents.

  • Is the "overhead" significant?
  • Can it be reduced?
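A simple way to answer these questions is to wrap a load in the standard-library profiler and sort by cumulative time, which makes per-call overhead (e.g., in h5py) visible. A sketch, where `load` stands in for something like `snx.File(filename)[...]`:

```python
import cProfile
import io
import pstats


def profile_load(load, *args):
    """Run `load(*args)` under cProfile and return (result, stats report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = load(*args)
    profiler.disable()
    buffer = io.StringIO()
    # Sort by cumulative time to surface overhead in repeated small calls.
    pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(20)
    return result, buffer.getvalue()
```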

Consider interpreting naming conventions from the Nexus format

See https://manual.nexusformat.org/datarules.html?highlight=uncertainties#rules-for-storing-data-items-in-nexus-files, specifically the "Reserved suffixes", e.g., _mask is something we could handle:

Reserved suffixes

When naming a field, NeXus has reserved certain suffixes to the names so that a specific meaning may be attached. Consider a field named DATASET, the following table lists the suffixes reserved by NeXus.

  • _end (NXtransformations): end points of the motions that start with DATASET
  • _errors (NXdata): uncertainties (a.k.a. errors)
  • _increment_set (NXtransformations): intended average range through which the corresponding axis moves during the exposure of a frame
  • _indices (NXdata): integer array that defines the indices of the signal field which need to be used in DATASET in order to reference the corresponding axis value
  • _mask: field containing a signal mask, where 0 means the pixel is not masked. If required, bit masks are defined in NXdetector pixel_mask.
  • _set: target value of DATASET
  • _weights: divide DATASET by these weights [4]
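Recognizing these suffixes amounts to a longest-match split on the field name. A sketch of how ScippNexus could pair, e.g., `data_errors` with `data` (the helper is illustrative, not existing API; note `_increment_set` must be checked before `_set`):

```python
# Reserved suffixes from the NeXus data rules, ordered so that the longer
# '_increment_set' is matched before its substring '_set'.
RESERVED_SUFFIXES = ('_end', '_errors', '_increment_set', '_indices',
                     '_mask', '_set', '_weights')


def split_reserved_suffix(name: str):
    """Split a field name into (base, suffix), e.g. 'data_errors' ->
    ('data', '_errors'); returns (name, None) if no reserved suffix applies."""
    for suffix in RESERVED_SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[:-len(suffix)], suffix
    return name, None
```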

Writing NeXus files based on application definitions

Follow-up to #63. Basically, given a data structure such as a scipp.DataArray and a NeXus application definition, we want to create/write groups. Currently NXobject provides

def create_class(self, name: str, nx_class: Union[str, type]) -> NXobject:

We can consider extending this with support for an application definition:

with snx.File(name, definition=NXcanSAS) as f:
  group = f[path]
  group.create_class(name, definition=SASdata, data=my_data_array)

Here definition provides a key for looking up the child strategy via group._strategy (which was set up from the NXcanSAS root definition). The child strategy must then provide everything necessary for writing the group and its content (attributes, fields and field attributes, child groups, ...) and how these relate to properties of the data.

Design-wise, one key aspect to address is how to handle recursion. Should the method on NXobject be allowed to deal with this, i.e., may the strategy write an entire subtree, or should this be handled in another way that avoids dealing with the tree inside the strategy?

An alternative to explore is whether NXobject.__setitem__ could be generalized. Currently it only supports creation of fields (from scipp.Variable), since scipp.DataArray does not contain enough information to create a NeXus group. However, an application definition might provide a wrapper for this?

group['sasdata01'] = SASdata(my_data_array)

Need to figure out whether the dual purpose of SASdata as a definition/strategy for loading and as a wrapper for data is a reasonable design.
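The wrapper side of that dual purpose could be as small as a dataclass that carries the data plus the attributes the definition mandates. Everything below is a hypothetical sketch (names, attributes, and the dict-as-group stand-in are assumptions, not the NXcanSAS implementation):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SASdata:
    """Hypothetical wrapper pairing data with its application-definition
    role, so a generalized __setitem__ would know how to write the group."""
    data: Any
    nx_class: str = 'NXdata'
    canSAS_class: str = 'SASdata'


def write_group(parent: dict, name: str, wrapped: SASdata) -> None:
    # A plain dict stands in for an HDF5 group; a real implementation
    # would create the group and set attributes/fields via h5py.
    parent[name] = {
        'attrs': {'NX_class': wrapped.nx_class,
                  'canSAS_class': wrapped.canSAS_class},
        'data': wrapped.data,
    }
```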

Should loading event data no longer create weights with variances?

After scipp/scipp#2895, broadcasting of variances will no longer be supported. This implies that the "standard" paradigm of handling uncertainties will not be feasible any more. This further implies that there is limited use to carrying uncertainties of "counts", since uncertainties can simply be computed later on.

We should therefore consider avoiding the overhead of creating variances for the weights when an NXdetector with NXevent_data is loaded. This would save both memory and compute resources.
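The "compute uncertainties later" argument can be made concrete with plain NumPy: for Poisson-distributed counts, the standard deviations are just the square root of the histogrammed weights, so per-event variances of all-ones weights only double the memory. A sketch (the helper is illustrative, not scippnexus code):

```python
import numpy as np


def histogram_counts(event_weights: np.ndarray, bins: np.ndarray,
                     event_coords: np.ndarray):
    """Histogram event weights without carrying per-event variances;
    uncertainties are derived afterwards as sqrt(counts)."""
    counts, _ = np.histogram(event_coords, bins=bins, weights=event_weights)
    stddevs = np.sqrt(counts)  # Poisson uncertainties, computed on demand
    return counts, stddevs
```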

Broken release workflow?

With the recent change, this happened when releasing 0.4.1:
[screenshot of the failed workflow run omitted]

Is it skipping the second docs build step if the first one gets skipped?

New API: remaining tasks and steps

After #117, the following remains to be done:

  • Consider restoring handling of NXevent_data fields embedded in NXmonitor or NXdetector. This is used by SNS files.
  • Refactor/update docs. See #125
  • Release notes.
  • Release with new API in scippnexus.v2
  • Add deprecation warning in old API?
  • Make new API the default. See #158.
  • Remove old API.
  • Add tool for resolving depends_on chains.
  • Refactor existing pieces for executing depends_on chains.
  • Add tool for converting raw NXoff_geometry and NXcylindrical_geometry to per-detector "shapes".

Slicing with a single index raises error

Example:

import scippnexus as snx
from scippnexus import data

filename = data.get_path('PG3_4844_event.nxs')
f = snx.File(filename)

data = f['entry/bank103']

data['x_pixel_offset', 0]

raises

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 data['x_pixel_offset', 0]

File ~/code/nxus/jupyter/scippnexus/nxobject.py:315, in NXobject.__getitem__(self, name)
    312 def __getitem__(
    313         self,
    314         name: NXobjectIndex) -> Union['NXobject', Field, sc.DataArray, sc.Dataset]:
--> 315     return self._get_child(name, use_field_dims=True)

File ~/code/nxus/jupyter/scippnexus/nxobject.py:307, in NXobject._get_child(self, name, use_field_dims)
    305     else:
    306         return _make(item)
--> 307 da = self._getitem(name)
    308 if (t := self.depends_on) is not None:
    309     da.coords['depends_on'] = t if isinstance(t, sc.Variable) else sc.scalar(t)

File ~/code/nxus/jupyter/scippnexus/nxdata.py:149, in NXdata._getitem(self, select)
    148 def _getitem(self, select: ScippIndex) -> sc.DataArray:
--> 149     signal = self._signal[select]
    150     if self._errors_name in self:
    151         stddevs = self[self._errors_name][select]

File ~/code/nxus/jupyter/scippnexus/nxobject.py:185, in Field.__getitem__(self, select)
    183 shape = list(self.shape)
    184 for i, ind in enumerate(index):
--> 185     shape[i] = len(range(*ind.indices(shape[i])))
    187 variable = sc.empty(dims=self.dims,
    188                     shape=shape,
    189                     dtype=self.dtype,
    190                     unit=self.unit)
    192 # If the variable is empty, return early

AttributeError: 'int' object has no attribute 'indices'

Note that data['x_pixel_offset', :10] works as expected.
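The traceback shows the cause: the shape-computation loop calls `ind.indices(...)` on every index, but an integer has no `indices` method (only slices do), and an integer index should drop the dimension rather than keep it. A sketch of the fix, mirroring the failing loop (not the actual scippnexus implementation):

```python
def sliced_shape(shape, index):
    """Compute the output shape for a tuple of slices/ints over `shape`.

    Integer indices remove their dimension; slices are resolved via
    slice.indices(), which is what the original loop assumed for all cases.
    """
    out = []
    for i, size in enumerate(shape):
        ind = index[i] if i < len(index) else slice(None)
        if isinstance(ind, int):
            continue  # an integer index drops this dimension
        out.append(len(range(*ind.indices(size))))
    return out
```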
