scipp / scippnexus
h5py-like utility for NeXus files with seamless scipp integration
Home Page: https://scipp.github.io/scippnexus/
License: BSD 3-Clause "New" or "Revised" License
We have always intended to implement label-based indexing, as in Scipp. This is required in particular when loading multiple groups at once, where positional indexing is often not meaningful.
In fact, we may consider the current behavior a bug, or at least a discrepancy from how scipp.DataGroup behaves: when loading a tree of groups, the dimension length may be ambiguous (indicated by None). ScippNexus nevertheless accepts positional indices, so we get inconsistent behavior:
```python
with snx.File(name) as f:
    dg1 = f['time', 10:20]      # may return some zero-length subgroups, if indices out of range
    dg2 = f[()]['time', 10:20]  # raises if sizes is {'time': None, ...}
```
See scipp/scipp#3371.
https://scipp.github.io/scippnexus/user-guide/application-definitions.html#Writing-files explains the "advanced" method, but we lack docs for the pedestrian way, e.g., for scn.load and scn.load_nexus, and for scippnexus.v2.
Check the code, and add a test.
Also check if there are existing pieces for executing depends_on chains that need refactoring.
/opt/anaconda3/envs/scippneutron/lib/python3.8/site-packages/scippneutron/file_loading/nxobject.py:166: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self.dims == [] and shape == [1]:
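The warning comes from comparing a list against a NumPy array, which NumPy evaluates elementwise. A minimal sketch of a check that avoids this (the helper name is hypothetical, not ScippNexus code):

```python
import numpy as np

def is_scalar_like(dims, shape) -> bool:
    # Convert shape to a plain list before comparing, so NumPy's
    # elementwise comparison semantics never come into play.
    return len(dims) == 0 and list(shape) == [1]

print(is_scalar_like([], np.array([1])))  # True
print(is_scalar_like(['x'], [2]))         # False
```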
After scipp/scipp#2895, broadcasting of variances will no longer be supported. This implies that the "standard" paradigm of handling uncertainties will not be feasible any more. This further implies that there is limited use to carrying uncertainties of "counts", since uncertainties can simply be computed later on.
We should therefore consider avoiding the overhead of creating variances for the weights when an NXdetector with NXevent_data is loaded. This would save both memory and compute resources.
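Since event weights are counts, their uncertainties follow Poisson statistics and can be derived on demand instead of being stored. A minimal numpy sketch of the idea:

```python
import numpy as np

# Event weights loaded as plain counts, without variances.
counts = np.array([4.0, 9.0, 16.0])

# For Poisson-distributed counts, stddev = sqrt(counts), so the
# uncertainties can be computed later, only when actually needed.
stddevs = np.sqrt(counts)
print(stddevs)  # [2. 3. 4.]
```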
See https://manual.nexusformat.org/datarules.html?highlight=uncertainties#rules-for-storing-data-items-in-nexus-files, specifically the "Reserved suffixes"; _mask, for example, is something we could handle:
Reserved suffixes
When naming a field, NeXus has reserved certain suffixes to the names so that a specific meaning may be attached. Consider a field named DATASET; the following table lists the suffixes reserved by NeXus.
suffix | reference | meaning
---|---|---
_end | NXtransformations | end points of the motions that start with DATASET
_errors | NXdata | uncertainties (a.k.a., errors)
_increment_set | NXtransformations | intended average range through which the corresponding axis moves during the exposure of a frame
_indices | NXdata | Integer array that defines the indices of the signal field which need to be used in the DATASET in order to reference the corresponding axis value
_mask | | Field containing a signal mask, where 0 means the pixel is not masked. If required, bit masks are defined in NXdetector pixel_mask.
_set | | Target value of DATASET
_weights | | divide DATASET by these weights [4]
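As a sketch of what handling these suffixes could look like, the helper below (hypothetical, not part of ScippNexus) pairs reserved-suffix fields with their base dataset:

```python
# Reserved suffixes from the NeXus data rules table above.
RESERVED_SUFFIXES = ('_end', '_errors', '_increment_set', '_indices',
                     '_mask', '_set', '_weights')

def find_suffix_fields(names):
    """Map each base dataset name to the reserved-suffix fields present."""
    pairs = {}
    for name in names:
        for suffix in RESERVED_SUFFIXES:
            base = name[:-len(suffix)]
            if name.endswith(suffix) and base in names:
                pairs.setdefault(base, []).append(name)
    return pairs

fields = {'data', 'data_errors', 'data_mask', 'time'}
print(find_suffix_fields(fields))
```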
I wanted to test Python 3.12 for other packages, and then found that scippnexus uses dateutil.parser, which throws a DeprecationWarning in Python 3.12. But the copier answers say it supports up to Python 3.12, so I was wondering whether it was tested with Python 3.12 and whether I should just pin some dependencies...
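One hedged way to drop the dateutil.parser dependency for ISO 8601 timestamps is datetime.fromisoformat from the standard library (which since Python 3.11 also accepts 'Z' and most other ISO 8601 variants); whether it covers every format ScippNexus encounters would need checking:

```python
from datetime import datetime

# Parse an ISO 8601 timestamp with the standard library only;
# no dateutil, hence no DeprecationWarning on Python 3.12.
ts = datetime.fromisoformat('2023-01-15T12:30:00+00:00')
print(ts.year, ts.month, ts.utcoffset())  # 2023 1 0:00:00
```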
See #30 and recent actions.
This is almost good enough to let us publish or update docs. One remaining subtlety: if we build docs for an old branch, e.g., an old tag, we may want to use an old package. So we should extend this to take not simply a publish flag, but a version number? Then this can be used, e.g., when dev is specified. This would also need to handle extra folders, and support specifying 'latest' to just use the latest package (and publish into the root).
Example:
```python
from scippnexus import data
filename = data.get_path('PG3_4844_event.nxs')

import scippnexus as snx

f = snx.File(filename)
data = f['entry/bank103']
data['x_pixel_offset', 0]
```
raises
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 data['x_pixel_offset', 0]
File ~/code/nxus/jupyter/scippnexus/nxobject.py:315, in NXobject.__getitem__(self, name)
312 def __getitem__(
313 self,
314 name: NXobjectIndex) -> Union['NXobject', Field, sc.DataArray, sc.Dataset]:
--> 315 return self._get_child(name, use_field_dims=True)
File ~/code/nxus/jupyter/scippnexus/nxobject.py:307, in NXobject._get_child(self, name, use_field_dims)
305 else:
306 return _make(item)
--> 307 da = self._getitem(name)
308 if (t := self.depends_on) is not None:
309 da.coords['depends_on'] = t if isinstance(t, sc.Variable) else sc.scalar(t)
File ~/code/nxus/jupyter/scippnexus/nxdata.py:149, in NXdata._getitem(self, select)
148 def _getitem(self, select: ScippIndex) -> sc.DataArray:
--> 149 signal = self._signal[select]
150 if self._errors_name in self:
151 stddevs = self[self._errors_name][select]
File ~/code/nxus/jupyter/scippnexus/nxobject.py:185, in Field.__getitem__(self, select)
183 shape = list(self.shape)
184 for i, ind in enumerate(index):
--> 185 shape[i] = len(range(*ind.indices(shape[i])))
187 variable = sc.empty(dims=self.dims,
188 shape=shape,
189 dtype=self.dtype,
190 unit=self.unit)
192 # If the variable is empty, return early
AttributeError: 'int' object has no attribute 'indices'
Note that data['x_pixel_offset', :10] works as expected.
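The failure is that slice.indices() exists only on slice objects, while an integer index (as in data['x_pixel_offset', 0]) reaches the same code path. A hedged sketch of a fix that normalizes both cases (function name hypothetical):

```python
def selection_length(ind, dim_length):
    """Length of the selection along one dimension, for int or slice indices."""
    if isinstance(ind, int):
        # Integer indexing selects a single element (and drops the dimension);
        # calling .indices() only on actual slices avoids the AttributeError.
        return 1
    return len(range(*ind.indices(dim_length)))

print(selection_length(0, 10))                 # 1
print(selection_length(slice(None, 10), 100))  # 10
```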
Sometimes, when computing a pipeline result, the whole thing hangs. When I kill it, the error message is
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[8], line 1
----> 1 results = direct_beam(pipelines=[pipeline_full, pipeline_bands], I0=I0, niter=6)
2 # Unpack the final result
3 iofq_full = results[-1]['iofq_full']
File ~/code/sans/jupyter/esssans/direct_beam.py:104, in direct_beam(pipelines, I0, niter)
101 for it in range(niter):
102 print("Iteration", it)
--> 104 iofq_full = pipeline_full.compute(BackgroundSubtractedIofQ)
105 iofq_slices = pipeline_bands.compute(BackgroundSubtractedIofQ)
107 if per_layer:
File ~/code/sans/jupyter/sciline/pipeline.py:686, in Pipeline.compute(self, tp)
674 def compute(self, tp: type | Iterable[type] | Item[T]) -> Any:
675 """
676 Compute result for the given keys.
677
(...)
684 Can be a single type or an iterable of types.
685 """
--> 686 return self.get(tp).compute()
File ~/code/sans/jupyter/sciline/task_graph.py:66, in TaskGraph.compute(self, keys)
64 return dict(zip(keys, results))
65 else:
---> 66 return self._scheduler.get(self._graph, [keys])[0]
File ~/code/sans/jupyter/sciline/scheduler.py:78, in DaskScheduler.get(self, graph, keys)
76 dsk = {tp: (provider, *args) for tp, (provider, args) in graph.items()}
77 try:
---> 78 return self._dask_get(dsk, keys)
79 except RuntimeError as e:
80 if str(e).startswith("Cycle detected"):
File ~/software/mambaforge/lib/python3.10/site-packages/dask/threaded.py:90, in get(dsk, keys, cache, num_workers, pool, **kwargs)
87 elif isinstance(pool, multiprocessing.pool.Pool):
88 pool = MultiprocessingPoolExecutor(pool)
---> 90 results = get_async(
91 pool.submit,
92 pool._max_workers,
93 dsk,
94 keys,
95 cache=cache,
96 get_id=_thread_get_id,
97 pack_exception=pack_exception,
98 **kwargs,
99 )
101 # Cleanup pools associated to dead threads
102 with pools_lock:
File ~/software/mambaforge/lib/python3.10/site-packages/dask/local.py:501, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
499 while state["waiting"] or state["ready"] or state["running"]:
500 fire_tasks(chunksize)
--> 501 for key, res_info, failed in queue_get(queue).result():
502 if failed:
503 exc, tb = loads(res_info)
File ~/software/mambaforge/lib/python3.10/site-packages/dask/local.py:138, in queue_get(q)
137 def queue_get(q):
--> 138 return q.get()
File ~/software/mambaforge/lib/python3.10/queue.py:171, in Queue.get(self, block, timeout)
169 elif timeout is None:
170 while not self._qsize():
--> 171 self.not_empty.wait()
172 elif timeout < 0:
173 raise ValueError("'timeout' must be a non-negative number")
File ~/software/mambaforge/lib/python3.10/threading.py:320, in Condition.wait(self, timeout)
318 try: # restore state no matter what (e.g., KeyboardInterrupt)
319 if timeout is None:
--> 320 waiter.acquire()
321 gotit = True
322 else:
KeyboardInterrupt:
I have a feeling this happens when some steps perform a file download via pooch (the file was not in the local cache). I've seen that pooch tries to start downloads in parallel when multiple files are requested. Maybe there is a clash with Dask? I don't think it happens with sciline.NaiveScheduler?
load_nexus relies on warnings if it cannot load certain parts of the file, to ensure that something incomplete rather than nothing is returned to the user. For the new lower-level interface around NXobject this does not seem like the right choice. Keep the warnings in load_nexus; use exceptions in low-level functions shared with NXobject subclasses.
I think this should be possible. It might just be the mapping to pixels that assumes it exists. From what I can tell, this requires relatively simple changes in:
scippnexus/src/scippnexus/nxdata.py
Lines 768 to 770 in 6e4e566
event_time_zero coord exists). Make sure to add tests for both cases:
Accessing a file via scippnexus with a relative or absolute path that ends in an extraneous '/' raises a KeyError, due to attempting to access the empty path '' in the following code snippet:
scippnexus/src/scippnexus/base.py
Lines 382 to 387 in 86a4bfe
I expected the trailing forward slash not to raise an error, and accessing, e.g., '/entry' and '/entry/' to result in the same object.
Is there a reason that a trailing '/' should raise an error?
H5Base.attrs is annotated to return List[int], but the actual implementations return Attrs (dict-like).
After #117, the following remains to be done:
- NXevent_data fields embedded in NXmonitor or NXdetector. This is used by SNS files.
- scippnexus.v2 depends_on chains.
- depends_on chains.
- NXoff_geometry and NXcylindrical_geometry to per-detector "shapes".
Example:
```python
import h5py
import scipp as sc
import scippnexus as snx

with h5py.File('dummy.nxs', mode='w', driver="core", backing_store=False) as h5root:
    da = sc.DataArray(
        sc.array(dims=['xx', 'yy'], unit='m', values=[[1, 2], [4, 5]]),
        coords=dict(
            xx=sc.array(dims=['xx'], unit='m', values=[1.0, 2.0]),
            yy=sc.array(dims=['yy'], unit='m', values=[0.1, 0.0]),
        ),
    )
    data = snx.create_class(h5root, 'data1', snx.NXdata)
    snx.create_field(data, 'signal', da.data)
    snx.create_field(data, 'xx', da.coords['xx'])
    data.attrs['axes'] = da.dims
    data.attrs['signal'] = 'signal'
    data = snx.Group(data, definitions=snx.base_definitions())
    print(da['xx', 0:2])
    print(data['xx', 0:2])
```
Output:
<scipp.DataArray>
Dimensions: Sizes[xx:2, yy:2, ]
Coordinates:
* xx float64 [m] (xx) [1, 2]
* yy float64 [m] (yy) [0.1, 0]
Data:
int64 [m] (xx, yy) [1, 2, 4, 5]
<scipp.DataArray>
Dimensions: Sizes[xx:2, yy:2, ]
Coordinates:
* xx float64 [m] (xx) [1, 2]
Data:
int64 [m] (xx, yy) [1, 2, 4, 5]
Expected result: I expected them to be the same.
Actual result: positional indexing on the scipp DataArray kept the yy coordinate, but positional indexing on the NXdata object did not.
Not sure if this is the correct behavior or not, but I think it would be more intuitive if they behaved the same.
In the current implementation, no major effort was put into optimization. Basically, the implementation is "naive" in most cases, which might, e.g., result in repeated or redundant calls to h5py.
We should profile ScippNexus for the "typical" cases, i.e., files with hundreds of groups and thousands of datasets. Attention should be paid not just to loading large datasets, but first and foremost to all of the "overhead" from dealing with many small file contents.
Currently, the fallback loader in NXobject
catches (nearly) all exceptions from loaders for concrete classes and uses the fallback. This is intended to allow loading files with partially bad structure. But it also hides user errors like a bad index (wrong dim, bad slice, etc.).
We should distinguish between errors originating in the file structure and errors originating from the user/caller. Only the former should trigger the fallback.
```python
import scippnexus.v2 as snx
# ...
dg = f['event_time_zero', 0:1]  # ok, loads one pulse
dg = f['event_time_zero', 0:0]  # seems to load everything?
```
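The expected semantics would match Python and NumPy, where an empty slice selects nothing:

```python
import numpy as np

pulses = np.arange(5)
print(pulses[0:1])  # [0]
print(pulses[0:0])  # [] -- an empty slice should select nothing, not everything
```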
See https://manual.nexusformat.org/classes/base_classes/NXdetector.html#nxdetector-pixel-mask-field. We should turn this into masks of a scipp.DataArray (if the detector is loaded as a DataArray).
Scipp only really works with boolean masks. Therefore, we should inspect the bitmask, and split it into individual masks. Only bits that are actually in use should result in creation of a corresponding mask.
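A hedged numpy sketch of the bit-splitting idea (names hypothetical; the actual bit meanings are defined by the NXdetector pixel_mask documentation):

```python
import numpy as np

def split_bitmask(pixel_mask):
    """Split an integer bitmask into one boolean mask per bit actually in use."""
    masks = {}
    for bit in range(pixel_mask.dtype.itemsize * 8):
        mask = (pixel_mask & (1 << bit)) != 0
        if mask.any():  # skip bits that are never set
            masks[f'bit_{bit}'] = mask
    return masks

pixel_mask = np.array([0, 1, 2, 3], dtype=np.uint8)
print(sorted(split_bitmask(pixel_mask)))  # ['bit_0', 'bit_1']
```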
How @axes should be encoded/interpreted has been discussed previously in #145.
Comparing what the NeXus standard says about how the @axes attribute should be encoded when applied to the 'DATA' dataset versus when applied to its containing NXdata group, there is some ambiguity.
Namely, notice that both use the term 'array'
Defines the names of the coordinates (independent axes) for this data set as a colon-delimited array.
When axes contains multiple strings, it must be saved as an actual array of strings and not a single comma separated string.
Where this becomes ambiguous is that HDF5 allows array-valued attributes (which h5py reads as numpy arrays), and scippnexus has interpreted both 'colon-delimited array' and 'actual array of strings' to mean a single string with axes names separated by literal ':' characters.
If someone instead interprets 'actual array of strings' to mean an array-valued attribute with string elements, and then creates a NeXus HDF5 file like the one attached, that file cannot be opened by scippnexus due to the use of the split method here:
scippnexus/src/scippnexus/nxdata.py
Line 160 in 58ffbbe
Add a check that self._signal_axes is a str before attempting to use .split, or alternatively a check whether it is a numpy array, e.g., replacing the quoted line 160 by
```python
self._signal_axes = tuple(self._signal_axes if isinstance(self._signal_axes, np.ndarray) else self._signal_axes.split(':'))
```
Alternatively, if the interpretation of 'actual array of strings' should be limited to 'colon-delimited array' like the 'DATA' dataset attribute, petition the NeXus committee to clarify their boxed note for the group attribute.
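A hedged, standalone sketch of a normalization accepting both encodings (in ScippNexus this would live where the attribute is read):

```python
import numpy as np

def normalize_axes(axes):
    """Accept both the colon-delimited-string and the array-of-strings
    encodings of the @axes attribute, returning a tuple of axis names."""
    if isinstance(axes, str):
        return tuple(axes.split(':'))
    # h5py reads array-valued attributes as numpy arrays (possibly of bytes).
    return tuple(ax.decode() if isinstance(ax, bytes) else str(ax)
                 for ax in np.asarray(axes).ravel())

print(normalize_axes('xx:yy'))                 # ('xx', 'yy')
print(normalize_axes(np.array(['xx', 'yy'])))  # ('xx', 'yy')
```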
So they behave a bit more like dicts: we should have len(grp) == len(grp.keys()).
To do: Decide on exact date.
Follow-up to #63. Basically, given a data structure such as a scipp.DataArray and a NeXus application definition, we want to create/write groups. Currently NXobject provides
scippnexus/src/scippnexus/nxobject.py
Line 472 in ece37dc
We can consider extending this with support for an application definition:
```python
with snx.File(name, definition=NXcanSAS) as f:
    group = f[path]
    group.create_class(name, definition=SASdata, data=my_data_array)
```
Here definition provides a key for lookup of the child strategy via group._strategy (which was set up from the NXcanSAS root definition). The child strategy must then provide everything necessary for writing the group and its content (attributes, fields and field attributes, child groups, ...) and how these relate to properties of the data.
Design-wise, one key aspect to address is how to handle recursion. Should the method on NXobject be allowed to deal with this, i.e., may the strategy write an entire subtree, or should this be handled differently, in a way that avoids handling the tree in the strategy?
An alternative opportunity to explore is whether NXobject.__setitem__ could be generalized. Currently it only supports creation of fields (from scipp.Variable), since scipp.DataArray does not contain enough information to create a NeXus group. However, an application definition might provide a wrapper for this:
```python
group['sasdata01'] = SASdata(my_data_array)
```
We need to figure out whether the dual purpose of SASdata, as a definition/strategy for loading and as a wrapper for data, is a reasonable design.
We have allowed UserWarning in the pytest warning setup. However, scipp.VisibleDeprecationWarning inherits from UserWarning, so we are unintentionally hiding those.
In Scipp, we are considering the addition of a DataGroup container. This would be similar to Dataset, but without coords and without restricting the dims or shapes of the items. This is thus quite similar to a NeXus "group". We would therefore like to support loading groups in ScippNexus, returning a DataGroup. There are a number of things to consider:
- We changed __getitem__ to return Python scalars instead of scipp.Variable if no shape or unit is given. This was for more convenient storage in a Python dict. For DataGroup, we are currently leaning towards requiring items to have dims and shape. Should we thus undo this change in ScippNexus, or should DataGroup be more flexible?
- Errors during loading could fall back to returning a DataGroup, since most errors are from "higher level" logic, such as trying to interpret fields for an NXevent_data or NXdetector group. There are a number of subtleties here, especially implementation-wise, as the current design puts some hurdles in the way.
- Some fields raise DimensionError. Currently these are skipped with a warning. We could instead return the entire NXdata as a DataGroup, but this would likely not be useful in many cases. But not doing that would be inconsistent.

Following up on #204 and the fix in #205, I noticed that a path with repeated forward slashes will silently ignore any valid path before the last such '/' and treat the rest of the path specification as an absolute path. E.g., for a NeXus HDF5 file with an NXinstrument group at /entry/instrument, opened as file, the following assert statement passes:
```python
assert id(file['/entry/instrument']) == id(file['/entry//entry/instrument'])
```
This behavior is contrary to file-system paths under unix-like systems, which treat repeated path separators as if they were singular.
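posixpath.normpath implements exactly the unix behavior described above; a hedged sketch of normalizing paths before lookup (helper name hypothetical):

```python
import posixpath

def normalize_nexus_path(path):
    """Collapse repeated '/' and strip a trailing '/', as unix path semantics do."""
    normalized = posixpath.normpath(path)
    # POSIX reserves a leading '//'; for NeXus paths, treat it like '/'.
    return '/' + normalized.lstrip('/') if normalized.startswith('//') else normalized

print(normalize_nexus_path('/entry//entry/instrument'))  # /entry/entry/instrument
print(normalize_nexus_path('/entry/'))                   # /entry
```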
Currently, when there is a depends_on in, e.g., an NXdetector, the corresponding transformation is loaded as a Scipp affine_transform3. This is fine. However, if there is no depends_on, NXtransformations are just loaded as their datasets. The problem is that vital information for the transformations is stored in their attributes. This is cumbersome (hard to use) and currently not supported by scipp.Variable or scipp.DataGroup.
Instead, we should load the transformations in NXtransformations as scipp.Variable with the correct spatial dtype (could be, e.g., rotation3 or translation3). The code for this all exists, since it is used for computing the depends_on transformation chain, but it is not called when loading a plain NXtransformations group.
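For illustration, a hedged numpy sketch of how one NXtransformations rotation (the dataset value is an angle, the 'vector' attribute the rotation axis) maps to a rotation matrix via Rodrigues' formula; this is not the ScippNexus implementation:

```python
import numpy as np

def rotation_matrix(angle_deg, axis):
    """Rotation matrix for an NXtransformations-style rotation:
    an angle about a (not necessarily normalized) axis vector."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    theta = np.deg2rad(angle_deg)
    # Rodrigues' rotation formula: R = I + sin(t) K + (1 - cos(t)) K^2
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

R = rotation_matrix(90.0, [0, 0, 1])  # 90 degrees about z maps x onto y
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # [0. 1. 0.]
```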
We currently get a ton of warnings when loading files without real data (see below). In particular after #172 we may be able to avoid some of those.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/T0_chopper/rotation_speed/value'; setting unit as 'dimensionless'
warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/band_chopper/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/band_chopper/rotation_speed/value'; setting unit as 'dimensionless'
warnings.warn(
CPU times: user 817 ms, sys: 1.93 s, total: 2.75 s
Wall time: 506 ms
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_bunker/monitor_bunker_events as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_bunker as NXmonitor: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_cave/monitor_cave_events as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/monitor_cave as NXmonitor: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/overlap_chopper/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/overlap_chopper/rotation_speed/value'; setting unit as 'dimensionless'
warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/polarizer/rate as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/pulse_shaping_chopper1/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/pulse_shaping_chopper1/rotation_speed/value'; setting unit as 'dimensionless'
warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/pulse_shaping_chopper2/delay as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/field.py:240: UserWarning: Unrecognized unit 'hz' for value dataset in '/entry/instrument/pulse_shaping_chopper2/rotation_speed/value'; setting unit as 'dimensionless'
warnings.warn(
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/sans_detector/sans_event_data as NXevent_data: Required field event_time_zero not found in NXevent_data Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/sans_detector as NXdetector: Signal is not an array-like. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
/home/simon/code/scipp/scippnexus/src/scippnexus/base.py:376: UserWarning: Failed to load /entry/instrument/source/current as NXlog: Could not determine signal field or dimensions. Falling back to loading HDF5 group children as scipp.DataGroup.
warnings.warn(msg)
The current implementation is minimal. We need more in order to save the relevant metadata to the file. See also scipp/esssans#33