thesofakillers / nowcastlib

๐Ÿง™โ€โ™‚๏ธ๐Ÿ”ง Utils that can be reused and shared across and beyond the ESO Nowcast project

Home Page: https://giuliostarace.com/nowcastlib

License: GNU General Public License v3.0

Languages: Python 99.86%, Makefile 0.14%
Topics: nowcast, eso-nowcast, eso

nowcastlib's People

Contributors

thesofakillers

nowcastlib's Issues

Postprocessing is slow in general

Splitting data into test/train/val before the vast majority of our postprocessing seems unnecessary, and we actually end up making redundant computations this way, for example when generating new fields (each split recomputes the same derived columns). Splitting is basically only necessary for standardization, which must be fit on the training set alone to avoid leakage.

The steps should instead run in this order (a sketch of the reordered flow follows the list):

  • preprocess
  • sync
  • postprocess
  • generate new fields
  • split
  • standardize
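
A minimal sketch of the reordered flow; generate_fields and the column names here are hypothetical stand-ins for the library's actual steps, and scikit-learn is used purely for illustration:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the "generate new fields" step: derived
# columns are deterministic, so computing them once before splitting
# avoids repeating the work per split
def generate_fields(df):
    df = df.copy()
    df["wind_speed_sq"] = df["wind_speed"] ** 2
    return df

rng = np.random.default_rng(0)
synced = pd.DataFrame({"wind_speed": rng.uniform(0, 20, size=100)})

full = generate_fields(synced)  # runs after preprocess/sync/postprocess

# split only at the end...
train, test = train_test_split(full, test_size=0.2, shuffle=False)

# ...so that standardization is the only split-aware step:
# fit on the training set alone, then apply to both splits
scaler = StandardScaler().fit(train)
train_std = pd.DataFrame(scaler.transform(train),
                         columns=train.columns, index=train.index)
test_std = pd.DataFrame(scaler.transform(test),
                        columns=test.columns, index=test.index)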

Get rid of cascade-like nature of pipeline

Currently, because the pipeline assumes an order of operations, running an individual process (e.g. postprocessing) also runs all the processes leading up to it.

For example, suppose the user wants to run postprocessing. The pipeline will run preprocessing, synchronization and postprocessing in that order.

At the moment, the best workaround for keeping things truly independent of previous processes is to keep the configuration for those previous processes to a minimum, so that as little redundant processing as possible is performed.

This is, however, cumbersome: the user needs to open, edit and maintain different configuration files for different processes, which defeats the purpose of having a single configuration schema (the DataSet config struct).

The reason the pipeline works this way is that the output of a given process serves as the input to the next process, and the only input the user can specify in the configuration is the input to the first step of the pipeline, i.e. preprocessing. Therefore, if a user wishes to run a given process, all the processes before it need to run so that it receives the right input.


Ideally, the user should be able to write a complete configuration (if they wished) but choose to run only part of the pipeline by using the right CLI command and providing the necessary input themselves.

So, if the user wanted to postprocess a synchronized dataset that they already have, they would call nowcastlib postprocess with the relevant configuration and the path to the file they wish to postprocess.

This would tell the pipeline to perform only postprocessing, rather than the current behaviour, in which preprocessing and synchronization are performed beforehand.


Each subprocess CLI command should therefore take at least one additional (optional) argument, -i or --input, where the user can specify the path to an input file to use, making it possible to skip all the previous steps. A sketch follows.
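
A minimal sketch of how the flag could look, assuming argparse; load_dataset, run_preprocess_and_sync and run_postprocess are hypothetical names that do not exist in the library, and the real CLI may be structured differently:

import argparse
import pandas as pd

def load_dataset(path):
    # hypothetical loader for a user-supplied intermediate dataset
    return pd.read_csv(path, index_col=0, parse_dates=True)

def run_preprocess_and_sync(config_path):
    # stand-in for the current cascade (preprocess -> sync)
    raise NotImplementedError

def run_postprocess(data, config_path):
    # stand-in for the postprocessing step itself
    raise NotImplementedError

def main(argv=None):
    parser = argparse.ArgumentParser(prog="nowcastlib")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # one subcommand per pipeline step; only postprocess is shown here
    postproc = subparsers.add_parser("postprocess")
    postproc.add_argument("-c", "--config", required=True,
                          help="path to the DataSet configuration file")
    postproc.add_argument("-i", "--input", default=None,
                          help="path to an already-synchronized dataset; "
                               "when given, the earlier steps are skipped")
    args = parser.parse_args(argv)
    if args.input is not None:
        data = load_dataset(args.input)
    else:
        data = run_preprocess_and_sync(args.config)  # current cascade
    run_postprocess(data, args.config)

Defaulting --input to None keeps the current cascading behaviour as a fallback, so existing invocations continue to work unchanged.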

skyfield calculations may be overly accurate for requirements at the cost of computation

I have not analyzed the big-O performance, but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.

The following lines need addressing:

# label each timestamp with the index of the most recent sunset: for each
# sunset, find the first timestamp after it and bump all labels from there
# on -- O(len(sunsets) * len(datetime_series)) with sorted datetime_series
sunset_idxs = np.zeros(len(datetime_series), dtype=int)
for sunset in sunsets[1:]:
    change = np.where(datetime_series > sunset)[0][0]
    sunset_idxs[change:] += 1
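
A vectorized alternative, sketched here with synthetic inputs and assuming both arrays are sorted ascending: np.searchsorted computes, for each timestamp, how many of sunsets[1:] fall strictly before it, which is exactly the count the loop accumulates, at O(n log m) rather than O(n * m):

import numpy as np

# synthetic stand-ins for the issue's inputs, both sorted ascending
datetime_series = np.arange(np.datetime64("2021-01-01T00"),
                            np.datetime64("2021-01-11T00"),
                            np.timedelta64(1, "h"))
sunsets = np.arange(np.datetime64("2021-01-01T17"),
                    np.datetime64("2021-01-11T17"),
                    np.timedelta64(24, "h"))

# for each timestamp, count how many of sunsets[1:] are strictly earlier;
# side="left" matches the strict ">" comparison in the loop, and timestamps
# past the last sunset no longer trigger an IndexError
sunset_idxs = np.searchsorted(sunsets[1:], datetime_series, side="left")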

More efficient Data Synchronization

The current data synchronisation implementation, in particular with regards to finding overlapping contiguous chunks across data sources, might ultimately require a lot of memory if the time series is long enough or the sampling rate is too high.

P. Fluxa mentions:

A colleague of mine and I figured out a "compressed" way of synchronising chunks, which requires knowing the start and end times of every interval. That is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it only depends on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:

"""
Sample script showing the solution of the following problem:

"given N channels of data with R continous ranges each, find all the
ranges where there is data for all N channels"
"""

import random
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# create a set of random ranges; this is just a formality
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1 #random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)        
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)  
 
# extract all timestamps from ranges, keeping track of whether they
# correspond to start or end of ranges
timest = rangesdf['start'].values.tolist() 
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
timest += rangesdf['end'].values.tolist()
# build an intermediate dataframe of all endpoints, sorted in time;
# the running sum of the flags counts how many channels have data at
# each point, so positions where it equals numChan lie in common ranges
sdf = pandas.DataFrame(dict(st=timest, flag=flags))
sdf.sort_values(by='st', inplace=True)
cumsum = sdf.flag.cumsum()
cr = numpy.where(cumsum == numChan)
crlist = cr[0].tolist()
# each position where the count reaches numChan opens a common range and
# the next endpoint closes it, so pair each such index with its successor
crarr = list()
for e in crlist:
    crarr.append(e)
    crarr.append(e + 1)
crarr = numpy.asarray(crarr)
crmask = tuple((crarr,))
cmnRanges = sdf.iloc[crmask].st.values.reshape((-1, 2))

# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')

And this is the kind of result you get:

[figure: each channel's ranges drawn as horizontal lines, with the common ranges bounded by dashed red vertical lines]

ModuleNotFoundError: No module named 'importlib_metadata'

When importing nowcastlib, the following error is raised:

>>> import nowcastlib as ncl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/py3.7/lib/python3.7/site-packages/nowcastlib/__init__.py", line 5, in <module>
    from importlib_metadata import version
ModuleNotFoundError: No module named 'importlib_metadata'

As such, the import fails and the library is unusable.
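
The likely cause is that importlib_metadata, the third-party backport, is imported unconditionally without being declared as a dependency. A minimal sketch of a possible fix for nowcastlib/__init__.py, assuming only version is needed (on Python 3.8+ the stdlib equivalent exists):

try:
    # stdlib module, available from Python 3.8 onwards
    from importlib.metadata import version
except ImportError:
    # third-party backport for Python <= 3.7; it must then be
    # declared as an explicit dependency for those versions
    from importlib_metadata import version

__version__ = version("nowcastlib")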

Example in README contains mistake

The example listed in the README leads to the following error trace:

>>> data_df = pd.DataFrame(
...     [[0, 3, 4, np.NaN], [32, 4, np.NaN, 4], [56, 8, 0, np.NaN]],
...     columns=["A", "B", "C"],
...     index=pd.date_range(start="1/1/2018", periods=4, freq="2min"),
... )
Traceback (most recent call last):
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 898, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 947, in _validate_or_indexify_columns
    f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\frame.py", line 700, in __init__
    dtype,
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 483, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 799, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 901, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 3 columns passed, passed data had 4 columns
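
The rows have four values each but only three column labels are passed; note also that periods=4 builds a 4-element index for 3 rows, which would raise a shape error next. A corrected version, assuming the example intended a fourth column (the label "D" is a guess):

import numpy as np
import pandas as pd

data_df = pd.DataFrame(
    [[0, 3, 4, np.nan], [32, 4, np.nan, 4], [56, 8, 0, np.nan]],
    columns=["A", "B", "C", "D"],  # fourth label added; the name is a guess
    index=pd.date_range(start="1/1/2018", periods=3, freq="2min"),  # 3 rows
)
print(data_df)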

Tensorflow Compatibility

tensorflow (2.6.0) depends on numpy (>=1.19.2,<1.20.0),
tensorflow (>=2.6.0,<3.0.0) requires numpy (>=1.19.2,<1.20.0).
nowcastlib (3.0.12) depends on numpy (>=1.20.3,<2.0.0)

So nowcastlib cannot be used together with tensorflow unless numpy is downgraded to (>=1.19.2,<1.20.0), which violates nowcastlib's own declared constraint; resolving this properly would require nowcastlib to relax its numpy lower bound.
