thesofakillers / nowcastlib

๐Ÿง™โ€โ™‚๏ธ๐Ÿ”ง Utils that can be reused and shared across and beyond the ESO Nowcast project

Home Page: https://giuliostarace.com/nowcastlib

License: GNU General Public License v3.0

Languages: Python 99.86%, Makefile 0.14%
Topics: nowcast, eso-nowcast, eso

nowcastlib's People

Contributors

thesofakillers

nowcastlib's Issues

Postprocessing is slow in general

Splitting data into test/train/val before the vast majority of our postprocessing seems unnecessary, and we actually end up making redundant computations this way, for example when generating new fields (each split recomputes the same derived columns). Splitting is basically only necessary for standardization, which must be fit on the training set alone to avoid leakage.

The steps should instead run in this order (a sketch of the reordered flow follows the list):

  • preprocess
  • sync
  • postprocess
  • generate new fields
  • split
  • standardize
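
A minimal sketch of the reordered flow; generate_fields and the column names here are hypothetical stand-ins for the library's actual steps, and scikit-learn is used purely for illustration:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the "generate new fields" step: derived
# columns are deterministic, so computing them once before splitting
# avoids repeating the work per split
def generate_fields(df):
    df = df.copy()
    df["wind_speed_sq"] = df["wind_speed"] ** 2
    return df

rng = np.random.default_rng(0)
synced = pd.DataFrame({"wind_speed": rng.uniform(0, 20, size=100)})

full = generate_fields(synced)  # runs after preprocess/sync/postprocess

# split only at the end...
train, test = train_test_split(full, test_size=0.2, shuffle=False)

# ...so that standardization is the only split-aware step:
# fit on the training set alone, then apply to both splits
scaler = StandardScaler().fit(train)
train_std = pd.DataFrame(scaler.transform(train),
                         columns=train.columns, index=train.index)
test_std = pd.DataFrame(scaler.transform(test),
                        columns=test.columns, index=test.index)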

Get rid of cascade-like nature of pipeline

Currently, because the pipeline assumes an order of operations, running an individual process (e.g. postprocessing) also runs all the processes leading up to it.

For example, suppose the user wants to run postprocessing. The pipeline will run preprocessing, synchronization and postprocessing in that order.

At the moment, the best workaround for keeping things truly independent of previous processes is to keep the configuration for those previous processes to a minimum, so that as little redundant processing as possible is performed.

This is, however, cumbersome: the user needs to open, edit and maintain different configuration files for different processes, which defeats the purpose of having a single configuration schema (the DataSet config struct).

The reason the pipeline works this way is that the output of a given process serves as the input to the next process, and the only input the user can specify in the configuration is the input to the first step of the pipeline, i.e. preprocessing. Therefore, if a user wishes to run a given process, all the processes before it need to run so that it receives the right input.


Ideally, the user should be able to write a complete configuration (if they wished) but choose to run only part of the pipeline by using the right CLI command and providing the necessary input themselves.

So, if the user wanted to postprocess a synchronized dataset that they already have, they would call nowcastlib postprocess with the relevant configuration and the path to the file they wish to postprocess.

This would tell the pipeline to perform only postprocessing, rather than the current behaviour, in which preprocessing and synchronization are performed beforehand.


Each subprocess CLI command should therefore take at least one additional (optional) argument, -i or --input, where the user can specify the path to an input file to use, making it possible to skip all the previous steps. A sketch follows.
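
A minimal sketch of how the flag could look, assuming argparse; load_dataset, run_preprocess_and_sync and run_postprocess are hypothetical names that do not exist in the library, and the real CLI may be structured differently:

import argparse
import pandas as pd

def load_dataset(path):
    # hypothetical loader for a user-supplied intermediate dataset
    return pd.read_csv(path, index_col=0, parse_dates=True)

def run_preprocess_and_sync(config_path):
    # stand-in for the current cascade (preprocess -> sync)
    raise NotImplementedError

def run_postprocess(data, config_path):
    # stand-in for the postprocessing step itself
    raise NotImplementedError

def main(argv=None):
    parser = argparse.ArgumentParser(prog="nowcastlib")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # one subcommand per pipeline step; only postprocess is shown here
    postproc = subparsers.add_parser("postprocess")
    postproc.add_argument("-c", "--config", required=True,
                          help="path to the DataSet configuration file")
    postproc.add_argument("-i", "--input", default=None,
                          help="path to an already-synchronized dataset; "
                               "when given, the earlier steps are skipped")
    args = parser.parse_args(argv)
    if args.input is not None:
        data = load_dataset(args.input)
    else:
        data = run_preprocess_and_sync(args.config)  # current cascade
    run_postprocess(data, args.config)

Defaulting --input to None keeps the current cascading behaviour as a fallback, so existing invocations continue to work unchanged.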

skyfield calculations may be overly accurate for requirements at the cost of computation

I have not analyzed the big-O performance, but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.

The following lines need addressing:

# label each timestamp with the index of the most recent sunset: for each
# sunset, find the first timestamp after it and bump all labels from there
# on -- O(len(sunsets) * len(datetime_series)) with sorted datetime_series
sunset_idxs = np.zeros(len(datetime_series), dtype=int)
for sunset in sunsets[1:]:
    change = np.where(datetime_series > sunset)[0][0]
    sunset_idxs[change:] += 1
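
A vectorized alternative, sketched here with synthetic inputs and assuming both arrays are sorted ascending: np.searchsorted computes, for each timestamp, how many of sunsets[1:] fall strictly before it, which is exactly the count the loop accumulates, at O(n log m) rather than O(n * m):

import numpy as np

# synthetic stand-ins for the issue's inputs, both sorted ascending
datetime_series = np.arange(np.datetime64("2021-01-01T00"),
                            np.datetime64("2021-01-11T00"),
                            np.timedelta64(1, "h"))
sunsets = np.arange(np.datetime64("2021-01-01T17"),
                    np.datetime64("2021-01-11T17"),
                    np.timedelta64(24, "h"))

# for each timestamp, count how many of sunsets[1:] are strictly earlier;
# side="left" matches the strict ">" comparison in the loop, and timestamps
# past the last sunset no longer trigger an IndexError
sunset_idxs = np.searchsorted(sunsets[1:], datetime_series, side="left")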

More efficient Data Synchronization

The current data synchronisation implementation, in particular with regards to finding overlapping contiguous chunks across data sources, might ultimately require a lot of memory if the time series is long enough or the sampling rate is too high.

P. Fluxa mentions:

A colleague of mine and I figured out a "compressed" way of synchronising chunks, which requires knowing the start and end times of every interval. That is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it only depends on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:

"""
Sample script showing the solution of the following problem:

"given N channels of data with R continous ranges each, find all the
ranges where there is data for all N channels"
"""

import random
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# create a set of random ranges; this is just a formality
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1 #random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)        
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)  
 
# extract all timestamps from ranges, keeping track of whether they
# correspond to start or end of ranges
timest = rangesdf['start'].values.tolist() 
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
timest += rangesdf['end'].values.tolist()
# build an intermediate dataframe of all endpoints, sorted in time;
# the running sum of the flags counts how many channels have data at
# each point, so positions where it equals numChan lie in common ranges
sdf = pandas.DataFrame(dict(st=timest, flag=flags))
sdf.sort_values(by='st', inplace=True)
cumsum = sdf.flag.cumsum()
cr = numpy.where(cumsum == numChan)
crlist = cr[0].tolist()
# each position where the count reaches numChan opens a common range and
# the next endpoint closes it, so pair each such index with its successor
crarr = list()
for e in crlist:
    crarr.append(e)
    crarr.append(e + 1)
crarr = numpy.asarray(crarr)
crmask = tuple((crarr,))
cmnRanges = sdf.iloc[crmask].st.values.reshape((-1, 2))

# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')

And this is the kind of result you get:

[figure: each channel's ranges drawn as horizontal lines, with the common ranges bounded by dashed red vertical lines]

ModuleNotFoundError: No module named 'importlib_metadata'

When importing nowcastlib, the following error is raised:

>>> import nowcastlib as ncl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/py3.7/lib/python3.7/site-packages/nowcastlib/__init__.py", line 5, in <module>
    from importlib_metadata import version
ModuleNotFoundError: No module named 'importlib_metadata'

As such, the import fails and the library is unusable.
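
The likely cause is that importlib_metadata, the third-party backport, is imported unconditionally without being declared as a dependency. A minimal sketch of a possible fix for nowcastlib/__init__.py, assuming only version is needed (on Python 3.8+ the stdlib equivalent exists):

try:
    # stdlib module, available from Python 3.8 onwards
    from importlib.metadata import version
except ImportError:
    # third-party backport for Python <= 3.7; it must then be
    # declared as an explicit dependency for those versions
    from importlib_metadata import version

__version__ = version("nowcastlib")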

Example in README contains mistake

The example listed in the README leads to the following error trace:

>>> data_df = pd.DataFrame(
...     [[0, 3, 4, np.NaN], [32, 4, np.NaN, 4], [56, 8, 0, np.NaN]],
...     columns=["A", "B", "C"],
...     index=pd.date_range(start="1/1/2018", periods=4, freq="2min"),
... )
Traceback (most recent call last):
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 898, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 947, in _validate_or_indexify_columns
    f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\frame.py", line 700, in __init__
    dtype,
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 483, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 799, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 901, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 3 columns passed, passed data had 4 columns
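
The rows have four values each but only three column labels are passed; note also that periods=4 builds a 4-element index for 3 rows, which would raise a shape error next. A corrected version, assuming the example intended a fourth column (the label "D" is a guess):

import numpy as np
import pandas as pd

data_df = pd.DataFrame(
    [[0, 3, 4, np.nan], [32, 4, np.nan, 4], [56, 8, 0, np.nan]],
    columns=["A", "B", "C", "D"],  # fourth label added; the name is a guess
    index=pd.date_range(start="1/1/2018", periods=3, freq="2min"),  # 3 rows
)
print(data_df)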

Tensorflow Compatibility

tensorflow (2.6.0) depends on numpy (>=1.19.2,<1.20.0),
tensorflow (>=2.6.0,<3.0.0) requires numpy (>=1.19.2,<1.20.0).
nowcastlib (3.0.12) depends on numpy (>=1.20.3,<2.0.0)

So nowcastlib cannot be used together with tensorflow unless numpy is downgraded to (>=1.19.2,<1.20.0), which violates nowcastlib's own declared constraint; resolving this properly would require nowcastlib to relax its numpy lower bound.
