joefutrelle / pyifcb
IFCB data system, generation 2
License: MIT License
The top-level API (the ifcb module) should export the most useful classes and functions.
I'm not sure what the mechanism for this is, or how best to have that reflected in the autodoc.
len(bin)
is not the same as the number of triggers, but it's being treated that way:
Line 96 in 9a4a934
Header files for new-style instruments include a list of ADC columns. The order of ADC columns is not expected to change, but new ones may be added. Users may also want to use those column names instead of constant names from the schema classes.
One approach is to use dictlike schema classes keyed by the names in the header files, but this is a problem because the ADC column names could change per-bin, depending on what the instrument manufacturer decides to put in ADC files.
It's possible that bins could inject ADC column names from headers into subclasses of the schema object they return, but that is a complicated implementation that breaks the idea that schemas can be derived from pids.
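As a concrete illustration of the dictlike idea, here's a minimal sketch (the class name, header format, and API are assumptions for illustration, not pyifcb code) keyed by whatever column names a given bin's header supplies:

```python
class HeaderSchema:
    """Sketch of a dictlike schema keyed by per-bin header column names."""

    def __init__(self, header_columns):
        # map each header-supplied column name to its column index
        self._by_name = {name: i for i, name in enumerate(header_columns)}

    def __getitem__(self, name):
        # dict-like lookup keyed by the names in the header file
        return self._by_name[name]

    @property
    def columns(self):
        # column names in ADC column order
        return list(self._by_name)
```

Because the mapping is built per-bin from the header, added or renamed columns in a future instrument revision would be picked up automatically, which is exactly why such a schema can't be derived from the pid alone.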
The cached-property strategy used throughout may be a separation-of-concerns problem for apps that have their own caching procedures. In particular, caching images with a hardcoded cache size of 2 is not appropriate for many applications.
Audit the code and remove caching unless it's used specifically for lazy evaluation of single properties.
Some cases have obviously bad info in the hdr file for ml_analyzed estimates, but other cases are less obvious and still bad. A brute-force approach is to compute ml_analyzed both ways, compare the results, and select the adc-based value when the difference is outside some tolerance. This presumes the adc value is more likely to be correct (which seems to be true from my inspection of results).
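A minimal sketch of that brute-force selection, assuming both values are already computed (the function name and tolerance are illustrative, not part of pyifcb):

```python
def select_ml_analyzed(hdr_value, adc_value, tolerance=0.5):
    """Prefer the hdr-based estimate, but fall back to the adc-based
    value when the two disagree by more than `tolerance` (in ml).

    Sketch only: the 0.5 ml tolerance is a placeholder that would
    need tuning against real data.
    """
    if abs(hdr_value - adc_value) > tolerance:
        # assume the adc-based value is more likely to be correct
        return adc_value
    return hdr_value
```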
There's no reason for it to have to be a @property method.
Need to define how to access a bin, independent of the underlying storage; that API will then be an ABC implemented variously.
It raises pandas.errors.EmptyDataError; it should probably return an empty DataFrame instead. Callers will still have to deal with an empty DataFrame.
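A hedged sketch of the proposed behavior (the wrapper name is an assumption):

```python
import pandas as pd
from pandas.errors import EmptyDataError

def read_adc(path_or_buffer):
    """Return an empty DataFrame instead of propagating
    EmptyDataError when the ADC file has no rows. Sketch only."""
    try:
        return pd.read_csv(path_or_buffer, header=None)
    except EmptyDataError:
        return pd.DataFrame()
```

Callers can then branch on `df.empty` rather than wrapping every read in a try/except.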
Try a file with >10k ROIs
remote_bin currently downloads all three files in a fileset even if, e.g., no image data is requested.
It should perform that operation lazily, to speed up cases where image (or even ADC) data is not needed.
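One way to sketch the lazy behavior, assuming a per-file fetch callable (all names here are illustrative, not the remote_bin API):

```python
class LazyFileset:
    """Sketch: defer each download until the corresponding file is
    actually accessed, and cache the result for reuse."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that downloads one file by extension
        self._cache = {}

    def _get(self, ext):
        if ext not in self._cache:
            self._cache[ext] = self._fetch(ext)  # download on first access
        return self._cache[ext]

    @property
    def adc(self):
        return self._get('adc')

    @property
    def hdr(self):
        return self._get('hdr')

    @property
    def roi(self):
        return self._get('roi')
```

With this shape, a caller that only touches `hdr` never pays for the (much larger) ROI download.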
Use a different format than the current zip format. Don't stitch. Include the entire ADC and HDR files instead of a modified CSV file. Format images as PNGs, using their LIDs as entry names.
The new implementation will live here:
https://github.com/joefutrelle/pyifcb/blob/master/ifcb/metrics/ml_analyzed.py
In the current rev of the dashboard, there is an optimized procedure for reading a single image from a fileset, for web requests where it is not possible to leave the file open.
It skips CSV parsing of the ADC file for everything except the line containing the relevant ADC record.
This is important for any implementation that reads a small number of images from a large number of files.
Pandas supplies options in read_csv that skip rows at the beginning and end; these will need investigation, since they seem to have changed a bit in recent releases. Also: I'm not sure whether they actually skip parsing those rows or just parse them and throw the data out (which is slower).
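As a sketch of the skiprows/nrows approach (not the dashboard's actual implementation; whether the skipped rows are fully parsed internally is the open question noted above):

```python
import io
import pandas as pd

def read_one_adc_record(adc_source, target_index):
    """Read a single record (0-based row index) from a headerless
    ADC CSV without building a DataFrame for the whole file."""
    return pd.read_csv(adc_source, header=None,
                       skiprows=target_index, nrows=1)

# e.g., the third record of a small headerless ADC-style file
row = read_one_adc_record(io.StringIO("1,10\n2,20\n3,30\n"), 2)
```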
This inconsistency needs to be resolved, as it is already creating divergence between Zip and HDF5 support: the HDF5 support, built on the non-public Fileset APIs, writes out all ADC records instead of excluding 0x0 ones.
The consistent behavior would be to always exclude, or always include, 0x0 ADC records in the adc property and the dict-like interface.
Consistent exclusion is probably the most convenient approach; if other ADC data is desired, it can be accessed through the lower-level AdcFile class. It is also slightly more efficient, in that 0x0 ADC rows will simply not be written out and so won't have to be parsed later.
A problem with consistent exclusion is the inability to round-trip raw data from the various formats, and that may be an argument for making the adc property expose all ADC rows, including 0x0 rows.
Current bin implementations require that some backing file exist in order to provide image access. All images can be iterated over using iteritems or related methods, but there's no simple way to "detach" from the underlying files and still have all image data accessible.
There should be an in-memory bin that provides access to all bin data and metadata in memory, and BaseBin should be able to produce one.
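A rough sketch of what such an in-memory bin might look like (the attribute names are assumptions about the bin interface, not the real BaseBin):

```python
class InMemoryBin:
    """Sketch of a bin detached from its backing files: everything is
    copied into plain in-memory structures up front."""

    def __init__(self, pid, adc, headers, images):
        self.pid = pid          # bin identifier
        self.adc = adc          # ADC data (e.g., a DataFrame)
        self.headers = headers  # header metadata dict
        self.images = images    # target number -> image array

    @classmethod
    def from_bin(cls, source):
        # eagerly copy everything so the source files can be closed
        return cls(source.pid, source.adc, dict(source.headers),
                   {n: img for n, img in source.images.items()})
```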
Some new-style data has "stitches" in the ADC file where one ROI is completely inside the other ROI, but contains a different part of the camera field.
To prevent this, InfilledImages should work, but refuse to stitch any new-style data.
We're having to work around the current behavior in ifcbdb.
Right now stitching just returns images; other than the coordinates property on Stitcher, there is no way to get a modified ADC record with correct image metrics. How should this be handled (if at all)?
Right now there is a special getattr implementation on bins that provides schema keys. This is an awkward encapsulation violation and should be backed out, including updating the documentation.
It makes more sense for callers to get the schema keys via bin.schema.
pysmb does not appear to be actively developed; switch to smbprotocol instead.
The open method should not be supported for bins; instead, constructing a bin should "open" it, even if underlying operations are lazy. A close method, which should be called by __exit__, then needs to release all underlying resources that are "open".
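A minimal sketch of that lifecycle (class and attribute names are illustrative, not pyifcb's):

```python
class BinLifecycleSketch:
    """Sketch: construction 'opens' the bin (possibly lazily),
    close() releases resources, and __exit__ delegates to close()."""

    def __init__(self, open_resource):
        self._resource = open_resource()  # "opening" happens at construction
        self.closed = False

    def close(self):
        # idempotent release of underlying resources
        if not self.closed:
            self._resource.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

Note there is no open() method: a constructed bin is by definition open, which removes an entire class of "used before open" errors.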
It dies on this line because it should say self.version
This appears to primarily be an issue with find_product_file, where it assumes that all files returned by os.scandir exist. But os.scandir will list broken symbolic links.
os.scandir returns directory entries; it looks like is_file can be used to determine whether a symbolic link is broken.
The desired behavior is to raise a KeyError if the product file is a broken symbolic link, just as if no existing file were found.
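A sketch of that behavior using DirEntry.is_file(), which follows symlinks by default and so returns False for a dangling link. The function shape is an assumption about find_product_file, not the actual pyifcb signature:

```python
import os

def find_product_file(directory, name):
    """Return the path of the named file, treating broken symlinks
    exactly like missing files. Sketch only."""
    for entry in os.scandir(directory):
        # is_file() follows symlinks, so a broken link fails this check
        if entry.name == name and entry.is_file():
            return entry.path
    # missing file or broken symlink: behave as if not found
    raise KeyError(name)
```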
Or is raw data access too early for such a move?
If the dictlike interface excludes them, what about the adc property?
If I'm not mistaken, the dashboard database now includes a field called something like sample_type (with entries such as underway, cast, underway_discrete, beads), but the export_metadata API (e.g., https://ifcb-data.whoi.edu/api/export_metadata/SPIROPA) does not include that column. For my current use, that info would be very helpful.
P.S. Not sure if I'm posting this issue in the correct repo... feel free to move it if appropriate.
Fileset should just describe the files and the OO interfaces should be moved to FilesetBin.
As an alternative to the InfilledImages class checked into ifcb-analysis, a proposed algorithm is to fill the missing regions in a raw stitch with the median value of the image data.
This does not seem to underperform the complex v1 stitching algorithm, and is much easier to implement cross-language.
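The median-fill idea is small enough to sketch directly (the argument conventions, a grayscale array plus a boolean mask of missing pixels, are assumptions):

```python
import numpy as np

def infill_with_median(image, missing):
    """Fill missing pixels with the median of the known pixels.

    image: 2-D array of pixel values
    missing: boolean mask, True where data is absent from the raw stitch
    """
    filled = image.copy()
    # single scalar fill value: the median of all known pixels
    filled[missing] = np.median(image[~missing])
    return filled
```

The entire algorithm is two array operations, which is what makes it easy to port to MATLAB or any other language.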
Instead of a MATLAB struct, use two cell arrays (or a char array and a cell array), one called header_names and one called header_values.
Could we have an API that provides per-dataset results as counts in a matrix of bin rows and class (and any relevant class-tag combos) columns?
P.S. Is this the correct repo for this issue?
ADC schemas are minimal right now, complete them.
There are numerous issues with that, including a dependency on functools32 and the use of StringIO.
On this line, ClassScoresFile opens the HDF5 file in h5py's default mode, which is 'a'. It should explicitly open it in mode 'r'; otherwise the file modification date will be updated by the OS.
I'm not sure what to name the resulting classes. Each represents a different interface to a data directory: one is the Fileset interface, and one is the Bin interface.
Maybe FilesetDirectory and BinDirectory?
Suppose a user wants to scatterplot the X/Y position from a bin on the dashboard. This works:
import ifcb
from matplotlib import pyplot as plt

URL = 'http://ifcb-data.whoi.edu/mvco/IFCB5_2016_306_130702'
with ifcb.load_url(URL, images=False) as b:
    s = b.schema
    adc = b.adc
    plt.scatter(adc[s.ROI_X], adc[s.ROI_Y])
but it's annoying to always have to do this step. How about this?
...
with ifcb.load_url(URL, images=False) as b:
    plt.scatter(b.adc[b.ROI_X], b.adc[b.ROI_Y])
The idea being that the schema keys are provisioned as attributes on the bin object. This could probably be implemented in BaseBin.
I'm using assert in non-testing code. This is going to be incorrect usage in almost every case.
http://stackoverflow.com/questions/944592/best-practice-for-python-assert
In testing code, it's OK for now.
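For example, an assert guarding real input can be replaced with an explicit exception, since asserts are stripped when Python runs with optimization (`python -O`). The function and message here are illustrative, not from the codebase:

```python
def require_adc(adc):
    """Validate input with an explicit exception rather than assert.

    # BAD: assert adc is not None, 'no ADC data'  -- vanishes under -O
    """
    if adc is None:
        raise ValueError('no ADC data')  # always enforced
    return adc
```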
Instead of the module currently called "imageio" in pyifcb, use the imageio package.
The MATLAB and Python versions of the ml_analyzed calculation produce an error in some cases where no inhibit time is available from the hdr file AND the last line of the adc file is bad. Previously we had a special case for ml < 0 (from the last adc line), but some cases instead produce ml >> 5 ml (not realistic). I think the better criterion may be to compare the 2nd and 23rd entries on the last line to see if they differ by more than a few tens of milliseconds (the normal diff). In the MATLAB script just committed I've replaced:
if ml_analyzed(count) <= 0
with
if abs(adc.Var23(end)-adc.Var2(end)) > 0.1
I haven't fully tested this, but it works for bin D20180829T144312_IFCB125, which previously gave ml_analyzed = 32 ml and now gives 3.7348 ml.
Just as there's an HDF representation, there should be a MATLAB representation available for raw data.
They currently sort alphabetically, which is faster. The only issue this creates for sorting is with the MVCO time series, which has old-style PIDs from two different instruments.
ADC files have no header rows, and there are various sources of column names, including the v2 instrument software, which writes a column name header, and the dashboard, which uses a set of common column names so that, e.g., the x/y position don't have different names in the different versions.
Decide which sources are authoritative. It seems to me that common column names are beneficial, but contradicting the column name header in the v2 HDR files seems like bad behavior for a raw data API.
There are common column numbers, and those should be considered authoritative.
The use of column names as HDF group names is problematic although highly readable.
Right now there's no API for accessing an ADC dataframe that excludes non-image targets. This is a simple operation in Pandas:
adc = bin.adc
images_adc = adc[adc[bin.ROI_WIDTH] > 0]
Provide this on BaseBin.
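A hedged sketch of such a helper as a free function (the real API would presumably live on BaseBin, and the column naming here is an assumption):

```python
import pandas as pd

def images_adc(adc, width_column):
    """Keep only ADC rows that describe actual images,
    i.e., those with a nonzero ROI width. Sketch only."""
    return adc[adc[width_column] > 0]
```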