
pyifcb's People

Contributors

dependabot[bot], joefutrelle, mike-kaimika

pyifcb's Issues

build top-level API

The top-level API (the ifcb module) should export the most useful classes and functions.

It's not yet clear what the mechanism for this should be, or how best to have the exports reflected in the autodoc.

align schemas with names from IFCB v2 header files

header files for new-style instruments include a list of ADC columns. It is not expected that the order of ADC columns will change, but possibly new ones will be added. Users may also want to use those column names instead of constant names from the schema classes.

One approach is to use dict-like schema classes keyed by the names in the header files, but this is a problem because the ADC column names could change per-bin, depending on what the instrument manufacturer decides to put in ADC files.

It's possible that bins could inject ADC column names from headers into subclasses of the schema object that they return, but this is a complicated implementation that breaks the idea that schemas can be derived from pids.

audit use of lru_cache

The cached-property strategy used throughout may be a separation-of-concerns problem for apps that have different caching needs. In particular, caching images with a hardcoded cache size of 2 is not appropriate for many applications.

Audit the code and remove caching unless it's used specifically for lazy evaluation of single properties.
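For the "lazy evaluation of single properties" case, functools.cached_property is a good fit: it evaluates once per instance with no process-wide cache size to tune, unlike a hardcoded lru_cache(maxsize=2). A sketch (the class and property names are illustrative):

```python
from functools import cached_property

class BinSketch:
    """Illustrative only: per-instance lazy evaluation via
    functools.cached_property, with no global cache size to tune."""

    def __init__(self, pid):
        self.pid = pid
        self.evaluations = 0

    @cached_property
    def adc(self):
        self.evaluations += 1          # stands in for expensive CSV parsing
        return ['record-%d' % i for i in range(3)]

b = BinSketch('D20180829T144312_IFCB125')
b.adc
b.adc
assert b.evaluations == 1              # evaluated lazily, exactly once
```

Apps that want a shared bounded image cache could then layer their own policy on top rather than inheriting one from the library.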

ml_analyzed should be computed from hdr and adc info to select the best estimate

Some cases have obviously bad info in the hdr file for ml_analyzed estimates, but other cases are less obvious but still bad. A brute force approach is to compute ml_analyzed both ways, compare the results, and select the adc based value in cases where the difference is outside some tolerance. This presumes the adc value is more likely to be correct (which seems to be true from my inspection of results).
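The brute-force comparison could look roughly like this (function name, tolerance, and the relative-difference criterion are all placeholders, not a tested algorithm):

```python
def best_ml_analyzed(ml_hdr, ml_adc, rel_tol=0.1):
    """Hypothetical selection: use the hdr-based estimate unless it
    disagrees with the adc-based estimate by more than rel_tol
    (relative), in which case prefer the adc value, which inspection
    suggests is more likely correct. rel_tol is a placeholder."""
    if ml_hdr is None or ml_hdr <= 0:
        return ml_adc
    if abs(ml_hdr - ml_adc) > rel_tol * abs(ml_adc):
        return ml_adc
    return ml_hdr

assert best_ml_analyzed(3.7, 3.73) == 3.7      # close: keep hdr value
assert best_ml_analyzed(32.0, 3.73) == 3.73    # way off: use adc value
```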

Define bin data access API

Need to define how to access a bin, independent of

  • underlying representation (e.g., raw, HDF, zip, JSON, XML, etc.)
  • local or remote
  • ADC schema version
  • stitching

That API will then be an ABC with multiple implementations.
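A minimal sketch of what that ABC could look like (the property and method names are illustrative, not pyifcb's actual interface):

```python
from abc import ABC, abstractmethod

class AbstractBin(ABC):
    """Sketch of the representation-independent bin API;
    names are illustrative."""

    @property
    @abstractmethod
    def pid(self):
        """Parsed pid, from which the ADC schema version derives."""

    @property
    @abstractmethod
    def adc(self):
        """ADC data, regardless of backing store (raw, HDF, zip...)."""

    @abstractmethod
    def images(self):
        """Iterate over (target_number, image) pairs."""

# An ABC cannot be instantiated until every abstract member is implemented.
try:
    AbstractBin()
    cannot_instantiate = False
except TypeError:
    cannot_instantiate = True
assert cannot_instantiate
```

Concrete raw/HDF/zip/remote implementations would each subclass it and fill in the storage-specific parts.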

remote_bin should not fetch files unless needed

remote_bin currently downloads all three files in a fileset even if e.g., no image data is requested.

Fetching should happen lazily, to speed up cases where image (or even ADC) data is not needed.
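The lazy behavior could be sketched like this (the class, fetch() stand-in, and dict-style access are all hypothetical, not remote_bin's real API):

```python
class LazyRemoteFileset:
    """Illustrative lazy fetcher; fetch() stands in for the real
    HTTP download of one file in the fileset."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._cache = {}
        self.fetched = []              # recorded for demonstration only

    def fetch(self, ext):
        self.fetched.append(ext)       # pretend network I/O happens here
        return ('%s.%s' % (self.base_url, ext)).encode()

    def __getitem__(self, ext):
        if ext not in self._cache:     # download on first access only
            self._cache[ext] = self.fetch(ext)
        return self._cache[ext]

fs = LazyRemoteFileset('http://example.org/D20180829T144312_IFCB125')
fs['hdr']
fs['adc']
assert fs.fetched == ['hdr', 'adc']    # the .roi file was never downloaded
```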

Zip input/output format

Use different format than current zip format. Don't stitch. Include entire ADC and HDR files instead of modified CSV file. Format images as PNGs, using their LIDs as entry names.

fast single-image retrieval

In the current rev of the dashboard, there is an optimized procedure for reading a single image from a fileset, for web requests where it is not possible to leave the file open.

It skips CSV parsing of the ADC file for everything except the line containing the relevant ADC record.

This is important for any implementation that reads a small number of images from a large number of files.

Pandas supplies options in read_csv (skiprows, skipfooter, nrows) that skip rows at the beginning and end; these need investigation, since they seem to have changed a bit in recent releases. It is also not clear whether they actually skip parsing those rows or just parse them and discard the data (which is slower).
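A small sketch of the single-record read using skiprows and nrows (fake data; real code would also pass the appropriate dtype and column names):

```python
import io
import pandas as pd

# 100 fake ADC lines; row i is "i,2i,3i".
csv = io.StringIO('\n'.join('%d,%d,%d' % (i, 2 * i, 3 * i) for i in range(100)))

# Read only the record at (0-based) row 42: skiprows drops everything
# before it and nrows=1 stops the parser after one row. Whether the
# skipped lines are truly not parsed, or parsed and discarded, is the
# open question worth profiling across pandas releases.
row = pd.read_csv(csv, header=None, skiprows=42, nrows=1)
assert row.iloc[0, 0] == 42
assert row.iloc[0, 1] == 84
```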

FilesetBin exposes ADC records without ROIs in its dict-like interface, but not in the 'adc' property

This inconsistency needs to be resolved, as it is already creating divergence between Zip and HDF5 support: the HDF5 support, built on the non-public Fileset APIs, writes out all ADC records instead of excluding 0x0 ones.

The consistent behavior would be to always exclude, or always include, 0x0 ADC records in both the adc property and the dict-like interface.

Consistent exclusion is probably the most convenient approach; if other ADC data is desired it can be accessed through the lower-level AdcFile class. It is also slightly more efficient, in that 0x0 ADC rows will simply not be written out and so won't have to be parsed later.

A problem with consistent exclusion is the inability to round-trip raw data from the various formats, and that may be an argument for making the adc property expose all ADC rows, including 0x0 rows.
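The consistent-exclusion filter itself is one line of pandas; a sketch (column names here stand in for the real ROI_WIDTH / ROI_HEIGHT schema keys):

```python
import pandas as pd

# Illustrative ADC frame; 'roi_width'/'roi_height' stand in for the
# actual schema keys (ROI_WIDTH / ROI_HEIGHT).
adc = pd.DataFrame({
    'roi_width':  [32, 0, 64],
    'roi_height': [24, 0, 48],
})

def image_records(adc, w='roi_width', h='roi_height'):
    """Consistent-exclusion sketch: drop 0x0 (no-ROI) ADC records."""
    return adc[(adc[w] > 0) & (adc[h] > 0)]

assert len(image_records(adc)) == 2    # the 0x0 middle record is excluded
```

Under consistent exclusion, round-tripping raw data would then go through the lower-level AdcFile class rather than the bin API.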

in-memory Bin should contain all image data

Current bin implementations require that some backing file exist in order to provide image access. All images can be iterated over using iteritems or related methods, but there's no simple way to "detach" from the underlying files and still have all image data accessible.

There should be an in-memory bin that provides access to all bin data and metadata in memory, and BaseBin should be able to produce that.
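A rough sketch of such a detached bin (class name, constructor shape, and the from_bin() hook are all hypothetical; iteritems() is the existing iteration API mentioned above):

```python
class InMemoryBin:
    """Illustrative detached bin: holds ADC records and images in
    plain dicts, so no backing file is needed after construction."""

    def __init__(self, pid, adc_records, images):
        self.pid = pid
        self.adc = dict(adc_records)       # target number -> ADC record
        self._images = dict(images)        # target number -> image data

    @classmethod
    def from_bin(cls, bin):
        # BaseBin could grow a method that produces one of these.
        return cls(bin.pid, dict(bin.adc), dict(bin.iteritems()))

    def image(self, target):
        return self._images[target]

b = InMemoryBin('D20180829T144312_IFCB125', {1: (32, 24)}, {1: b'pixels'})
assert b.image(1) == b'pixels'
```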

bad new-style ADC data causes bad stitching in InfilledImages

some new-style data has "stitches" in the ADC file where one ROI is completely inside the other ROI, but contains a different part of the camera field.

To prevent this, InfilledImages should continue to work but refuse to stitch any new-style data.

We're having to work around the current behavior in ifcbdb.

eliminate schema keys as attributes on bins

Right now there is a special getattr implementation on bins that provides schema keys. This is an awkward encapsulation violation and should be backed out, including updating the documentation.

It makes more sense for callers to get the schema keys via {bin}.schema

Change open/close semantics but keep the API clean

The open method should not be supported for bins; instead, constructing a bin should "open" it, even if underlying operations are lazy. A close method, called by __exit__, then needs to release all underlying resources that are "open".
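A sketch of those semantics (the class and _handle() helper are illustrative, not pyifcb's actual implementation):

```python
import os
import tempfile

class LazyOpenBin:
    """Sketch of the proposal: construction 'opens' the bin (actual
    I/O deferred), close() releases resources, and __exit__
    delegates to close() so 'with' blocks work."""

    def __init__(self, path):
        self.path = path
        self._fh = None                # opened lazily on first use

    def _handle(self):
        if self._fh is None:
            self._fh = open(self.path, 'rb')
        return self._fh

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'adc bytes')
    path = tf.name

with LazyOpenBin(path) as b:
    data = b._handle().read()

assert data == b'adc bytes'
assert b._fh is None          # __exit__ released the handle
os.unlink(path)
```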

product access should check for broken symbolic links

This appears to primarily be an issue with find_product_file where it assumes that all files returned by os.scandir exist. But os.scandir will list broken symbolic links.

os.scandir returns directory entries; it looks like DirEntry.is_file (which follows symlinks by default) can be used to determine whether a symbolic link is broken.

The desired behavior is to raise a KeyError if the product file is a broken symbolic link, just as if no existing file is found.
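A sketch of the check (function name is hypothetical; the demo requires a platform where os.symlink is available):

```python
import os
import tempfile

def existing_product_files(directory):
    """Sketch: entry.is_file() follows symlinks by default, so it is
    False for a broken symlink, which we skip just as if the file
    were absent."""
    return sorted(e.name for e in os.scandir(directory) if e.is_file())

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'real.png'), 'w').close()
    os.symlink(os.path.join(d, 'missing.png'), os.path.join(d, 'broken.png'))
    found = existing_product_files(d)

assert found == ['real.png']   # the broken symlink is not listed
```

find_product_file could then raise KeyError when the only candidate is a broken link.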

api to access metadata from dashboard db should include sample_type

If I'm not mistaken, the dashboard database now includes a field called something like sample_type (with entries such as underway, cast, underway_discrete, beads), but the export_metadata API (e.g., https://ifcb-data.whoi.edu/api/export_metadata/SPIROPA) does not include that column. For my current use, that info would be very helpful.

P.S. Not sure if I'm posting this issue in the correct repo...feel free to move it if appropriate.

Stitch via median?

As an alternative to the InfilledImages class checked into ifcb-analysis, a proposed algorithm is to fill the missing regions in a raw stitch with the median value in the image data.

This does not seem to underperform the complex v1 stitching algorithm, and is much easier to implement cross-language.
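The whole algorithm is a few lines of numpy; a sketch (function name is illustrative, and the missing-region mask is assumed to be available from the raw stitch):

```python
import numpy as np

def median_infill(stitch, missing):
    """Sketch of the proposal: fill the missing region of a raw
    stitch with the median of the observed pixel values."""
    out = stitch.copy()
    out[missing] = np.median(stitch[~missing])
    return out

img = np.array([[10., 20.],
                [30., 0.]])
missing = np.array([[False, False],
                    [False, True]])
filled = median_infill(img, missing)
assert filled[1, 1] == 20.0    # median of the observed pixels [10, 20, 30]
```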

factor out bin access from DataDirectory

not sure what to name the resulting class. Each represents a different interface to a data directory. One is the Fileset interface, and one is the Bin interface.

Maybe FilesetDirectory and BinDirectory?

simpler access to schema keys?

suppose a user wants to scatterplot the X/Y position from a bin on the dashboard. This works:

import ifcb
from matplotlib import pyplot as plt

URL = 'http://ifcb-data.whoi.edu/mvco/IFCB5_2016_306_130702'

with ifcb.load_url(URL, images=False) as b:
    s = b.schema
    adc = b.adc
    plt.scatter(adc[s.ROI_X], adc[s.ROI_Y])

but it's annoying to always have to do this step. How about this?

...

with ifcb.load_url(URL, images=False) as b:
    plt.scatter(b.adc[b.ROI_X], b.adc[b.ROI_Y])

The idea being that the schema keys are provisioned as attributes on the bin object. This could probably be implemented in BaseBin.
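The delegation could be sketched like this (class names are illustrative; note that the "eliminate schema keys as attributes" issue above argues against exactly this mechanism, so one of the two proposals has to win):

```python
class Schema:
    # stand-ins for real schema keys
    ROI_X = 'roiX'
    ROI_Y = 'roiY'

class BinSketch:
    """Sketch of the proposal: __getattr__ delegates uppercase names
    to the schema so b.ROI_X works without the intermediate
    s = b.schema step."""
    schema = Schema

    def __getattr__(self, name):
        # __getattr__ is only called for names not found normally
        if name.isupper():
            return getattr(self.schema, name)
        raise AttributeError(name)

b = BinSketch()
assert b.ROI_X == 'roiX'
```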

ml_analyzed error for bad last line of adc files

The MATLAB and Python versions of the ml_analyzed calculation produce an error in some cases where no inhibit time is available from the hdr file AND the last line of the adc file is bad. Previously we had a special case for ml < 0 (from the last adc line), but some cases instead produce ml >> 5 ml (not realistic). A better criterion may be to compare the 2nd and 23rd entries on the last line to see whether they differ by more than a few tens of milliseconds (the normal diff). In the MATLAB script just committed I've replaced:

if ml_analyzed(count) <= 0

with

if abs(adc.Var23(end)-adc.Var2(end)) > 0.1

I haven't fully tested this, but it works for bin D20180829T144312_IFCB125, which previously gave ml_analyzed = 32 ml, and now gives 3.7348 ml.
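A Python equivalent of the new MATLAB criterion could look like this (function name is hypothetical; Var2/Var23 are the 2nd and 23rd entries of the last ADC line, and tol=0.1 mirrors the MATLAB value):

```python
def last_adc_line_bad(var2, var23, tol=0.1):
    """Sketch of the MATLAB check above: a gap between the 2nd and
    23rd entries larger than tol seconds (vs. the normal few tens
    of milliseconds) flags a bad final ADC line."""
    return abs(var23 - var2) > tol

assert last_adc_line_bad(100.00, 100.50)        # truncated/bad last line
assert not last_adc_line_bad(100.00, 100.03)    # normal ~30 ms difference
```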

Should Pid objects sort by timestamp?

They currently sort by alpha, which is faster. The only issue this creates for sorting is with the MVCO time series, which uses old-style PIDs from two different instruments.
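A small sketch of the MVCO problem and a timestamp sort key (the pid strings below are fabricated examples of the old style, and pid_sort_key is a hypothetical parse, not Pid's API):

```python
# Illustrative old-style pids from two instruments: plain alpha sort
# groups by instrument number first, breaking chronological order.
pids = ['IFCB5_2011_001_000000', 'IFCB1_2012_001_000000']

def pid_sort_key(pid):
    # hypothetical parse of the old-style pid's timestamp fields
    _, year, yearday, hhmmss = pid.split('_')
    return (int(year), int(yearday), hhmmss)

assert sorted(pids)[0] == 'IFCB1_2012_001_000000'                    # alpha: 2012 first
assert sorted(pids, key=pid_sort_key)[0] == 'IFCB5_2011_001_000000'  # by time: 2011 first
```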

ADC column names are a mishmash, resolve somehow

ADC files have no header rows, and there are various sources of column names, including the v2 instrument software, which writes a column name header, and the dashboard, which uses a set of common column names so that e.g. the x/y position doesn't have different names in the different versions.

Decide which sources are authoritative. It seems to me that common column names are beneficial, but contradicting the column name header in the v2 HDR files seems like bad behavior for a raw data API.

There are common column numbers, and those should be considered authoritative.

The use of column names as HDF group names is problematic although highly readable.

Accessing ADC DataFrame excluding non-image targets

Right now there's no API for accessing an ADC dataframe that excludes non-image targets. This is a simple operation in Pandas:

adc = bin.adc
images_adc = adc[adc[bin.ROI_WIDTH] > 0]

Provide this on BaseBin.
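On BaseBin this could be a property; a sketch (class name, property name, and the ROI_WIDTH attribute are stand-ins for the real API):

```python
import pandas as pd

class BinSketch:
    """Sketch of providing the filtered frame on the bin itself;
    ROI_WIDTH here is a stand-in for the real schema key."""
    ROI_WIDTH = 'roi_width'

    def __init__(self, adc):
        self.adc = adc

    @property
    def images_adc(self):
        # only ADC records with a nonzero ROI, i.e. actual images
        return self.adc[self.adc[self.ROI_WIDTH] > 0]

b = BinSketch(pd.DataFrame({'roi_width': [10, 0, 20]}))
assert len(b.images_adc) == 2
```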
