joefutrelle / pyifcb
IFCB data system, generation 2
License: MIT License
The top-level API (the ifcb module) should export the most useful classes and functions.
I'm not sure what the mechanism for this is, or how best to have that reflected in the autodoc.
len(bin)
is not the same as the number of triggers, but it's being treated that way:
Line 96 in 9a4a934
Header files for new-style instruments include a list of ADC columns. The order of ADC columns is not expected to change, but new ones may be added. Users may also want to use those column names instead of constant names from the schema classes.
One approach is to use dictlike schema classes keyed by the names in the header files, but this is a problem because the ADC column names could change per-bin, depending on what the instrument manufacturer decides to put in ADC files.
It's possible that bins could inject ADC column names from headers into subclasses of the schema object they return, but that is a complicated implementation that breaks the idea that schemas can be derived from pids.
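As a concrete illustration of the dictlike idea, here's a minimal sketch (the class name, header format, and API are assumptions for illustration, not pyifcb code) keyed by whatever column names a given bin's header supplies:

```python
class HeaderSchema:
    """Sketch of a dictlike schema keyed by per-bin header column names."""

    def __init__(self, header_columns):
        # map each header-supplied column name to its column index
        self._by_name = {name: i for i, name in enumerate(header_columns)}

    def __getitem__(self, name):
        # dict-like lookup keyed by the names in the header file
        return self._by_name[name]

    @property
    def columns(self):
        # column names in ADC column order
        return list(self._by_name)
```

Because the mapping is built per-bin from the header, added or renamed columns in a future instrument revision would be picked up automatically, which is exactly why such a schema can't be derived from the pid alone.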
The cached-property strategy used throughout may be a separation-of-concerns problem for apps that have their own caching procedures. In particular, caching images with a hardcoded cache size of 2 is not appropriate for many applications.
Audit the code and remove caching unless it's used specifically for lazy evaluation of single properties.
Some cases have obviously bad info in the hdr file for ml_analyzed estimates, but other cases are less obvious and still bad. A brute-force approach is to compute ml_analyzed both ways, compare the results, and select the adc-based value when the difference is outside some tolerance. This presumes the adc value is more likely to be correct (which seems to be true from my inspection of results).
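A minimal sketch of that brute-force selection, assuming both values are already computed (the function name and tolerance are illustrative, not part of pyifcb):

```python
def select_ml_analyzed(hdr_value, adc_value, tolerance=0.5):
    """Prefer the hdr-based estimate, but fall back to the adc-based
    value when the two disagree by more than `tolerance` (in ml).

    Sketch only: the 0.5 ml tolerance is a placeholder that would
    need tuning against real data.
    """
    if abs(hdr_value - adc_value) > tolerance:
        # assume the adc-based value is more likely to be correct
        return adc_value
    return hdr_value
```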
There's no reason for it to have to be a @property method.
Need to define how to access a bin, independent of the underlying storage; that API will then be an ABC implemented variously.
It raises pandas.errors.EmptyDataError; it should probably return an empty DataFrame instead. Callers will still have to deal with an empty DataFrame.
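A hedged sketch of the proposed behavior (the wrapper name is an assumption):

```python
import pandas as pd
from pandas.errors import EmptyDataError

def read_adc(path_or_buffer):
    """Return an empty DataFrame instead of propagating
    EmptyDataError when the ADC file has no rows. Sketch only."""
    try:
        return pd.read_csv(path_or_buffer, header=None)
    except EmptyDataError:
        return pd.DataFrame()
```

Callers can then branch on `df.empty` rather than wrapping every read in a try/except.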
Try a file with >10k ROIs
remote_bin currently downloads all three files in a fileset even if, e.g., no image data is requested.
It should perform that operation lazily, to speed up cases where image (or even ADC) data is not needed.
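One way to sketch the lazy behavior, assuming a per-file fetch callable (all names here are illustrative, not the remote_bin API):

```python
class LazyFileset:
    """Sketch: defer each download until the corresponding file is
    actually accessed, and cache the result for reuse."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that downloads one file by extension
        self._cache = {}

    def _get(self, ext):
        if ext not in self._cache:
            self._cache[ext] = self._fetch(ext)  # download on first access
        return self._cache[ext]

    @property
    def adc(self):
        return self._get('adc')

    @property
    def hdr(self):
        return self._get('hdr')

    @property
    def roi(self):
        return self._get('roi')
```

With this shape, a caller that only touches `hdr` never pays for the (much larger) ROI download.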
Use a different format than the current zip format. Don't stitch. Include the entire ADC and HDR files instead of a modified CSV file. Format images as PNGs, using their LIDs as entry names.
The new implementation will live here:
https://github.com/joefutrelle/pyifcb/blob/master/ifcb/metrics/ml_analyzed.py
In the current rev of the dashboard, there is an optimized procedure for reading a single image from a fileset, for web requests where it is not possible to leave the file open.
It skips CSV parsing of the ADC file for everything except the line containing the relevant ADC record.
This is important for any implementation that reads a small number of images from a large number of files.
Pandas supplies options in read_csv that skip rows at the beginning and end; these will need investigation, since they seem to have changed a bit in recent releases. Also: I'm not sure whether they actually skip parsing those rows or just parse them and throw the data out (which is slower).
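As a sketch of the skiprows/nrows approach (not the dashboard's actual implementation; whether the skipped rows are fully parsed internally is the open question noted above):

```python
import io
import pandas as pd

def read_one_adc_record(adc_source, target_index):
    """Read a single record (0-based row index) from a headerless
    ADC CSV without building a DataFrame for the whole file."""
    return pd.read_csv(adc_source, header=None,
                       skiprows=target_index, nrows=1)

# e.g., the third record of a small headerless ADC-style file
row = read_one_adc_record(io.StringIO("1,10\n2,20\n3,30\n"), 2)
```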
This inconsistency needs to be resolved, as it is already creating divergence between Zip and HDF5 support: the HDF5 support, built on the non-public Fileset APIs, writes out all ADC records instead of excluding 0x0 ones.
The consistent behavior would be to always exclude, or always include, 0x0 ADC records in the adc property and the dict-like interface.
Consistent exclusion is probably the most convenient approach; if other ADC data is desired, it can be accessed through the lower-level AdcFile class. It is also slightly more efficient, in that 0x0 ADC rows will simply not be written out and so won't have to be parsed later.
A problem with consistent exclusion is the inability to round-trip raw data from the various formats, and that may be an argument for making the adc property expose all ADC rows, including 0x0 rows.
Current bin implementations require that some backing file exist in order to provide image access. All images can be iterated over using iteritems or related methods, but there's no simple way to "detach" from the underlying files and still have all image data accessible.
There should be an in-memory bin that provides access to all bin data and metadata in memory, and BaseBin should be able to produce one.
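A rough sketch of what such an in-memory bin might look like (the attribute names are assumptions about the bin interface, not the real BaseBin):

```python
class InMemoryBin:
    """Sketch of a bin detached from its backing files: everything is
    copied into plain in-memory structures up front."""

    def __init__(self, pid, adc, headers, images):
        self.pid = pid          # bin identifier
        self.adc = adc          # ADC data (e.g., a DataFrame)
        self.headers = headers  # header metadata dict
        self.images = images    # target number -> image array

    @classmethod
    def from_bin(cls, source):
        # eagerly copy everything so the source files can be closed
        return cls(source.pid, source.adc, dict(source.headers),
                   {n: img for n, img in source.images.items()})
```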
Some new-style data has "stitches" in the ADC file where one ROI is completely inside the other ROI, but contains a different part of the camera field.
To prevent this, InfilledImages should work, but refuse to stitch any new-style data.
We're having to work around the current behavior in ifcbdb.
Right now stitching just returns images; other than the coordinates property on Stitcher, there is no way to get a modified ADC record with correct image metrics. How should this be handled (if at all)?
Right now there is a special getattr implementation on bins that provides schema keys. This is an awkward encapsulation violation and should be backed out, including updating the documentation.
It makes more sense for callers to get the schema keys via bin.schema.
pysmb does not appear to be actively developed; switch to smbprotocol instead.
The open method should not be supported for bins; instead, constructing a bin should "open" it, even if underlying operations are lazy. A close method, which should be called by __exit__, then needs to release all underlying resources that are "open".
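A minimal sketch of that lifecycle (class and attribute names are illustrative, not pyifcb's):

```python
class BinLifecycleSketch:
    """Sketch: construction 'opens' the bin (possibly lazily),
    close() releases resources, and __exit__ delegates to close()."""

    def __init__(self, open_resource):
        self._resource = open_resource()  # "opening" happens at construction
        self.closed = False

    def close(self):
        # idempotent release of underlying resources
        if not self.closed:
            self._resource.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

Note there is no open() method: a constructed bin is by definition open, which removes an entire class of "used before open" errors.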
It dies on this line because it should say self.version
This appears to primarily be an issue with find_product_file, where it assumes that all files returned by os.scandir exist. But os.scandir will list broken symbolic links.
os.scandir returns directory entries; it looks like is_file can be used to determine whether a symbolic link is broken.
The desired behavior is to raise a KeyError if the product file is a broken symbolic link, just as if no existing file were found.
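A sketch of that behavior using DirEntry.is_file(), which follows symlinks by default and so returns False for a dangling link. The function shape is an assumption about find_product_file, not the actual pyifcb signature:

```python
import os

def find_product_file(directory, name):
    """Return the path of the named file, treating broken symlinks
    exactly like missing files. Sketch only."""
    for entry in os.scandir(directory):
        # is_file() follows symlinks, so a broken link fails this check
        if entry.name == name and entry.is_file():
            return entry.path
    # missing file or broken symlink: behave as if not found
    raise KeyError(name)
```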
Or is raw data access too early for such a move?
If the dictlike interface excludes them, what about the adc property?
If I'm not mistaken, the dashboard database now includes a field called something like sample_type (with entries such as underway, cast, underway_discrete, beads), but the export_metadata API (e.g., https://ifcb-data.whoi.edu/api/export_metadata/SPIROPA) does not include that column. For my current use, that info would be very helpful.
P.S. Not sure if I'm posting this issue in the correct repo... feel free to move it if appropriate.
Fileset should just describe the files and the OO interfaces should be moved to FilesetBin.
As an alternative to the InfilledImages class checked into ifcb-analysis, a proposed algorithm is to fill the missing regions in a raw stitch with the median value of the image data.
This does not seem to underperform the complex v1 stitching algorithm, and is much easier to implement cross-language.
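The median-fill idea is small enough to sketch directly (the argument conventions, a grayscale array plus a boolean mask of missing pixels, are assumptions):

```python
import numpy as np

def infill_with_median(image, missing):
    """Fill missing pixels with the median of the known pixels.

    image: 2-D array of pixel values
    missing: boolean mask, True where data is absent from the raw stitch
    """
    filled = image.copy()
    # single scalar fill value: the median of all known pixels
    filled[missing] = np.median(image[~missing])
    return filled
```

The entire algorithm is two array operations, which is what makes it easy to port to MATLAB or any other language.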
Instead of a MATLAB struct, use two cell arrays (or a char array and a cell array), one called header_names and one called header_values.
Could we have an API that provides per-dataset results as counts in a matrix of bin rows and class (and any relevant class-tag combos) columns?
P.S. Is this the correct repo for this issue?
ADC schemas are minimal right now, complete them.
There are numerous issues with that, including a dependency on functools32 and the use of StringIO.
On this line, ClassScoresFile opens the HDF5 file in h5py's default mode, which is 'a'. It should explicitly open it in mode 'r'; otherwise the file modification date will be updated by the OS.
I'm not sure what to name the resulting classes. Each represents a different interface to a data directory: one is the Fileset interface, and one is the Bin interface.
Maybe FilesetDirectory and BinDirectory?
Suppose a user wants to scatterplot the X/Y position from a bin on the dashboard. This works:
import ifcb
from matplotlib import pyplot as plt

URL = 'http://ifcb-data.whoi.edu/mvco/IFCB5_2016_306_130702'
with ifcb.load_url(URL, images=False) as b:
    s = b.schema
    adc = b.adc
    plt.scatter(adc[s.ROI_X], adc[s.ROI_Y])
but it's annoying to always have to do this step. How about this?
...
with ifcb.load_url(URL, images=False) as b:
    plt.scatter(b.adc[b.ROI_X], b.adc[b.ROI_Y])
The idea being that the schema keys are provisioned as attributes on the bin object. This could probably be implemented in BaseBin.
I'm using assert in non-testing code. This is going to be incorrect usage in almost every case.
http://stackoverflow.com/questions/944592/best-practice-for-python-assert
In testing code, it's OK for now.
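For example, an assert guarding real input can be replaced with an explicit exception, since asserts are stripped when Python runs with optimization (`python -O`). The function and message here are illustrative, not from the codebase:

```python
def require_adc(adc):
    """Validate input with an explicit exception rather than assert.

    # BAD: assert adc is not None, 'no ADC data'  -- vanishes under -O
    """
    if adc is None:
        raise ValueError('no ADC data')  # always enforced
    return adc
```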
Instead of the module currently called "imageio" in pyifcb, use the imageio package.
The MATLAB and Python versions of the ml_analyzed calculation produce an error in some cases where no inhibit time is available from the hdr file AND the last line of the adc file is bad. Previously we had a special case for ml < 0 (from the last adc line), but some cases instead produce ml >> 5 ml (not realistic). I think the better criterion may be to compare the 2nd and 23rd entries on the last line to see if they differ by more than a few tens of milliseconds (the normal diff). In the MATLAB script just committed I've replaced:
if ml_analyzed(count) <= 0
with
if abs(adc.Var23(end)-adc.Var2(end)) > 0.1
I haven't fully tested this, but it works for bin D20180829T144312_IFCB125, which previously gave ml_analyzed = 32 ml and now gives 3.7348 ml.
Just as there's an HDF representation, there should be a MATLAB representation available for raw data.
They currently sort alphabetically, which is faster. The only issue this creates for sorting is with the MVCO time series, which has old-style PIDs from two different instruments.
ADC files have no header rows, and there are various sources of column names, including the v2 instrument software, which writes a column name header, and the dashboard, which uses a set of common column names so that, e.g., the x/y position don't have different names in the different versions.
Decide which sources are authoritative. It seems to me that common column names are beneficial, but contradicting the column name header in the v2 HDR files seems like bad behavior for a raw data API.
There are common column numbers, and those should be considered authoritative.
The use of column names as HDF group names is problematic although highly readable.
Right now there's no API for accessing an ADC dataframe that excludes non-image targets. This is a simple operation in Pandas:
adc = bin.adc
images_adc = adc[adc[bin.ROI_WIDTH] > 0]
Provide this on BaseBin.
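A hedged sketch of such a helper as a free function (the real API would presumably live on BaseBin, and the column naming here is an assumption):

```python
import pandas as pd

def images_adc(adc, width_column):
    """Keep only ADC rows that describe actual images,
    i.e., those with a nonzero ROI width. Sketch only."""
    return adc[adc[width_column] > 0]
```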