file-catalog-indexer's Introduction

file-catalog-indexer

Indexing package and scripts for the File Catalog

How To

API

from indexer.index import index

  • The flagship indexing function
  • Find files rooted at given path(s), compute their metadata, and upload it to File Catalog
  • Configurable for multi-processing (default: 1 process) and recursive file-traversing (default: on)
  • Internally communicates asynchronously with File Catalog
  • Note: Symbolic links are never followed.
  • Note: index() runs the current event loop (asyncio.get_event_loop().run_until_complete())
  • Ex:
index(
    index_config,  # see config.py for a description of the fields in these typed dictionaries
    oauth_config,
    rest_config,
)

from indexer.index import index_file

  • Compute metadata of a single file, and upload it to File Catalog, i.e. index one file
  • Single-processed, single-threaded
await index_file(
    filepath='/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74/Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst',
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)

from indexer.index import index_paths

  • A wrapper around index_file() which indexes multiple files, and returns any nested sub-directories
  • Single-processed, single-threaded
  • Note: Symbolic links are never followed.
sub_dirs = await index_paths(
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)
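
Because index_paths() returns the nested sub-directories instead of descending into them, a caller can drive the traversal itself. The following is a minimal sketch of one way to do that, breadth-first. The helper name index_tree and the index_paths_fn parameter are hypothetical and are not part of the indexer package; this is not necessarily how index() is implemented.

```python
import asyncio
from collections import deque


async def index_tree(roots, index_paths_fn):
    """Index `roots`, then keep indexing whatever sub-directories come back.

    `index_paths_fn` is a stand-in (assumption): any coroutine function that
    takes a list of paths and returns a list of nested sub-directories.
    Returns the total number of paths handed to `index_paths_fn`.
    """
    queue = deque([list(roots)])
    handled = 0
    while queue:
        batch = queue.popleft()
        sub_dirs = await index_paths_fn(batch)
        handled += len(batch)
        if sub_dirs:
            # the returned sub-directories become the next batch
            queue.append(list(sub_dirs))
    return handled
```

With the real function, index_paths_fn could be a small wrapper such as `lambda paths: index_paths(paths=paths, manager=manager, fc_rc=fc_rc)`, using the keyword arguments shown above.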

from indexer.metadata_manager import MetadataManager

  • The internal brain of the Indexer. This has minimal guardrails, does not communicate with File Catalog, and does not traverse the directory tree.
  • Metadata is produced for one file at a time.
  • Ex:
manager = MetadataManager(...)  # caches connections & directory info, manages metadata collection
metadata_file = manager.new_file(filepath)  # returns an instance (computationally light)
metadata = metadata_file.generate()  # returns a dict (computationally intense)

Scripts

python -m indexer.index
  • A command-line alternative to using from indexer.index import index
  • Use with -h to see usage.
  • Note: Symbolic links are never followed.
python -m indexer.generate
  • Like python -m indexer.index, but prints (using pprint) the metadata instead of posting to File Catalog.
  • In short, it wraps file-traversing logic around calls to indexer.metadata_manager.MetadataManager
  • Note: Symbolic links are never followed.
python -m indexer.delocate
  • Find files rooted at given path(s); for each, remove the matching location entry from its File Catalog record.
  • Note: Symbolic links are never followed.

.i3 File Processing-Level Detection and Embedded Filename-Metadata Extraction

Regex is used heavily to detect the processing level of a .i3 file, and extract any embedded metadata in the filename. The exact process depends on the type of data:

Real Data (/data/exp/*)

This is a two-stage process (see MetadataManager._new_file_real()):

  1. Processing-Level Detection (Base Pattern Screening)
    • The filename is matched against multiple generic patterns to detect whether it is L2, PFFilt, PFDST, or PFRaw
    • If the filename does not trigger a match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
  2. Embedded Filename-Metadata Extraction
    • After the processing level is known, the filename is parsed using one of (possibly) several tokenizing regex patterns for the best match (greedy matching)
    • If the filename does not trigger a match, the function will raise an exception (script will exit). This probably indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.real.filename_patterns

Simulation Data (/data/sim/*)

This is a three-stage process (see MetadataManager._new_file_simulation()):

  1. Base Pattern Screening
    • The filename is checked for .i3 file extensions: .i3, .i3.gz, .i3.bz2, .i3.zst
    • If the filename does not trigger a match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
      • There are a couple of hard-coded "anti-patterns" used for rejecting known false-positives (see code)
  2. Embedded Filename-Metadata Extraction
    • The filename is parsed using one of MANY (around a thousand) tokenizing regex patterns for the best match (greedy matching)
    • If the filename does not trigger a match, the function will raise an exception (script will exit). This probably indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.sim.filename_patterns
  3. Processing-Level Detection
    • The filename is parsed for substrings corresponding to a processing level
      • see DataSimI3FileMetadata.figure_processing_level()
    • If there is no match, processing_level will be set to None, since the processing level is less important for simulation data.
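
As a rough sketch of the three stages above: the extension list comes from this section, but the tokenizing pattern, the substring table, and all function names are hypothetical stand-ins for the real code in indexer.metadata.sim.filename_patterns and DataSimI3FileMetadata. The anti-patterns are omitted.

```python
import re

I3_EXTENSIONS = (".i3", ".i3.gz", ".i3.bz2", ".i3.zst")

# Hypothetical: a single tokenizing pattern instead of around a thousand.
SIM_TOKEN_PATTERNS = [
    re.compile(r"(?P<prefix>.+)\.(?P<dataset>\d{6})\.(?P<job>\d{6})\.i3"),
]

# Hypothetical substring-to-level table.
LEVEL_SUBSTRINGS = {"level2": "L2", "generated": "Generated"}


def is_i3_file(filename):
    """Stage 1: base pattern screening by file extension."""
    return filename.endswith(I3_EXTENSIONS)


def extract_sim_tokens(filename):
    """Stage 2: parse embedded metadata; raise if no pattern fits."""
    for pattern in SIM_TOKEN_PATTERNS:
        match = pattern.match(filename)
        if match:
            groups = match.groupdict()
            return {"dataset": int(groups["dataset"]), "job": int(groups["job"])}
    raise ValueError(f"no simulation filename pattern matches {filename!r}")


def guess_processing_level(filename):
    """Stage 3: look for a level substring; None is an acceptable result."""
    lowered = filename.lower()
    for substring, level in LEVEL_SUBSTRINGS.items():
        if substring in lowered:
            return level
    return None


def describe(filename):
    """Run all three stages and summarize the outcome as a dict."""
    if not is_i3_file(filename):
        return {"basic_only": True}
    tokens = extract_sim_tokens(filename)
    return {
        "basic_only": False,
        **tokens,
        "processing_level": guess_processing_level(filename),
    }
```

Note the ordering difference from real data: here the processing level is determined last and is allowed to be None.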

Metadata Schema

See:

Warnings

Re-indexing Files is Tricky (Two Scenarios)

  1. Re-indexing files that have not changed locations is okay; an unchanged location probably means that the rest of the metadata has not changed either. A guardrail query checks whether the file already exists in the FC with that locations entry and, if so, the file is not processed further.
  2. HOWEVER, do not point the indexer at restored files (of the same file-version), i.e. files whose initial locations entry was removed (removed from WIPAC, then moved back). Unlike an unchanged file, a restored file is fully processed locally (opened, read, and checksummed) before the checksum conflict is encountered and indexing aborts. These files are skipped (not sent to the FC) unless you use --patch, which replaces the locations list wholesale and is DANGEROUS.
    • Example conflict: a file-version can still exist in the FC even though the file passes the initial guardrails
      1. file was at WIPAC & indexed
      2. then moved to NERSC (location added) & deleted from WIPAC (location removed)
      3. file was brought back to WIPAC
      4. now is being re-indexed at WIPAC
      5. CONFLICT -> has the same logical_name+checksum.sha512 but differing locations
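
The two scenarios can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not the indexer's actual logic: the File Catalog is a plain list of dicts, a location is a bare string such as "WIPAC", and reindex_decision and its return values are hypothetical names.

```python
def reindex_decision(fc_records, logical_name, location, compute_checksum, patch=False):
    """Decide what happens to one file; mutates `fc_records` when posting/patching."""
    # Guardrail: a cheap lookup by logical_name + location, before any file I/O.
    for record in fc_records:
        if record["logical_name"] == logical_name and location in record["locations"]:
            return "skipped-unchanged"

    # Past the guardrail, the file is read and checksummed (the expensive part).
    checksum = compute_checksum()

    # Same logical_name + checksum but a different locations list: conflict.
    for record in fc_records:
        if record["logical_name"] == logical_name and record["checksum"] == checksum:
            if patch:
                # Wholesale replacement: every other location is lost.
                record["locations"] = [location]
                return "patched"
            return "skipped-conflict"

    fc_records.append(
        {"logical_name": logical_name, "checksum": checksum, "locations": [location]}
    )
    return "posted"
```

Replaying the example conflict with this model, the restored file costs a full checksum computation and is then skipped; with patch=True the NERSC location is dropped from the record, which is why patching is dangerous.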

Tools

There is a script to help determine if a file tree contains softlinks.

python3 -m resources.softlink /path/to/indexing/root
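
For readers without access to that script, the same kind of check can be sketched with the standard library. This is an independent illustration, not the contents of resources.softlink; find_softlinks is a hypothetical name.

```python
import os


def find_softlinks(root):
    """Return a sorted list of every symbolic link found under `root`.

    Links are reported but never followed, so a link to a directory is
    listed once and its target is not descended into through the link.
    """
    links = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # os.walk lists symlinks to directories in dirnames and other
        # symlinks (including broken ones) in filenames
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append(path)
    return sorted(links)
```

An empty result means the tree can be indexed without any file being silently skipped for being a link.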

file-catalog-indexer's Issues

Add "Good Run(s)" Field

def _get_events_data(self) -> types.EventsData:

Currently, "content_status" is solely based on whether the .i3 file can be read.

There are also the "good runs" list files. Do we want to consider these? This could be a new field in the FC record.

Alternatively, we could wait until we have an event-based store, since this matches the "good run" granularity.

By Default Don't Patch

Replace --no-patch with --patch. Since not patching is the most common usage, it shouldn't require a command-line option.
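
A minimal sketch of the proposed opt-in flag using argparse. The parser below is a hypothetical stand-in, not the indexer's real command-line interface; only the --patch flag itself comes from this issue.

```python
import argparse


def build_parser():
    """Build a toy parser where patching is off unless explicitly requested."""
    parser = argparse.ArgumentParser(description="sketch of an opt-in --patch flag")
    parser.add_argument("paths", nargs="+", help="path(s) to index")
    parser.add_argument(
        "--patch",
        action="store_true",  # absent -> False, present -> True
        default=False,
        help="replace existing File Catalog entries (dangerous)",
    )
    return parser
```

With store_true, the common case needs no option at all, and the risky behavior has to be requested by name.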

Publish to PyPI with `wipac-cicd.yml`

We can now use:

At a minimum, we will only need the flake8 and mypy jobs in wipac-cicd.yml. Packaging the repo (which means publishing to PyPI, etc.) is not strictly necessary, but it could have benefits and is tempting, so let's do it (it is required for @blinkdog's new disk pipeline).

See WIPACrepo/wipac-dev-tools#20

Edit: Indexer will now be published as a package (see #43)

L2 Indexing Race Condition

There's a potential race condition when indexing L2 files if the client script calls index_file() directly and shares a single MetadataManager instance between threads. This isn't an issue when using index().

# get directory's metadata
file_dir_path = os.path.dirname(os.path.abspath(file.path))
if (not self.real_l2_dir_metadata) or (file_dir_path != self.dir_path):
    self.dir_path = file_dir_path

Solutions include

  • adding a threading.Lock() context manager around the above code
  • creating an instance-attribute dict (self.L2_dir_data) keyed on dir_path (instead of a single self.dir_path & self.real_l2_dir_metadata)
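
Both ideas can be combined in one small helper. This is a sketch only: L2DirMetadataCache and the load_dir_metadata callable are hypothetical names, and the real MetadataManager may need a different shape.

```python
import os
import threading


class L2DirMetadataCache:
    """Thread-safe, per-directory cache of L2 directory metadata."""

    def __init__(self, load_dir_metadata):
        # `load_dir_metadata` (assumption): callable taking a directory path
        # and returning that directory's metadata.
        self._load = load_dir_metadata
        self._lock = threading.Lock()
        self._by_dir = {}  # dir_path -> metadata

    def get(self, filepath):
        """Return the metadata for the directory containing `filepath`."""
        dir_path = os.path.dirname(os.path.abspath(filepath))
        with self._lock:
            # Keyed on dir_path, so threads working in different
            # directories never overwrite each other's entry.
            if dir_path not in self._by_dir:
                self._by_dir[dir_path] = self._load(dir_path)
            return self._by_dir[dir_path]
```

Holding the lock while loading keeps the sketch simple and guarantees one load per directory, at the cost of serializing loads across directories.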
