energia*

In some of our research, we've come across many JSONified nested arrays (ElasticSearch results, cough cough). To interpret the data more readily, we needed a way to characterize the structure of a nested array: signatures, relative complexity, and so forth. The reasoning is:

  1. There are probably many known datasets with a predictable signature that we would like to deprioritize, e.g. example datasets, generic logs, and so forth.
  2. Several of the ElasticSearch results could have been discovered simply by looking for complex nested arrays, e.g. ones with a deep, inconsistent, or broad structure.

Signatures

Still ruminating. Fuzzy hashing, maybe. But that could be hard on our workflow unless we add a new datastore and are comfortable with O(N^2) complexity on certain checks.

Complexity

Partially inspired by Kolmogorov complexity, we implement a five-metric scoring system which should let us distill how complex the document structures in ElasticSearch results are:

Approximate array dimensions:

  • Count how many items are in the widest array at all depths, returning a list of depth->sum(items) (shape)
  • ... and count how many total arrays there are in this nested array (breadth)
  • maybe others?
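
A minimal Python sketch of the two dimension metrics, under assumptions not stated above: an "array" is a dict or list parsed from JSON, the top-level container sits at depth 0 and is included in breadth, and shape is read as the total item count per depth (the sum, rather than only the single widest array). The name `shape_and_breadth` is hypothetical.

```python
from collections import defaultdict


def shape_and_breadth(doc):
    """Return (shape, breadth) for a nested dict/list structure.

    shape   -- dict mapping depth -> number of items found at that depth
    breadth -- total number of containers (dicts and lists), root included
    """
    shape = defaultdict(int)
    breadth = 0

    def walk(node, depth):
        nonlocal breadth
        breadth += 1  # every dict or list we enter is one more array
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            shape[depth] += 1
            if isinstance(child, (dict, list)):
                walk(child, depth + 1)

    walk(doc, 0)
    return dict(shape), breadth


doc = {"a": 1, "b": {"c": [1, 2, 3], "d": "x"}}
print(shape_and_breadth(doc))
```

Here the root holds two items, the array under `b` holds two, and the list under `c` holds three, so the document has three arrays in total.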

Document structural complexity (DSM):

  • Start a counter at 0
  • For each value that maps to another array, add 1
  • For all other values, add 0.1
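
The counting rule above could be sketched in Python as follows. This assumes that "array" means a dict or list parsed from JSON and that the count recurses into every nested array, which the steps imply but do not state. The function name `dsm` is hypothetical.

```python
def dsm(doc):
    """Document structural complexity of a nested dict/list structure."""
    score = 0.0
    values = doc.values() if isinstance(doc, dict) else doc
    for value in values:
        if isinstance(value, (dict, list)):
            # A value that is itself an array costs 1, plus whatever it contains.
            score += 1 + dsm(value)
        else:
            # Any other value (string, number, bool, null) costs 0.1.
            score += 0.1
    return score


doc = {"a": 1, "b": {"c": [1, 2, 3], "d": "x"}}
print(round(dsm(doc), 1))
```

For this document there are two nested arrays (1 each) and five plain values (0.1 each), giving 2.5. The score is a float sum, so rounding before display or comparison avoids floating-point noise.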

Item duplication-averse DSM (IDA DSM):

  • Remove any key->value pairs that are duplicated (removing the duplicates but not the original)
  • Recompute DSM
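
One possible reading of IDA DSM in Python. The assumptions here are mine: duplicates are detected across the whole document in traversal order, list positions are treated as keys (as in a PHP-style array), a key->array pair counts as a duplicate when its entire subtree is identical, and DSM recurses into nested arrays. The helper names `pairs`, `drop_duplicate_pairs`, and `ida_dsm` are hypothetical.

```python
import json


def pairs(node):
    """Yield (key, value) pairs; list positions act as keys."""
    return node.items() if isinstance(node, dict) else enumerate(node)


def dsm(doc):
    """Add 1 per value that is an array, 0.1 per other value, recursively."""
    return sum(
        1 + dsm(v) if isinstance(v, (dict, list)) else 0.1 for _, v in pairs(doc)
    )


def drop_duplicate_pairs(node, seen=None):
    """Copy node, keeping only the first occurrence of each key->value pair."""
    seen = set() if seen is None else seen
    kept = []
    for key, value in pairs(node):
        # Serialize the value so that whole subtrees can be compared cheaply.
        fingerprint = (key, json.dumps(value, sort_keys=True))
        if fingerprint in seen:
            continue  # a later duplicate: drop it, the original stays
        seen.add(fingerprint)
        if isinstance(value, (dict, list)):
            value = drop_duplicate_pairs(value, seen)
        kept.append((key, value))
    return dict(kept) if isinstance(node, dict) else [v for _, v in kept]


def ida_dsm(doc):
    return dsm(drop_duplicate_pairs(doc))


doc = {"hits": [{"user": "a", "ok": True},
                {"user": "a", "ok": True},
                {"user": "b", "ok": True}]}
print(round(dsm(doc), 1), round(ida_dsm(doc), 1))
```

In this sketch an array whose pairs were all removed stays behind as an empty array and still contributes 1, so repeated records lower the score without vanishing from it.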

Type duplication-averse DSM (TDA DSM):

  • For any key that is not mapping to another array, make the value a string representation of what type it is ('int', 'str', etc.)
  • Remove any key->value pairs that are duplicated elsewhere (removing the duplicates but not the original), ignoring key->array mappings
  • Recompute DSM
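
A hedged Python sketch of TDA DSM. It assumes Python type names (`type(value).__name__`) as the string representation, document-wide duplicate detection in traversal order, list positions acting as keys, and a DSM that recurses into nested arrays. The helper names `pairs`, `typed_and_pruned`, and `tda_dsm` are hypothetical.

```python
def pairs(node):
    """Yield (key, value) pairs; list positions act as keys."""
    return node.items() if isinstance(node, dict) else enumerate(node)


def dsm(doc):
    """Add 1 per value that is an array, 0.1 per other value, recursively."""
    return sum(
        1 + dsm(v) if isinstance(v, (dict, list)) else 0.1 for _, v in pairs(doc)
    )


def typed_and_pruned(node, seen=None):
    """Replace scalars by their type name, then drop repeated key->type pairs."""
    seen = set() if seen is None else seen
    kept = []
    for key, value in pairs(node):
        if isinstance(value, (dict, list)):
            # key->array mappings are never treated as duplicates
            kept.append((key, typed_and_pruned(value, seen)))
            continue
        pair = (key, type(value).__name__)  # e.g. ("user", "str")
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return dict(kept) if isinstance(node, dict) else [v for _, v in kept]


def tda_dsm(doc):
    return dsm(typed_and_pruned(doc))


doc = {"hits": [{"user": "a", "ok": True},
                {"user": "b", "ok": False},
                {"user": 7, "ok": True}]}
print(round(tda_dsm(doc), 1))
```

The second record differs from the first only in its values, so it contributes nothing beyond its own array; the third survives in part because `user` changes type from `str` to `int`.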

Pile of bones DSM (POB DSM):

  • Flatten nested array into one exceptionally long array
  • For any keys that should have mapped to an array, recreate them as null
  • Remove any key->value pairs that are duplicated elsewhere (removing both the original and all duplicates)
  • Recompute DSM
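
A rough Python sketch of POB DSM, with several assumptions of my own: the flattened array is a list of (key, value) pairs, list positions act as keys, and values are compared by their JSON serialization so that `true` and `1` stay distinct. After flattening nothing maps to an array any more, so recomputing DSM reduces to 0.1 per surviving pair. The names `pairs`, `flatten`, and `pob_dsm` are hypothetical.

```python
import json
from collections import Counter


def pairs(node):
    """Yield (key, value) pairs; list positions act as keys."""
    return node.items() if isinstance(node, dict) else enumerate(node)


def flatten(node, out=None):
    """Flatten a nested structure into one long list of (key, value) pairs."""
    out = [] if out is None else out
    for key, value in pairs(node):
        if isinstance(value, (dict, list)):
            out.append((key, None))  # a key that mapped to an array becomes null
            flatten(value, out)
        else:
            out.append((key, value))
    return out


def pob_dsm(doc):
    flat = [(key, json.dumps(value)) for key, value in flatten(doc)]
    counts = Counter(flat)
    # Keep only pairs that occur exactly once: originals go along with duplicates.
    survivors = [pair for pair in flat if counts[pair] == 1]
    return len(survivors) / 10  # each surviving non-array value scores 0.1


doc = {"hits": [{"user": "a", "ok": True},
                {"user": "b", "ok": False},
                {"user": 7, "ok": True}]}
print(pob_dsm(doc))
```

One side effect of this reading is that a key holding a literal null is indistinguishable from a key that used to hold an array, since both flatten to the same pair.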

Contributors

tweedge
