fastq-elasticsearch

Collects metadata about FASTQ files and stores it in Elasticsearch.

Build

mvn clean install

This generates the bundle that contains all dependencies.

Start

The fastq-elastic.sh script can be used to start the app from a console.

Configuration

The metadata about the sample files is stored as JSON documents. Before starting the fastq-elastic tool, prepare the mapping in Elasticsearch. The mapping configuration can be found in the git repo (src/main/resources/sampledb-index.json).
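One way to load the mapping is a single PUT request against the index. The following is a minimal Python sketch, not part of this project. It assumes the index is named sampledb (the name used by the queries in the cheat sheet), that Elasticsearch listens on localhost:9200, and that sampledb-index.json holds a complete create-index body. The function name build_create_index_request is hypothetical.

```python
import json
import urllib.request


def build_create_index_request(mapping, host="localhost", port=9200, index="sampledb"):
    """Build (but do not send) the PUT request that creates the index.

    The host, port and index defaults are assumptions, not project settings.
    """
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Usage (requires a running Elasticsearch instance):
#   with open("src/main/resources/sampledb-index.json", encoding="utf-8") as fh:
#       request = build_create_index_request(json.load(fh))
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching the cluster.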

The next step is to configure the fastq-elastic tool. Set your custom values in the sample.conf file.

{
    elastic.host = localhost
    elastic.port = 9200

    # Supported file types
    file.extensions = [fastq.gz]

    # List of folders that should be parsed
    folders.root = [
        /sample/folder1,
        /sample/folder2
    ]

    # List of ignored folders
    folders.exclusive = []
}
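To illustrate how these settings could interact, here is a minimal Python sketch of a folder scan: walk every root folder, skip the ignored folders, and keep files whose names end with a supported extension. This is an assumption based on the comments in the config, not the tool's actual implementation, and find_sample_files is a hypothetical name.

```python
import os


def find_sample_files(roots, extensions, excluded=()):
    """Return sorted file paths under roots that match extensions.

    Folders listed in excluded are never descended into.
    """
    excluded_paths = {os.path.abspath(p) for p in excluded}
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    found = []
    for root in roots:
        for current, dirs, files in os.walk(root):
            # Prune excluded folders in place so os.walk skips them entirely.
            dirs[:] = [
                d for d in dirs
                if os.path.abspath(os.path.join(current, d)) not in excluded_paths
            ]
            for name in files:
                if name.endswith(suffixes):
                    found.append(os.path.join(current, name))
    return sorted(found)


# Usage with the values from the config above:
#   find_sample_files(["/sample/folder1", "/sample/folder2"], ["fastq.gz"])
```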

Cheat sheet

The most interesting part of the fastq-elastic service is what data can be retrieved from Elasticsearch, and how. The following section shows some queries that can be run from the Kibana console.

Another, more general cheat sheet is available at http://elasticsearch-cheatsheet.jolicode.com/.

Count the number of samples

GET sampledb/_doc/_count
{
  "query": {
    "wildcard": {
      "sample.samplePath": "*"
    }
  }
}

Get sample files whose names start with 'XXX-KM-34_S34'

GET sampledb/_doc/_search
{
  "query": {
    "wildcard": {
      "sample.sampleName.exact": "XXX-KM-34_S34*"
    }
  }
}

Get all sample files whose name contains 'XXX5S' and whose file length is at least 30 MB

GET sampledb/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "sample.sampleName.exact": "*XXX5S*"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "sample.fileLength": {
              "gte": "30000000"
            }
          }
        }
      ]
    }
  }
}
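The same request body can be built programmatically when the name pattern and size threshold vary. The sketch below is in Python; the function name build_name_and_size_query is hypothetical, while the field names are taken from the query above. The wildcard clause sits in must and the size limit in filter, since filter clauses restrict the result set without contributing to the relevance score.

```python
import json


def build_name_and_size_query(name_pattern, min_bytes):
    """Build a bool query: wildcard match on the name, lower bound on the size."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"wildcard": {"sample.sampleName.exact": name_pattern}}
                ],
                "filter": [
                    {"range": {"sample.fileLength": {"gte": min_bytes}}}
                ],
            }
        }
    }


# Print the JSON body for the example above (30 MB as 30,000,000 bytes).
print(json.dumps(build_name_and_size_query("*XXX5S*", 30 * 1000 * 1000), indent=2))
```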

Get the 20 most frequently duplicated sample file names

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "distinct_sample": {
      "terms": {
        "field": "sample.sampleName.exact",
        "size": 20
      }
    }
  }
}
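The terms aggregation returns the most frequent names, including names that occur only once if fewer than 20 are duplicated. A minimal Python sketch for keeping only real duplicates from the response follows. The function name duplicated_names and the sample response are hypothetical; the bucket layout (key, doc_count) is the standard shape of a terms aggregation result.

```python
def duplicated_names(response, agg_name="distinct_sample"):
    """Return (name, count) pairs for names that occur more than once."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets if b["doc_count"] > 1]


# Hypothetical response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "distinct_sample": {
            "buckets": [
                {"key": "sampleA.fastq.gz", "doc_count": 3},
                {"key": "sampleB.fastq.gz", "doc_count": 1},
            ]
        }
    }
}

print(duplicated_names(sample_response))
```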

Find the largest sample file in MB using aggregation (in 2 steps)

POST sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "largest_sample": {
      "max": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        }
      }
    }
  }
}

Then search for the sample by its file length in bytes, using the maximum found in the first step:

GET sampledb/_doc/_search
{
  "query": {
    "match": {
      "sample.fileLength": 58362878472
    }
  }
}
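The first step reports the maximum in MB, while the second step matches on the length in bytes, so the value has to be converted back. Because the divisor 1048576 is a power of two (2**20), the conversion is exact for realistic file sizes. Here is a minimal Python sketch; the names mb_to_bytes and build_length_lookup are hypothetical.

```python
BYTES_PER_MB = 1048576  # 2**20, the same divisor as params.in_mb above


def mb_to_bytes(size_mb):
    """Convert the MB value returned by the max aggregation back to bytes."""
    return round(size_mb * BYTES_PER_MB)


def build_length_lookup(size_mb):
    """Build the second-step query that matches the file length in bytes."""
    return {"query": {"match": {"sample.fileLength": mb_to_bytes(size_mb)}}}


# Round trip for the file length used in the example above.
assert mb_to_bytes(58362878472 / BYTES_PER_MB) == 58362878472
```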

Find the 3 largest sample files using a query and sorting (in 1 step)

GET sampledb/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "doc['sample.fileLength'].value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        },
        "order": "desc"
      }
    }
  ],
  "size": 3
}
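Dividing by a positive constant does not change the ordering, so the same three documents could also be retrieved by sorting on the field directly. With a field sort, the sort value in the response is in bytes rather than MB. The Python sketch below assumes that sample.fileLength is mapped as a numeric type, which the range and max queries above suggest. The function name build_largest_files_query is hypothetical.

```python
def build_largest_files_query(count=3):
    """Build a query that returns the `count` largest files, sorted by size."""
    return {
        "query": {"match_all": {}},
        # Plain field sort: no script needed when only the order matters.
        "sort": [{"sample.fileLength": {"order": "desc"}}],
        "size": count,
    }
```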

Get the total size of the sample files in GB

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "total_size": {
      "sum": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_gb",
          "params": {
            "in_gb": 1073741824
          }
        }
      }
    }
  }
}

Contributors

fejesa
