nf-sqldb's Introduction

SQL DB plugin for Nextflow

This plugin provides support for interacting with SQL databases in Nextflow scripts.

Several SQL databases are currently supported, including H2, MySQL, SQLite, and DuckDB.

NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.

Getting started

This plugin requires Nextflow 22.08.1-edge or later. You can enable the plugin by adding the following snippet to your nextflow.config file:

plugins {
    id 'nf-sqldb'
}

Support for BigQuery is provided in a separate plugin:

plugins {
    id 'nf-bigquery'
}
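
If needed, you can also pin a specific plugin version by appending it to the plugin id. The version number below is only an example; use the latest release of the plugin:

plugins {
    id 'nf-sqldb@0.5.0'   // example version; pin to the release you have tested
}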

Configuration

You can configure any number of databases under the sql.db configuration scope. For example:

sql {
    db {
        foo {
            url = 'jdbc:mysql://localhost:3306/demo'
            user = 'my-user'
            password = 'my-password'
        }
    }
}

The above example defines a database named foo that connects to a MySQL server running locally on port 3306, using the demo schema, with my-user and my-password as credentials.

The following options are available:

sql.db.'<DB-NAME>'.url : The database connection URL based on the JDBC standard.

sql.db.'<DB-NAME>'.driver : The database driver class name (optional).

sql.db.'<DB-NAME>'.user : The database connection user name.

sql.db.'<DB-NAME>'.password : The database connection password.
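
For example, a file-based SQLite connection could be configured as in the sketch below. The database handle name and file path are illustrative, and the driver class can normally be inferred from the URL, so it is shown here only to demonstrate the option (this assumes the SQLite JDBC driver is available to the plugin):

sql {
    db {
        bar {
            // hypothetical handle pointing to a local SQLite file
            url    = 'jdbc:sqlite:data/my-project.db'
            // optional: usually inferred from the URL
            driver = 'org.sqlite.JDBC'
        }
    }
}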

Dataflow Operators

This plugin provides the following dataflow operators for querying from and inserting into database tables.

fromQuery

The fromQuery factory method queries a SQL database and creates a channel that emits a tuple for each row in the corresponding result set. For example:

include { fromQuery } from 'plugin/nf-sqldb'

channel.fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo').view()

The following options are available:

db : The database handle. It must be defined under sql.db in the Nextflow configuration.

batchSize : Query the data in batches of the given size. This option is recommended for queries that may return a large result set, so that the entire result set is not loaded into memory at once. NOTE: this feature requires that the underlying SQL database supports LIMIT and OFFSET.

emitColumns : When true, the column names in the SELECT statement are emitted as the first tuple in the resulting channel.
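
For example, the following sketch streams a potentially large table in batches and emits the column names first (the query and table reuse the earlier example; the batch size is illustrative):

include { fromQuery } from 'plugin/nf-sqldb'

channel
    .fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo', batchSize: 100, emitColumns: true)
    .view()   // the first item emitted is the list of column names, followed by one tuple per row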

sqlInsert

The sqlInsert operator collects the items in a source channel and inserts them into a SQL database. For example:

include { sqlInsert } from 'plugin/nf-sqldb'

channel
    .of('Hello','world!')
    .map { it -> tuple(it, it.length()) }
    .sqlInsert( into: 'SAMPLE', columns: 'NAME, LEN', db: 'foo' )

The above example executes the following SQL statements against the database foo (as defined in the Nextflow configuration):

INSERT INTO SAMPLE (NAME, LEN) VALUES ('Hello', 5);
INSERT INTO SAMPLE (NAME, LEN) VALUES ('world!', 6);

NOTE: the target table (e.g. SAMPLE in the above example) must be created beforehand.

The following options are available:

db : The database handle. It must be defined under sql.db in the Nextflow configuration.

into : The target table for inserting the data.

columns : The database table column names to be filled with the channel data. The order and cardinality of the column names must match the tuple values emitted by the channel. The columns can be specified as a list or as a string of comma-separated values.

statement : The SQL INSERT statement to execute, using ? as a placeholder for the actual values, for example: insert into SAMPLE(X,Y) values (?,?). The into and columns options are ignored when this option is provided.

batchSize : Insert the data in batches of the given size (default: 10).

setup : A SQL statement that is executed before inserting the data, e.g. to create the target table. NOTE: the underlying database should support the CREATE TABLE IF NOT EXISTS idiom, as the plugin will execute this statement every time the script is run.
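
For example, the following sketch inserts the same tuples as the earlier example using a custom statement together with a setup statement that creates the target table on first run. The column types are illustrative; when statement is provided, the into and columns options are not needed:

include { sqlInsert } from 'plugin/nf-sqldb'

channel
    .of('Hello','world!')
    .map { it -> tuple(it, it.length()) }
    .sqlInsert(
        db: 'foo',
        setup: 'create table if not exists SAMPLE(NAME varchar(80), LEN int)',
        statement: 'insert into SAMPLE(NAME, LEN) values (?,?)'
    )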

Querying CSV files

This plugin supports the H2 database engine, which can query CSV files as if they were database tables using SQL statements.

For example, create a CSV file using the snippet below:

cat <<EOF > test.csv
foo,bar
1,hello
2,ciao
3,hola
4,bonjour
EOF

Then query it in a Nextflow script:

include { fromQuery } from 'plugin/nf-sqldb'

channel
    .fromQuery("SELECT * FROM CSVREAD('test.csv') where foo>=2;")
    .view()

The CSVREAD function provided by the H2 database engine allows you to query any CSV file in your filesystem. As shown in the example, you can use standard SQL clauses like SELECT and WHERE to define your query.
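
Each row of the result set is emitted as a tuple of values, so it can be reshaped downstream with ordinary channel operators. A minimal sketch, reusing the test.csv file created above (the output format is arbitrary):

include { fromQuery } from 'plugin/nf-sqldb'

channel
    .fromQuery("SELECT * FROM CSVREAD('test.csv') WHERE foo >= 2;")
    .map { row -> "foo=${row[0]}, bar=${row[1]}" }   // each row arrives as a tuple/list of values
    .view()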

Caveats

Like all dataflow operators in Nextflow, the operators provided by this plugin are executed asynchronously.

In particular, data inserted using the sqlInsert operator is not guaranteed to be available to any subsequent queries using the fromQuery operator, as it is not possible to make a channel factory operation dependent on some upstream operation.
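
For example, a pattern like the following sketch should not be relied upon, because the fromQuery call may run before the sqlInsert has finished (it reuses the SAMPLE table and foo database from the examples above):

include { fromQuery; sqlInsert } from 'plugin/nf-sqldb'

channel
    .of('Hello','world!')
    .map { it -> tuple(it, it.length()) }
    .sqlInsert( into: 'SAMPLE', columns: 'NAME, LEN', db: 'foo' )

// the query below may execute before (or while) the insert above is still running,
// so it is not guaranteed to see the newly inserted rows
channel.fromQuery('select NAME, LEN from SAMPLE', db: 'foo').view()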

nf-sqldb's People

Contributors

abhi18av, bentsherman, evanfloden, jorgeaguileraseqera, pditommaso


nf-sqldb's Issues

What's the best way to handle "database is locked"?

Hey, neato plugin. I may have missed it, but I was wondering how I should deal with rapid writes to a database that fail intermittently with:

ERROR ~ [SQLITE_BUSY] The database file is locked (database is locked)

Is there a feature I missed that can just wait and retry several times? Is that needed/appropriate? Am I using it wrong?

I very much may have just read past it, sorry for the hassle if that's the case. I'm known as a space cadet 👾

I assume I can't handle this in the SQL create/insert templates.

Application note FYI

I'm trying to stuff records from hundreds of processes into one SQLite database to gather them. Right now I can just relaunch it, and that usually fixes it, but eventually I want this thing to be hands-off.

What next

I like this plugin, but I'm also wondering if I could alternatively whip it up in a subworkflow pointing to an absolute path (from a parameter), set up the waiting logic there, and bypass using this plugin.

Alternatively, if you think this is something that should be coded up, I'm interested in learning more about plugins.
(I haven't listened to the podcast episode yet, but I hear they're all the rage 😉)
I don't know Java/Groovy except in making Nextflow do weird things like loops, but I'd be interested in doing a pull request - if y'all think it is useful (and if it isn't already in the documentation as a feature!!!).

If it were wanted, I'm thinking it would be a retry: parameter that sets up a loop of attempts, breaking once the insert succeeds.

SQLite

It is not clear from the documentation how one would connect to a SQLite database. For example, if I have a SQLite database in the main project directory and need to create various channels based on queries of that database.

DuckDB sqlInsert doesn't work

Example:

include { sqlInsert } from 'plugin/nf-sqldb'
// -- create a table
// CREATE TABLE ducks AS SELECT 3 AS age, 'mandarin' AS breed;
channel
    .of(tuple('3', "mandarin"))
    .sqlInsert( into: 'ducks',
               columns: 'age, breed',
               db: 'playground',
               setup: "CREATE TABLE ducks(age INTEGER, breed VARCHAR);")

Main error via @bentsherman

java.sql.SQLFeatureNotSupportedException: null
    at org.duckdb.DuckDBPreparedStatement.addBatch(DuckDBPreparedStatement.java:606)
    at nextflow.sql.InsertHandler.executeStm(InsertHandler.groovy:189)
    at nextflow.sql.InsertHandler.performAsTuple(InsertHandler.groovy:168)
    at nextflow.sql.InsertHandler.perform(InsertHandler.groovy:144)
    at nextflow.sql.ChannelSqlExtension$_sqlInsert_closure3.doCall(ChannelSqlExtension.groovy:123)

nextflow.log

Then I talked to @abhi18av, who tracked down that this line is the issue.

He then dove into the DuckDB JDBC driver and found the feature isn't supported 🙃

We thought that falling back to the old method without batching might be a quick fix. However, we also decided that running a ton of queries on the head node is an anti-pattern.

Regardless, good things come to those who wait: there's a PR to add batch functionality!

Request to add optional argument to prevent appending semicolon to query statement

I am trying to use nf-sqldb to access a local Trino database using the code below, with a working db configuration named trino_db.

include { fromQuery } from 'plugin/nf-sqldb'
query = "select * from sample_table"
ch = channel.fromQuery(query, db: 'trino_db')

The query returns an error:

java.sql.SQLException: Query failed (#20230704_090934_00018_qtmi2): line 1:86: mismatched input ';'. Expecting: <EOF>

It turns out the fromQuery function automatically adds a trailing semicolon when the original query does not already have one. See https://github.com/nextflow-io/nf-sqldb/blob/df3976d6d71ccc3cd4e4995d64da305fa40bfd7f/plugins/nf-sqldb/src/main/nextflow/sql/QueryHandler.groovy#L131C2-L131C2

Some databases/engines like Trino do not like trailing semicolons. If there were a way to work around this problem, such as adding an optional argument like trailingSemicolon: false, e.g. channel.fromQuery(query, db: 'trino_db', trailingSemicolon: false), it would be greatly appreciated. Thanks!
