tolbertam / sstable-tools
Tools for parsing, creating and doing other fun stuff with sstables
License: Apache License 2.0
List the largest partitions; perhaps include largest by disk space used and by cell count.
java -jar sstable-tools top
Aggregations (like count) may only operate over the current page, so if you have paging enabled you can get incorrect results.
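A minimal stand-in sketch of the pitfall (the page structure here is hypothetical, not sstable-tools code): counting only the rows in the current page undercounts relative to iterating every page.

```java
import java.util.List;

// Each inner list stands in for one page of a paged result set.
public class PagedCount {
    // Buggy behavior: only the page currently in hand is counted.
    static int countFirstPageOnly(List<List<String>> pages) {
        return pages.isEmpty() ? 0 : pages.get(0).size();
    }

    // Correct behavior: the aggregation must drain every page.
    static int countAllPages(List<List<String>> pages) {
        return pages.stream().mapToInt(List::size).sum();
    }
}
```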
Not really sstable related, but with the 3.0 changes we can't view them as easily as before.
It looks like the limit support does not consider the query criteria, e.g.:
cqlsh> select * from sstable where ticker='YHOO' limit 5;
┌─────────┬─────────────────────┬────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ticker │date │adj_close │close │high │low │open │volume │
╞═════════╪═════════════════════╪════════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ORCL │2016-02-19 00:00-0600│36.779999 │36.779999│36.790001│36.419998│36.52 │13118400 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-18 00:00-0600│36.630001 │36.630001│36.869999│36.400002│36.709999│12464800 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-17 00:00-0600│36.630001 │36.630001│36.77 │35.970001│35.970001│13146600 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-16 00:00-0600│35.700001 │35.700001│35.91 │35.419998│35.759998│18685400 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-12 00:00-0600│35.540001 │35.540001│35.549999│34.91 │35.240002│15806800 │
└─────────┴─────────────────────┴────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
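One plausible shape of this bug, sketched with plain Java streams as a stand-in for the query engine (hypothetical helper names, not the tool's actual code): if the limit is applied before the WHERE predicate, you get the first N rows of the sstable rather than the first N matching rows, which is consistent with ORCL rows coming back for `ticker='YHOO'`.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LimitOrder {
    // Buggy ordering: limit runs before the filter, so the filter may
    // see no matching rows at all.
    static List<String> limitThenFilter(List<String> tickers, String want, int limit) {
        return tickers.stream().limit(limit).filter(want::equals)
                .collect(Collectors.toList());
    }

    // Correct ordering: filter by the query criteria first, then limit.
    static List<String> filterThenLimit(List<String> tickers, String want, int limit) {
        return tickers.stream().filter(want::equals).limit(limit)
                .collect(Collectors.toList());
    }
}
```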
Now that 3.0.4 and 3.4 are released, we should remove the 'toJson' command, as sstabledump covers this functionality.
See if we can get enough information from the sstable (i.e. the SerializationHeader in Statistics.db) to build the CFMetaData, so we don't need CQL create statements/thrift/reading system tables/etc.
Range, Row and Cell Tombstone tests are needed.
As discussed in #60, the estimated tombstone drop times do not account for gc_grace_seconds,
because that information is not available in sstable files (see CASSANDRA-12208). It would be nice to add an option to describe that allows the user to pass in gc_grace_seconds,
so that the drop times can be offset by it.
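The requested offset is simple arithmetic; a minimal sketch (hypothetical helper, not the tool's code) of what the describe option would do with a user-supplied gc_grace_seconds:

```java
public class DropTimeOffset {
    // A tombstone only becomes droppable gc_grace_seconds after its local
    // deletion time, so the estimate shifts forward by that amount. The
    // sstable itself stores only the deletion time (CASSANDRA-12208).
    static long effectiveDropTime(long localDeletionTimeSeconds, long gcGraceSeconds) {
        return localDeletionTimeSeconds + gcGraceSeconds;
    }
}
```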
Hi,
Can this support C* 2.1? Many users are still on C* 2.x.
Thanks
Recently I found that it's pretty easy to upload the artifacts to a tagged GitHub release as part of the maven-release-plugin. We should do that for sstable-tools to automate the process of making a release.
It would also be nice to have an alternative to the select command that behaves like a limited version of cqlsh:
Usage: cqlsh sstable [sstable...] [-s schema] [-f file]
Options:
-s, --schema=SCHEMA   The CQL schema to use for the given sstable. If not provided,
                      query criteria are limited to select * with no where clause.
-f, --file=FILE Execute commands from FILE, then exit
I think this could use the ascii table transformer as proposed in #26.
java -jar ~/sstable-tools-3.9.0-alpha9.jar cqlsh
cqlsh> describe sstables
Partitions: 84792
Rows: 84792
Tombstones: 0
Cells: 1086408
Tombstones returns a value of 0.
When I run sstabledump -d | grep 'deletedAt'
I get a very large number of rows returned.
Re-enable the logger, but have a config setup to hide things by default. Move the System.err.println calls to log.error.
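A sketch of the intent using JDK logging as a stand-in (the project may use a different backend; names here are hypothetical): keep the logger enabled but default it to a quiet level, and route errors through it instead of System.err.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietLogging {
    static final Logger LOG = Logger.getLogger("sstable-tools");

    // Hide INFO/FINE chatter by default; a config flag could lower this.
    static void configureDefaults() {
        LOG.setLevel(Level.WARNING);
    }

    // Replacement for the scattered System.err.println calls: errors go
    // through the logger, keeping the stack trace attached.
    static void reportError(String msg, Throwable t) {
        LOG.log(Level.SEVERE, msg, t);
    }
}
```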
Given two directories, each representing a node (i.e. mounted S3 backups), walk through the sstables and run a repair, creating a new sstable to drop into each node's data dir (then nodetool refresh) to make them consistent. The idea is that this could be run as an EMR job or on a random node, so repairs won't have any CPU/IO impact on the cluster. Since there is no need to worry about throttling, it can be made much faster as well.
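A very rough sketch of only the first step under simplifying assumptions (pairing files by name alone; real logic would compare content and token ranges, and the merge itself needs Cassandra's sstable readers/writers): walk the two node directories and find Data.db files present in one but not the other.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SSTableDiff {
    // Collect the names of all Data.db components under a node's backup dir.
    static Set<String> dataFiles(Path dir) throws IOException {
        try (Stream<Path> s = Files.walk(dir)) {
            return s.map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith("-Data.db"))
                    .collect(Collectors.toSet());
        }
    }

    // Naive first pass: sstables named in dir a but missing from dir b.
    static Set<String> onlyInFirst(Path a, Path b) throws IOException {
        Set<String> diff = new TreeSet<>(dataFiles(a));
        diff.removeAll(dataFiles(b));
        return diff;
    }
}
```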
UDTs currently don't work as there isn't a means to specify their schema.
When running with assertions enabled and no schema provided via '-c', toJson will fail while serializing a partition key, because it expects the number of key components to match the number of partition columns in the metadata (which are empty):
Exception in thread "main" java.lang.AssertionError
at com.csforge.sstable.JsonTransformer.serializePartitionKey(JsonTransformer.java:83)
at com.csforge.sstable.JsonTransformer.serializePartition(JsonTransformer.java:149)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at com.csforge.sstable.JsonTransformer.toJson(JsonTransformer.java:58)
at com.csforge.sstable.SSTable2Json.main(SSTable2Json.java:109)
at com.csforge.sstable.Driver.main(Driver.java:17)
If there's a composite partition key and the column family metadata is built from the sstable, it will fail to render as a table when walking through partition keys, since the single composite partition key does not match up with the broken-up values in the ResultSet.
There were a couple of API changes in C* 3.4 which make sstable-tools incompatible with it:
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/atolbert/Documents/Projects/sstable-tools/src/main/java/com/csforge/sstable/reader/CassandraReader.java:[53,39] no suitable method found for iterator(org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean)
method org.apache.cassandra.io.sstable.format.SSTableReader.iterator(org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.Slices,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean) is not applicable
(actual and formal argument lists differ in length)
method org.apache.cassandra.io.sstable.format.SSTableReader.iterator(org.apache.cassandra.io.util.FileDataInput,org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.RowIndexEntry,org.apache.cassandra.db.Slices,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean) is not applicable
(actual and formal argument lists differ in length)
[ERROR] /Users/atolbert/Documents/Projects/sstable-tools/src/main/java/com/csforge/sstable/reader/CassandraReader.java:[54,20] method map in interface java.util.stream.Stream<T> cannot be applied to given types;
required: java.util.function.Function<? super java.lang.Object,? extends R>
found: Partition::new
reason: cannot infer type-variable(s) R
(argument mismatch; invalid constructor reference
incompatible types: java.lang.Object cannot be converted to org.apache.cassandra.db.rows.UnfilteredRowIterator)
I had fixed this locally and will push a fix for this later this week. The challenge will be making this compatible with both versions, or do we configure publishing 2 separate branches? There is probably some black magic where we can achieve this through reflection, although at some point we will probably reach a threshold where we will need to provide per-release versions.
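The "black magic" would look roughly like this (illustrative pattern only, shown probing `String.substring` rather than Cassandra's classes, which aren't on the classpath here): try each known overload signature in order and call whichever one this C* version actually has.

```java
import java.lang.reflect.Method;

public class CompatShim {
    // Probe candidate signatures in order; the first one present in the
    // loaded class wins, letting one jar span incompatible C* releases.
    static Method findOverload(Class<?> cls, String name, Class<?>[][] candidateSignatures) {
        for (Class<?>[] sig : candidateSignatures) {
            try {
                return cls.getMethod(name, sig);
            } catch (NoSuchMethodException ignored) {
                // This signature doesn't exist in this version; try the next.
            }
        }
        throw new IllegalStateException("no known overload of " + name + " found");
    }
}
```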
Can't seem to get it to work
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ pwd
/home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ dse -v
5.0.2
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ ls -la
total 19320
drwxrwxr-x 2 automaton automaton 4096 Aug 28 17:18 .
drwxrwxr-x 3 automaton automaton 4096 Aug 28 17:17 ..
-rw-r--r-- 1 automaton automaton 9307 Aug 27 19:20 mc-7074-big-CompressionInfo.db
-rw-r--r-- 1 automaton automaton 16637579 Aug 27 19:20 mc-7074-big-Data.db
-rw-r--r-- 1 automaton automaton 2744 Aug 27 19:20 mc-7074-big-Filter.db
-rw-r--r-- 1 automaton automaton 140381 Aug 27 19:20 mc-7074-big-Index.db
-rw-r--r-- 1 automaton automaton 11047 Aug 27 19:20 mc-7074-big-Statistics.db
-rw-r--r-- 1 automaton automaton 588 Aug 27 19:20 mc-7074-big-Summary.db
-rw-r--r-- 1 automaton automaton 2587 Aug 27 20:58 mc-7075-big-CompressionInfo.db
-rw-r--r-- 1 automaton automaton 2867725 Aug 27 20:58 mc-7075-big-Data.db
-rw-r--r-- 1 automaton automaton 2368 Aug 27 20:58 mc-7075-big-Filter.db
-rw-r--r-- 1 automaton automaton 63530 Aug 27 20:58 mc-7075-big-Index.db
-rw-r--r-- 1 automaton automaton 10144 Aug 27 20:58 mc-7075-big-Statistics.db
-rw-r--r-- 1 automaton automaton 501 Aug 27 20:58 mc-7075-big-Summary.db
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ java -jar $HOME/36134/sstable-tools-3.11.0-alpha11.jar cqlsh
cqlsh> use mc-7075-big-Data.db
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7075-big-Data.db
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J
at org.apache.cassandra.utils.NativeLibraryLinux.getpid(Native Method)
at org.apache.cassandra.utils.NativeLibraryLinux.callGetpid(NativeLibraryLinux.java:122)
at org.apache.cassandra.utils.NativeLibrary.getProcessID(NativeLibrary.java:394)
at org.apache.cassandra.utils.UUIDGen.hash(UUIDGen.java:388)
at org.apache.cassandra.utils.UUIDGen.makeNode(UUIDGen.java:367)
at org.apache.cassandra.utils.UUIDGen.makeClockSeqAndNode(UUIDGen.java:300)
at org.apache.cassandra.utils.UUIDGen.<clinit>(UUIDGen.java:41)
at org.apache.cassandra.config.CFMetaData$Builder.build(CFMetaData.java:1293)
at com.csforge.sstable.CassandraUtils.tableFromSSTable(CassandraUtils.java:263)
at com.csforge.sstable.CassandraUtils.tableFromBestSource(CassandraUtils.java:99)
at com.csforge.sstable.Cqlsh.doUse(Cqlsh.java:302)
at com.csforge.sstable.Cqlsh.evalLine(Cqlsh.java:615)
at com.csforge.sstable.Cqlsh.startShell(Cqlsh.java:252)
at com.csforge.sstable.Cqlsh.main(Cqlsh.java:762)
at com.csforge.sstable.Driver.main(Driver.java:22)
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ java -jar $HOME/36134/sstable-tools-3.11.0-alpha11.jar cqlsh
cqlsh> use /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7074-big-Data.db
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7075-big-Data.db
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J
at org.apache.cassandra.utils.NativeLibraryLinux.getpid(Native Method)
at org.apache.cassandra.utils.NativeLibraryLinux.callGetpid(NativeLibraryLinux.java:122)
at org.apache.cassandra.utils.NativeLibrary.getProcessID(NativeLibrary.java:394)
at org.apache.cassandra.utils.UUIDGen.hash(UUIDGen.java:388)
at org.apache.cassandra.utils.UUIDGen.makeNode(UUIDGen.java:367)
at org.apache.cassandra.utils.UUIDGen.makeClockSeqAndNode(UUIDGen.java:300)
at org.apache.cassandra.utils.UUIDGen.<clinit>(UUIDGen.java:41)
at org.apache.cassandra.config.CFMetaData$Builder.build(CFMetaData.java:1293)
at com.csforge.sstable.CassandraUtils.tableFromSSTable(CassandraUtils.java:263)
at com.csforge.sstable.CassandraUtils.tableFromBestSource(CassandraUtils.java:99)
at com.csforge.sstable.Cqlsh.doUse(Cqlsh.java:302)
at com.csforge.sstable.Cqlsh.evalLine(Cqlsh.java:615)
at com.csforge.sstable.Cqlsh.startShell(Cqlsh.java:252)
at com.csforge.sstable.Cqlsh.main(Cqlsh.java:762)
at com.csforge.sstable.Driver.main(Driver.java:22)
The paging logic in cqlsh reopens the SSTable(s) on each page. This is wasteful; we could just keep track of where we left off in the UnfilteredPartitionIterator and the current RowIterator.
integration tests, create a cluster at version, run sstable2json and verify output
The deserialization occurs before CassandraUtils can set ClientMode, which throws an exception since the DatabaseDescriptor calls its loadYaml.
CASSANDRA-11483 updates the sstablemetadata command to work like describe, so we should consider removing it whenever we make a release that targets 4.0.
I'm trying to import a simple schema into cqlsh but I'm getting the error:
cqlsh> schema /home/pedro/software/schema.cql
Could not import schema from '/home/pedro/software/schema.cql': line 4:0 mismatched input 'CREATE' expecting EOF (... AND durable_writes = true;[CREATE]...).
I generated this schema file with cqlsh -e "describe schema". I'm attaching it to this issue.
schema.txt
There are several places in the Cqlsh code that print only the exception message. Since weird errors can happen in C*, we should log the full stack trace.
If the -c option is not provided, attempt to connect to the cluster (using defaults unless overridden) and retrieve the table's schema. Not necessary if #7 is successful.
Be able to execute a query on an sstable or set of sstables like:
java -jar sstable-tools.jar select from [sstable or folder containing sstables] where key = 1 and value > 10 with [file containing schema or an ip address of cluster to pull schema from]
Must be safe to run while C* is currently running, and have no side effects (i.e. not running an embedded Cassandra).
In this line, the method get returns a null, which leads to an NPE.
org.apache.cassandra.utils.concurrent.Ref$State@6b25a2cc) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1601791084:/Users/atolbert/Documents/Projects/sstable-tools/ma-1-big was not released before the reference was garbage collected.
Seems to happen for each SSTable opened, but only with mvn exec:java. I can't seem to reproduce using an executable jar.
When using 'USE', autocompletion works on ~/, but it doesn't actually do the expansion when evaluating the path:
cqlsh> use ~/.ccm/1104/node1/data0/g1/hello_p-70e80041ea3f11e59ba04d93b372bbee/
Cannot find '/Users/atolbert/Downloads/~/.ccm/1104/node1/data0/g1/hello_p-70e80041ea3f11e59ba04d93b372bbee/'.
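The missing step is a small one; a sketch (hypothetical helper, not the tool's code) of expanding a leading "~" against user.home before resolving the path, instead of treating it as a literal directory name:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TildeExpand {
    // Expand "~" and "~/..." against the JVM's user.home; leave every
    // other path untouched. "~user" forms are not handled here.
    static Path expand(String input) {
        String home = System.getProperty("user.home");
        if (input.equals("~")) {
            return Paths.get(home);
        }
        if (input.startsWith("~/")) {
            return Paths.get(home).resolve(input.substring(2));
        }
        return Paths.get(input);
    }
}
```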
Reading the entire SSTable into memory when analyzing the output of sstable2json is rather cumbersome, and sometimes not feasible given the size of the resulting JSON.
Having a command line option that would output one partition object per line (or by some other delimiter), would solve this by allowing a user to load one partition at a time into memory. While partition sizes can get rather large, they will not be as large as the SSTable itself.
This output would then have no need for the beginning and ending array brackets, nor the trailing comma after each of the partitions.
I can also see a possible performance benefit here as you can read the SSTable and output each partition as you read it, rather than reading the entire SSTable in all at once.
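A toy sketch of the proposed format (names hypothetical, and using strings in place of real partition serialization): one JSON object per line, no surrounding array, so a consumer can stream the dump one partition at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.List;

public class NdjsonDump {
    // Emit one partition's JSON object per line, with no array brackets
    // and no trailing commas between partitions.
    static String dump(List<String> partitionJsons) {
        StringBuilder sb = new StringBuilder();
        for (String p : partitionJsons) {
            sb.append(p).append('\n');
        }
        return sb.toString();
    }

    // A consumer can then hold at most one partition in memory at a time.
    static int countPartitions(Reader in) throws IOException {
        int n = 0;
        try (BufferedReader r = new BufferedReader(in)) {
            while (r.readLine() != null) {
                n++;
            }
        }
        return n;
    }
}
```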
allow merging sstables offline
Print pretty ascii tables like cqlsh; would be nice with #22. Could have another command to dump it (toTable) and also add an option to switch the transformer to use on the select command.
When providing sstables, allow globs instead of only allowing 1 sstable. Useful to pull in all results from a data directory, e.g.:
sstable-tools select count(*) from data/keyspace/table/ma-*-Data.db where user = 1
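The JDK already understands this glob syntax, so a sketch of the matching step (hypothetical helper, not the tool's code) needs no extra dependencies:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobMatch {
    // True when fileName matches a shell-style glob such as "ma-*-Data.db".
    static boolean matches(String glob, String fileName) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return m.matches(Paths.get(fileName));
    }
}
```

Expanding the argument would then be a directory listing filtered through this matcher.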