tolbertam / sstable-tools
Tools for parsing, creating and doing other fun stuff with sstables
License: Apache License 2.0
List the largest partitions; perhaps include largest by disk space used and by cell count.
java -jar sstable-tools top
Aggregations (like count) may only operate over the current page, so if you have paging enabled you can get incorrect results.
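A minimal stand-in sketch of the pitfall (the page structure here is hypothetical, not sstable-tools code): counting only the rows in the current page undercounts relative to iterating every page.

```java
import java.util.List;

// Each inner list stands in for one page of a paged result set.
public class PagedCount {
    // Buggy behavior: only the page currently in hand is counted.
    static int countFirstPageOnly(List<List<String>> pages) {
        return pages.isEmpty() ? 0 : pages.get(0).size();
    }

    // Correct behavior: the aggregation must drain every page.
    static int countAllPages(List<List<String>> pages) {
        return pages.stream().mapToInt(List::size).sum();
    }
}
```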
Not really sstable related, but with the 3.0 changes we can't view them as easily as before.
It looks like the limit support does not consider the query criteria, e.g.:
cqlsh> select * from sstable where ticker='YHOO' limit 5;
┌─────────┬─────────────────────┬────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ticker │date │adj_close │close │high │low │open │volume │
╞═════════╪═════════════════════╪════════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ORCL │2016-02-19 00:00-0600│36.779999 │36.779999│36.790001│36.419998│36.52 │13118400 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-18 00:00-0600│36.630001 │36.630001│36.869999│36.400002│36.709999│12464800 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-17 00:00-0600│36.630001 │36.630001│36.77 │35.970001│35.970001│13146600 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-16 00:00-0600│35.700001 │35.700001│35.91 │35.419998│35.759998│18685400 │
├─────────┼─────────────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ORCL │2016-02-12 00:00-0600│35.540001 │35.540001│35.549999│34.91 │35.240002│15806800 │
└─────────┴─────────────────────┴────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
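One plausible shape of this bug, sketched with plain Java streams as a stand-in for the query engine (hypothetical helper names, not the tool's actual code): if the limit is applied before the WHERE predicate, you get the first N rows of the sstable rather than the first N matching rows, which is consistent with ORCL rows coming back for `ticker='YHOO'`.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LimitOrder {
    // Buggy ordering: limit runs before the filter, so the filter may
    // see no matching rows at all.
    static List<String> limitThenFilter(List<String> tickers, String want, int limit) {
        return tickers.stream().limit(limit).filter(want::equals)
                .collect(Collectors.toList());
    }

    // Correct ordering: filter by the query criteria first, then limit.
    static List<String> filterThenLimit(List<String> tickers, String want, int limit) {
        return tickers.stream().filter(want::equals).limit(limit)
                .collect(Collectors.toList());
    }
}
```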
Now that 3.0.4 and 3.4 are released, we should remove the 'toJson' command, as sstabledump covers this functionality.
See if we can get enough information from the sstable (i.e. the SerializationHeader in Statistics.db) to build the CFMetaData, so we don't need CQL create statements/thrift/reading system tables/etc.
Range, Row and Cell Tombstone tests are needed.
As discussed in #60, the estimated tombstone drop times do not account for gc_grace_seconds,
because that information is not available in sstable files (see CASSANDRA-12208). It would be nice to add an option to describe that allows the user to pass in gc_grace_seconds,
so that the drop times can be offset by it.
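The requested offset is simple arithmetic; a minimal sketch (hypothetical helper, not the tool's code) of what the describe option would do with a user-supplied gc_grace_seconds:

```java
public class DropTimeOffset {
    // A tombstone only becomes droppable gc_grace_seconds after its local
    // deletion time, so the estimate shifts forward by that amount. The
    // sstable itself stores only the deletion time (CASSANDRA-12208).
    static long effectiveDropTime(long localDeletionTimeSeconds, long gcGraceSeconds) {
        return localDeletionTimeSeconds + gcGraceSeconds;
    }
}
```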
Hi,
Can this support C* 2.1? Many users are still on C* 2.x.
Thanks
Recently I found that it's pretty easy to upload the artifacts to a tagged GitHub release as part of the maven-release-plugin. We should do that for sstable-tools to automate the process of making a release.
It would also be nice to have an alternative to the select command that behaves like a limited version of cqlsh:
Usage: cqlsh sstable [sstable...] [-s schema] [-f file]
Options:
-s, --schema=SCHEMA   The CQL schema to use for the given sstable. If not provided,
                      query criteria are limited to select * with no where clause.
-f, --file=FILE Execute commands from FILE, then exit
I think this could use the ascii table transformer as proposed in #26.
java -jar ~/sstable-tools-3.9.0-alpha9.jar cqlsh
cqlsh> describe sstables
Partitions: 84792
Rows: 84792
Tombstones: 0
Cells: 1086408
Tombstones returns a value of 0.
When I run sstabledump -d | grep 'deletedAt'
I get a very large number of rows returned.
Re-enable the logger, but have a config setup to hide things by default. Move the System.err.println calls to log.error.
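A sketch of the intent using JDK logging as a stand-in (the project may use a different backend; names here are hypothetical): keep the logger enabled but default it to a quiet level, and route errors through it instead of System.err.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietLogging {
    static final Logger LOG = Logger.getLogger("sstable-tools");

    // Hide INFO/FINE chatter by default; a config flag could lower this.
    static void configureDefaults() {
        LOG.setLevel(Level.WARNING);
    }

    // Replacement for the scattered System.err.println calls: errors go
    // through the logger, keeping the stack trace attached.
    static void reportError(String msg, Throwable t) {
        LOG.log(Level.SEVERE, msg, t);
    }
}
```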
Given two directories, each representing a node (i.e. mounted S3 backups), walk through the sstables and run a repair, creating a new sstable to drop into each node's data dir (then nodetool refresh) to make them consistent. The idea is that this could be run as an EMR job or on a random node, so repairs won't have any CPU/IO impact on the cluster. Since there is no need to worry about throttling, it can be made much faster as well.
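A very rough sketch of only the first step under simplifying assumptions (pairing files by name alone; real logic would compare content and token ranges, and the merge itself needs Cassandra's sstable readers/writers): walk the two node directories and find Data.db files present in one but not the other.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SSTableDiff {
    // Collect the names of all Data.db components under a node's backup dir.
    static Set<String> dataFiles(Path dir) throws IOException {
        try (Stream<Path> s = Files.walk(dir)) {
            return s.map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith("-Data.db"))
                    .collect(Collectors.toSet());
        }
    }

    // Naive first pass: sstables named in dir a but missing from dir b.
    static Set<String> onlyInFirst(Path a, Path b) throws IOException {
        Set<String> diff = new TreeSet<>(dataFiles(a));
        diff.removeAll(dataFiles(b));
        return diff;
    }
}
```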
UDTs currently don't work as there isn't a means to specify their schema.
When running with assertions enabled and no schema provided via '-c', toJson will fail while serializing a partition key, because it expects the number of key components to match the number of partition columns in the metadata (which are empty):
Exception in thread "main" java.lang.AssertionError
at com.csforge.sstable.JsonTransformer.serializePartitionKey(JsonTransformer.java:83)
at com.csforge.sstable.JsonTransformer.serializePartition(JsonTransformer.java:149)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at com.csforge.sstable.JsonTransformer.toJson(JsonTransformer.java:58)
at com.csforge.sstable.SSTable2Json.main(SSTable2Json.java:109)
at com.csforge.sstable.Driver.main(Driver.java:17)
If there's a composite partition key and the column family metadata is built from the sstable, it will fail to render as a table when walking through partition keys, since the single composite partition key does not match up with the broken-up values in the ResultSet.
There were a couple of API changes in C* 3.4 which make sstable-tools incompatible with it:
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/atolbert/Documents/Projects/sstable-tools/src/main/java/com/csforge/sstable/reader/CassandraReader.java:[53,39] no suitable method found for iterator(org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean)
method org.apache.cassandra.io.sstable.format.SSTableReader.iterator(org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.Slices,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean) is not applicable
(actual and formal argument lists differ in length)
method org.apache.cassandra.io.sstable.format.SSTableReader.iterator(org.apache.cassandra.io.util.FileDataInput,org.apache.cassandra.db.DecoratedKey,org.apache.cassandra.db.RowIndexEntry,org.apache.cassandra.db.Slices,org.apache.cassandra.db.filter.ColumnFilter,boolean,boolean) is not applicable
(actual and formal argument lists differ in length)
[ERROR] /Users/atolbert/Documents/Projects/sstable-tools/src/main/java/com/csforge/sstable/reader/CassandraReader.java:[54,20] method map in interface java.util.stream.Stream<T> cannot be applied to given types;
required: java.util.function.Function<? super java.lang.Object,? extends R>
found: Partition::new
reason: cannot infer type-variable(s) R
(argument mismatch; invalid constructor reference
incompatible types: java.lang.Object cannot be converted to org.apache.cassandra.db.rows.UnfilteredRowIterator)
I had fixed this locally and will push a fix for this later this week. The challenge will be making this compatible with both versions, or do we configure publishing 2 separate branches? There is probably some black magic where we can achieve this through reflection, although at some point we will probably reach a threshold where we will need to provide per-release versions.
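The "black magic" would look roughly like this (illustrative pattern only, shown probing `String.substring` rather than Cassandra's classes, which aren't on the classpath here): try each known overload signature in order and call whichever one this C* version actually has.

```java
import java.lang.reflect.Method;

public class CompatShim {
    // Probe candidate signatures in order; the first one present in the
    // loaded class wins, letting one jar span incompatible C* releases.
    static Method findOverload(Class<?> cls, String name, Class<?>[][] candidateSignatures) {
        for (Class<?>[] sig : candidateSignatures) {
            try {
                return cls.getMethod(name, sig);
            } catch (NoSuchMethodException ignored) {
                // This signature doesn't exist in this version; try the next.
            }
        }
        throw new IllegalStateException("no known overload of " + name + " found");
    }
}
```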
Can't seem to get it to work
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ pwd
/home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ dse -v
5.0.2
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ ls -la
total 19320
drwxrwxr-x 2 automaton automaton 4096 Aug 28 17:18 .
drwxrwxr-x 3 automaton automaton 4096 Aug 28 17:17 ..
-rw-r--r-- 1 automaton automaton 9307 Aug 27 19:20 mc-7074-big-CompressionInfo.db
-rw-r--r-- 1 automaton automaton 16637579 Aug 27 19:20 mc-7074-big-Data.db
-rw-r--r-- 1 automaton automaton 2744 Aug 27 19:20 mc-7074-big-Filter.db
-rw-r--r-- 1 automaton automaton 140381 Aug 27 19:20 mc-7074-big-Index.db
-rw-r--r-- 1 automaton automaton 11047 Aug 27 19:20 mc-7074-big-Statistics.db
-rw-r--r-- 1 automaton automaton 588 Aug 27 19:20 mc-7074-big-Summary.db
-rw-r--r-- 1 automaton automaton 2587 Aug 27 20:58 mc-7075-big-CompressionInfo.db
-rw-r--r-- 1 automaton automaton 2867725 Aug 27 20:58 mc-7075-big-Data.db
-rw-r--r-- 1 automaton automaton 2368 Aug 27 20:58 mc-7075-big-Filter.db
-rw-r--r-- 1 automaton automaton 63530 Aug 27 20:58 mc-7075-big-Index.db
-rw-r--r-- 1 automaton automaton 10144 Aug 27 20:58 mc-7075-big-Statistics.db
-rw-r--r-- 1 automaton automaton 501 Aug 27 20:58 mc-7075-big-Summary.db
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ java -jar $HOME/36134/sstable-tools-3.11.0-alpha11.jar cqlsh
cqlsh> use mc-7075-big-Data.db
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7075-big-Data.db
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J
at org.apache.cassandra.utils.NativeLibraryLinux.getpid(Native Method)
at org.apache.cassandra.utils.NativeLibraryLinux.callGetpid(NativeLibraryLinux.java:122)
at org.apache.cassandra.utils.NativeLibrary.getProcessID(NativeLibrary.java:394)
at org.apache.cassandra.utils.UUIDGen.hash(UUIDGen.java:388)
at org.apache.cassandra.utils.UUIDGen.makeNode(UUIDGen.java:367)
at org.apache.cassandra.utils.UUIDGen.makeClockSeqAndNode(UUIDGen.java:300)
at org.apache.cassandra.utils.UUIDGen.<clinit>(UUIDGen.java:41)
at org.apache.cassandra.config.CFMetaData$Builder.build(CFMetaData.java:1293)
at com.csforge.sstable.CassandraUtils.tableFromSSTable(CassandraUtils.java:263)
at com.csforge.sstable.CassandraUtils.tableFromBestSource(CassandraUtils.java:99)
at com.csforge.sstable.Cqlsh.doUse(Cqlsh.java:302)
at com.csforge.sstable.Cqlsh.evalLine(Cqlsh.java:615)
at com.csforge.sstable.Cqlsh.startShell(Cqlsh.java:252)
at com.csforge.sstable.Cqlsh.main(Cqlsh.java:762)
at com.csforge.sstable.Driver.main(Driver.java:22)
[automaton@ip-172-31-9-120 device_monitoring_timestamps]$ java -jar $HOME/36134/sstable-tools-3.11.0-alpha11.jar cqlsh
cqlsh> use /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7074-big-Data.db
Using: /home/automaton/36134/sstables_10.246.171.127/health/device_monitoring_timestamps/mc-7075-big-Data.db
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J
at org.apache.cassandra.utils.NativeLibraryLinux.getpid(Native Method)
at org.apache.cassandra.utils.NativeLibraryLinux.callGetpid(NativeLibraryLinux.java:122)
at org.apache.cassandra.utils.NativeLibrary.getProcessID(NativeLibrary.java:394)
at org.apache.cassandra.utils.UUIDGen.hash(UUIDGen.java:388)
at org.apache.cassandra.utils.UUIDGen.makeNode(UUIDGen.java:367)
at org.apache.cassandra.utils.UUIDGen.makeClockSeqAndNode(UUIDGen.java:300)
at org.apache.cassandra.utils.UUIDGen.<clinit>(UUIDGen.java:41)
at org.apache.cassandra.config.CFMetaData$Builder.build(CFMetaData.java:1293)
at com.csforge.sstable.CassandraUtils.tableFromSSTable(CassandraUtils.java:263)
at com.csforge.sstable.CassandraUtils.tableFromBestSource(CassandraUtils.java:99)
at com.csforge.sstable.Cqlsh.doUse(Cqlsh.java:302)
at com.csforge.sstable.Cqlsh.evalLine(Cqlsh.java:615)
at com.csforge.sstable.Cqlsh.startShell(Cqlsh.java:252)
at com.csforge.sstable.Cqlsh.main(Cqlsh.java:762)
at com.csforge.sstable.Driver.main(Driver.java:22)
The paging logic in cqlsh reopens the SSTable(s) on each page. This is wasteful; we could just keep track of where we left off in the UnfilteredPartitionIterator and the current RowIterator.
integration tests, create a cluster at version, run sstable2json and verify output
The deserialization occurs before CassandraUtils can set ClientMode, which throws an exception since the DatabaseDescriptor calls its loadYaml.
CASSANDRA-11483 updates the sstablemetadata command to work like describe, so we should consider removing it whenever we make a release that targets 4.0.
I'm trying to import a simple schema into cqlsh but I'm getting the error:
cqlsh> schema /home/pedro/software/schema.cql
Could not import schema from '/home/pedro/software/schema.cql': line 4:0 mismatched input 'CREATE' expecting EOF (... AND durable_writes = true;[CREATE]...).
I generated this schema file with cqlsh -e "describe schema". I'm attaching it to this issue.
schema.txt
There are several places in the Cqlsh code that print only the exception message. Since weird errors can happen in C*, we should log the full stack trace.
If the -c option is not provided, attempt to connect to the cluster (using defaults unless overridden) and retrieve the table's schema. Not necessary if #7 is successful.
Be able to execute a query on an sstable or set of sstables like:
java -jar sstable-tools.jar select from [sstable or folder containing sstables] where key = 1 and value > 10 with [file containing schema or an ip address of cluster to pull schema from]
Must be safe to run while C* is currently running, and have no side effects (i.e. not running an embedded Cassandra).
In this line, the method get returns a null, which leads to an NPE.
org.apache.cassandra.utils.concurrent.Ref$State@6b25a2cc) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1601791084:/Users/atolbert/Documents/Projects/sstable-tools/ma-1-big was not released before the reference was garbage collected.
Seems to happen for each SSTable opened, but only with mvn exec:java. I can't seem to reproduce using an executable jar.
When using 'USE', autocompletion works on ~/, but it doesn't actually do the expansion when evaluating the path:
cqlsh> use ~/.ccm/1104/node1/data0/g1/hello_p-70e80041ea3f11e59ba04d93b372bbee/
Cannot find '/Users/atolbert/Downloads/~/.ccm/1104/node1/data0/g1/hello_p-70e80041ea3f11e59ba04d93b372bbee/'.
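The missing step is a small one; a sketch (hypothetical helper, not the tool's code) of expanding a leading "~" against user.home before resolving the path, instead of treating it as a literal directory name:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TildeExpand {
    // Expand "~" and "~/..." against the JVM's user.home; leave every
    // other path untouched. "~user" forms are not handled here.
    static Path expand(String input) {
        String home = System.getProperty("user.home");
        if (input.equals("~")) {
            return Paths.get(home);
        }
        if (input.startsWith("~/")) {
            return Paths.get(home).resolve(input.substring(2));
        }
        return Paths.get(input);
    }
}
```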
Reading the entire SSTable into memory when analyzing the output of sstable2json is rather cumbersome, and sometimes not feasible given the size of the resulting JSON.
Having a command line option that would output one partition object per line (or by some other delimiter), would solve this by allowing a user to load one partition at a time into memory. While partition sizes can get rather large, they will not be as large as the SSTable itself.
This output would then have no need for the beginning and ending array brackets, nor the trailing comma after each of the partitions.
I can also see a possible performance benefit here as you can read the SSTable and output each partition as you read it, rather than reading the entire SSTable in all at once.
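A toy sketch of the proposed format (names hypothetical, and using strings in place of real partition serialization): one JSON object per line, no surrounding array, so a consumer can stream the dump one partition at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.List;

public class NdjsonDump {
    // Emit one partition's JSON object per line, with no array brackets
    // and no trailing commas between partitions.
    static String dump(List<String> partitionJsons) {
        StringBuilder sb = new StringBuilder();
        for (String p : partitionJsons) {
            sb.append(p).append('\n');
        }
        return sb.toString();
    }

    // A consumer can then hold at most one partition in memory at a time.
    static int countPartitions(Reader in) throws IOException {
        int n = 0;
        try (BufferedReader r = new BufferedReader(in)) {
            while (r.readLine() != null) {
                n++;
            }
        }
        return n;
    }
}
```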
allow merging sstables offline
Print pretty ascii tables like cqlsh; would be nice with #22. Could have another command to dump it (toTable) and also add an option to switch the transformer to use on the select command.
When providing sstables, allow globs instead of only allowing 1 sstable. Useful to pull in all results from a data directory, e.g.:
sstable-tools select count(*) from data/keyspace/table/ma-*-Data.db where user = 1
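The JDK already understands this glob syntax, so a sketch of the matching step (hypothetical helper, not the tool's code) needs no extra dependencies:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobMatch {
    // True when fileName matches a shell-style glob such as "ma-*-Data.db".
    static boolean matches(String glob, String fileName) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return m.matches(Paths.get(fileName));
    }
}
```

Expanding the argument would then be a directory listing filtered through this matcher.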