
variety's Introduction

Meet Variety, a Schema Analyzer for MongoDB

This lightweight tool helps you get a sense of your application's schema, as well as any outliers to that schema. It is particularly useful when you inherit a codebase with a data dump and want to quickly learn how the data is structured. It is also useful for finding rare keys.


“I happen to slowly be falling in love with Variety! It is actually one of the most useful tools to get a sense for a messy/unknown data set, and I have put it in a few of our exercises at Zipfian Academy.”

Jon Dinu, Co-founder of Zipfian Academy


Also featured on the official MongoDB blog.

An Easy Example

We'll make a collection:

db.users.insert({name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!"});
db.users.insert({name: "Dick", bio: "I swordfight.", birthday: new Date("1974/03/14")});
db.users.insert({name: "Harry", pets: "egret", birthday: new Date("1984/03/14")});
db.users.insert({name: "Geneviève", bio: "Ça va?"});
db.users.insert({name: "Jim", someBinData: new BinData(2,"1234")});

So, let's see what we've got here:

$ mongo test --eval "var collection = 'users'" variety.js

+------------------------------------------------------------------+
| key                | types              | occurrences | percents |
| ------------------ | ------------       | ----------- | -------- |
| _id                | ObjectId           |           5 |    100.0 |
| name               | String             |           5 |    100.0 |
| bio                | String             |           3 |     60.0 |
| birthday           | Date               |           2 |     40.0 |
| pets               | Array(1),String(1) |           2 |     40.0 |
| someBinData        | BinData-old        |           1 |     20.0 |
| someWeirdLegacyKey | String             |           1 |     20.0 |
+------------------------------------------------------------------+

("test" is the database containing the collection we are analyzing.)

Hmm. Looks like everybody has a "name" and "_id". Most, but not all have a "bio".

Interestingly, it looks like "pets" can be either an array or a string, with one occurrence of each in this collection. Will this cause any problems in the application, I wonder?
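The counts in the table can be reproduced by hand on a small scale. What follows is a minimal sketch, not Variety's actual implementation: typeName and summarizeKeys are hypothetical helpers, the documents are plain JavaScript objects rather than a MongoDB collection, and the _id and someBinData fields are left out because they need MongoDB-specific types.

```javascript
// Hypothetical helper: names a value's type roughly the way the table does.
function typeName(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "Array";
  if (value instanceof Date) return "Date";
  const t = typeof value;
  return t.charAt(0).toUpperCase() + t.slice(1); // "string" -> "String"
}

// Hypothetical helper: for each top-level key, counts how many documents
// contain it, which types it takes, and the share of documents it appears in.
function summarizeKeys(docs) {
  const summary = {};
  for (const doc of docs) {
    for (const [key, value] of Object.entries(doc)) {
      const entry = summary[key] || (summary[key] = { types: {}, occurrences: 0 });
      const type = typeName(value);
      entry.types[type] = (entry.types[type] || 0) + 1;
      entry.occurrences += 1;
    }
  }
  for (const entry of Object.values(summary)) {
    entry.percent = (entry.occurrences * 100) / docs.length;
  }
  return summary;
}

// In-memory stand-ins for the example users (no _id, no BinData).
const users = [
  { name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!" },
  { name: "Dick", bio: "I swordfight.", birthday: new Date("1974-03-14") },
  { name: "Harry", pets: "egret", birthday: new Date("1984-03-14") },
  { name: "Geneviève", bio: "Ça va?" },
  { name: "Jim" },
];

const summary = summarizeKeys(users);
console.log(summary.pets);
// pets: { types: { Array: 1, String: 1 }, occurrences: 2, percent: 40 }
```

Keeping a per-type counter for each key is what makes mixed-type keys such as "pets" stand out.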

Seems like the first document created has a weird legacy key—those damn fools who built the prototype didn't clean up after themselves. If there were a thousand such early documents, I might cross-reference the codebase to confirm they are no longer used, and then delete them all. That way they'll not confuse any future developers.

Results can be stored for future use in a varietyResults database (see the persistResults option).

See Progress When Analysis Takes a Long Time

Tailing the log is great for this. Mongo provides a "percent complete" measurement for you. These operations can take a long time on huge collections.

Analyze Only Recent Documents

Perhaps you have a really large collection, and you can't wait a whole day for Variety's results.

Perhaps you want to ignore a collection's oldest documents, and see only how its documents have been structured lately.

One can apply a "limit" constraint, which analyzes only the newest documents in a collection (unless sorting), like so:

$ mongo test --eval "var collection = 'users', limit = 1" variety.js

Let's examine the results closely:

+----------------------------------------------------+
| key         | types       | occurrences | percents |
| ----------- | ----------- | ----------- | -------- |
| _id         | ObjectId    |           1 |    100.0 |
| name        | String      |           1 |    100.0 |
| someBinData | BinData-old |           1 |    100.0 |
+----------------------------------------------------+

We are only examining the last document here ("limit = 1"). It belongs to Jim, and only contains the _id, name and someBinData fields. So it makes sense these are the only three keys.

Analyze Documents to a Maximum Depth

Perhaps you have a potentially very deep nested object structure, and you don't want to see more than a few levels deep in the analysis.

One can apply a "maxDepth" constraint, which limits the depth Variety will recursively search to find new objects.

db.users.insert({name:"Walter", someNestedObject:{a:{b:{c:{d:{e:1}}}}}});

The default will traverse all the way to the bottom of that structure:

$ mongo test --eval "var collection = 'users'" variety.js

+----------------------------------------------------------------+
| key                        | types    | occurrences | percents |
| -------------------------- | -------- | ----------- | -------- |
| _id                        | ObjectId |           1 |    100.0 |
| name                       | String   |           1 |    100.0 |
| someNestedObject           | Object   |           1 |    100.0 |
| someNestedObject.a         | Object   |           1 |    100.0 |
| someNestedObject.a.b       | Object   |           1 |    100.0 |
| someNestedObject.a.b.c     | Object   |           1 |    100.0 |
| someNestedObject.a.b.c.d   | Object   |           1 |    100.0 |
| someNestedObject.a.b.c.d.e | Number   |           1 |    100.0 |
+----------------------------------------------------------------+

$ mongo test --eval "var collection = 'users', maxDepth = 3" variety.js

+----------------------------------------------------------+
| key                  | types    | occurrences | percents |
| -------------------- | -------- | ----------- | -------- |
| _id                  | ObjectId |           1 |    100.0 |
| name                 | String   |           1 |    100.0 |
| someNestedObject     | Object   |           1 |    100.0 |
| someNestedObject.a   | Object   |           1 |    100.0 |
| someNestedObject.a.b | Object   |           1 |    100.0 |
+----------------------------------------------------------+

As you can see, Variety only traversed three levels deep.
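The effect of a depth limit can be sketched as a recursive walk that stops descending once the limit is reached. This is only an illustration under stated assumptions: collectKeys is a hypothetical helper, not a function from variety.js, and it works on a plain object without the _id field.

```javascript
// Hypothetical helper: returns dotted key paths, descending at most maxDepth levels.
function collectKeys(doc, maxDepth = Infinity, prefix = "", depth = 1, keys = []) {
  if (depth > maxDepth) return keys;
  for (const [key, child] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    keys.push(path);
    // Only plain nested objects are followed; arrays are left out of this sketch.
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      collectKeys(child, maxDepth, path, depth + 1, keys);
    }
  }
  return keys;
}

const walter = { name: "Walter", someNestedObject: { a: { b: { c: { d: { e: 1 } } } } } };

console.log(collectKeys(walter, 3));
// [ 'name', 'someNestedObject', 'someNestedObject.a', 'someNestedObject.a.b' ]
console.log(collectKeys(walter).length);
// 7
```

With the limit set to 3, the walk reports the same non-_id keys as the second table; without a limit it reaches someNestedObject.a.b.c.d.e.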

Analyze a Subset of Documents

Perhaps you have a large collection, or you only care about some subset of the documents.

One can apply a "query" constraint, which takes a standard Mongo query object, to filter the set of documents required before analysis.

$ mongo test --eval "var collection = 'users', query = {'caredAbout':true}" variety.js

Analyze Documents Sorted In a Particular Order

Perhaps you want to analyze a subset of documents sorted in an order other than creation order, say, for example, sorted by when documents were updated.

One can apply a "sort" constraint, which analyzes documents in the specified order like so:

$ mongo test --eval "var collection = 'users', sort = { updated_at : -1 }" variety.js
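The query, sort and limit options can be combined. A rough mental model, assuming they behave like a MongoDB cursor built with find, sort and limit (filter first, then order, then cut off), is sketched below on an in-memory array. selectDocuments and its option names are hypothetical and exist only for this illustration.

```javascript
// Hypothetical stand-in for find(query).sort(sort).limit(limit) on an array.
function selectDocuments(docs, { matches = () => true, sortBy, descending = true, limit = Infinity } = {}) {
  const selected = docs.filter(matches); // "query": keep matching documents
  if (sortBy) {
    selected.sort((a, b) => {           // "sort": order by one field
      const order = a[sortBy] < b[sortBy] ? -1 : a[sortBy] > b[sortBy] ? 1 : 0;
      return descending ? -order : order;
    });
  }
  return selected.slice(0, limit);       // "limit": keep the first N
}

const docs = [
  { name: "Tom", caredAbout: true, updated_at: 3 },
  { name: "Dick", caredAbout: false, updated_at: 5 },
  { name: "Harry", caredAbout: true, updated_at: 9 },
];

const picked = selectDocuments(docs, {
  matches: (doc) => doc.caredAbout === true,
  sortBy: "updated_at",
  limit: 1,
});
console.log(picked.map((doc) => doc.name));
// [ 'Harry' ]
```

Because the limit is applied last, changing the sort changes which documents end up being analyzed.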

Include Last Value

You can set the lastValue property to show the values from the last document.

$ mongo test --eval "var collection = 'orders', lastValue = true" variety.js

+--------------------------------------------------------------------------------------------+
| key             | types        | occurrences | percents | lastValue                        |
| --------------- | ------------ | ----------- | -------- | -------------------------------- |
| _id             | ObjectId     |           1 |    100.0 | 5a834b76f4d3fa6e578a67f6         |
| age             | Number       |           1 |    100.0 |                          38.2569 |
| animals         | Array        |           1 |    100.0 | [Array]                          |
| animals.XX.type | String       |           1 |    100.0 | dog                              |
| balance         | NumberLong   |           1 |    100.0 |                 1236458945684846 |
| date            | Date         |           1 |    100.0 |                    1513539969000 |
| fn              | Object       |           1 |    100.0 | [Object]                         |
| fn.code         | String       |           1 |    100.0 | function (x, y){ return x + y; } |
| name            | String       |           1 |    100.0 | John                             |
| nil             | null         |           1 |    100.0 | [null]                           |
| uid             | BinData-UUID |           1 |    100.0 | 3b241101e2bb42558caf4136c566a962 |
+--------------------------------------------------------------------------------------------+

If used without sort, it fetches the values of the last document in natural order. Dates are converted to timestamps, ObjectIds to strings, and binary data to hex. Other types are shown in square brackets.
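The conversions described above can be pictured with a small formatter. This is a sketch under assumptions: formatLastValue is a hypothetical name, and the ObjectId and binary cases are omitted because those types only exist inside the MongoDB shell or a driver.

```javascript
// Hypothetical formatter mirroring the conversions described in the text.
function formatLastValue(value) {
  if (value === null) return "[null]";               // shown in square brackets
  if (value instanceof Date) return value.getTime(); // Date -> millisecond timestamp
  if (Array.isArray(value)) return "[Array]";        // containers are not expanded
  if (typeof value === "object") return "[Object]";
  return value;                                      // strings and numbers as-is
}

console.log(formatLastValue(new Date(1513539969000))); // 1513539969000
console.log(formatLastValue([{ type: "dog" }]));        // [Array]
console.log(formatLastValue("John"));                   // John
```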

Render Output As JSON For Easy Ingestion and Parsing

Variety supports two different output formats:

  • ASCII: nicely formatted tables (as in this README)
  • JSON: valid JSON results for subsequent processing in other tools (see also quiet option)

The default format is ascii. You can select the format with the outputFormat property provided to Variety. Valid values are ascii and json.

$ mongo test --quiet --eval "var collection = 'users', outputFormat='json'" variety.js
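Once the output is JSON, other tools can post-process it. The sketch below filters for rare keys. The field names (_id.key, totalOccurrences, percentContaining) are an assumption borrowed from the result documents quoted in the issue reports further down this page; they may differ between Variety versions, so check your own output first. rareKeys is a hypothetical helper.

```javascript
// Sample text standing in for Variety's JSON output (field names are assumed).
const output = `[
  {"_id": {"key": "name"}, "totalOccurrences": 5, "percentContaining": 100},
  {"_id": {"key": "bio"}, "totalOccurrences": 3, "percentContaining": 60},
  {"_id": {"key": "someWeirdLegacyKey"}, "totalOccurrences": 1, "percentContaining": 20}
]`;

// Hypothetical helper: lists keys present in fewer than `threshold` percent of documents.
function rareKeys(jsonText, threshold) {
  return JSON.parse(jsonText)
    .filter((entry) => entry.percentContaining < threshold)
    .map((entry) => entry._id.key);
}

console.log(rareKeys(output, 50));
// [ 'someWeirdLegacyKey' ]
```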

Quiet Option

Both MongoDB and Variety print some additional information to standard output. To suppress it, pass the --quiet option to the mongo executable. Variety also reads that option and mutes its own unnecessary output. This is useful in combination with outputFormat=json: you then receive only JSON, without any other characters around it.

$ mongo test --quiet --eval "var collection = 'users', sort = { updated_at : -1 }" variety.js

Log Keys and Types As They Arrive Option

Sometimes you want to see the keys and types as they come in. Maybe you have a large dataset and want accurate results, but you are also impatient and want to see something now. Or maybe you have a large, mangled dataset with crazy keys (that probably shouldn't be keys) and Variety is running out of memory. This option shows you the keys and types as they arrive and helps you identify problems with your dataset without waiting for the Variety script to finish.

$ mongo test --eval "var collection = 'users', sort = { updated_at : -1 }, logKeysContinuously = true" variety.js

Exclude Subkeys

Sometimes you inherit a database full of junk. Maybe the previous developer put data in the database keys, which causes Variety to run out of memory. After you've run Variety with logKeysContinuously to figure out which subkeys may be a problem, you can use this option to run Variety without those subkeys.

db.users.insert({name:"Walter", someNestedObject:{a:{b:{c:{d:{e:1}}}}}, otherNestedObject:{a:{b:{c:{d:{e:1}}}}}});

$ mongo test --eval "var collection = 'users', sort = { updated_at : -1 }, excludeSubkeys = [ 'someNestedObject.a.b' ]" variety.js

+-----------------------------------------------------------------+
| key                         | types    | occurrences | percents |
| --------------------------- | -------- | ----------- | -------- |
| _id                         | ObjectId |           1 |    100.0 |
| name                        | String   |           1 |    100.0 |
| someNestedObject            | Object   |           1 |    100.0 |
| someNestedObject.a          | Object   |           1 |    100.0 |
| someNestedObject.a.b        | Object   |           1 |    100.0 |
| otherNestedObject           | Object   |           1 |    100.0 |
| otherNestedObject.a         | Object   |           1 |    100.0 |
| otherNestedObject.a.b       | Object   |           1 |    100.0 |
| otherNestedObject.a.b.c     | Object   |           1 |    100.0 |
| otherNestedObject.a.b.c.d   | Object   |           1 |    100.0 |
| otherNestedObject.a.b.c.d.e | Number   |           1 |    100.0 |
+-----------------------------------------------------------------+
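Judging by the table, an excluded subkey is still reported itself, but nothing beneath it is. A minimal sketch of that behaviour, assuming a simple recursive walk over a plain object (collectKeysExcluding is a hypothetical helper, not the code in variety.js):

```javascript
// Hypothetical helper: lists dotted key paths but does not descend below excluded paths.
function collectKeysExcluding(doc, excludeSubkeys = [], prefix = "", keys = []) {
  for (const [key, child] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    keys.push(path); // the excluded key itself is still listed
    const isPlainObject = child !== null && typeof child === "object" && !Array.isArray(child);
    if (isPlainObject && !excludeSubkeys.includes(path)) {
      collectKeysExcluding(child, excludeSubkeys, path, keys);
    }
  }
  return keys;
}

const walterDoc = {
  name: "Walter",
  someNestedObject: { a: { b: { c: { d: { e: 1 } } } } },
  otherNestedObject: { a: { b: { c: { d: { e: 1 } } } } },
};

const found = collectKeysExcluding(walterDoc, ["someNestedObject.a.b"]);
console.log(found.includes("someNestedObject.a.b"));    // true
console.log(found.includes("someNestedObject.a.b.c"));  // false
console.log(found.includes("otherNestedObject.a.b.c")); // true
```

Skipping the descent, rather than filtering the finished result, is what would save memory on collections where data was stored in key names.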

Secondary Reads

Analyzing a large collection on a busy replica set primary could take a lot longer than if you read from a secondary. To do so, we have to tell MongoDB it's okay to perform secondary reads by setting the slaveOk property to true:

$ mongo secondary.replicaset.member:31337/somedb --eval "var collection = 'users', slaveOk = true" variety.js

Save Results in MongoDB For Future Use

By default, Variety prints results only to standard output and does not store them in MongoDB itself. If you want to persist them automatically in MongoDB for later usage, you can set the parameter persistResults. Variety then stores result documents in database varietyResults and the collection name is derived from the source collection's name. If the source collection's name is users, Variety will store results in collection usersKeys under varietyResults database.

$ mongo test --quiet --eval "var collection = 'users', persistResults=true" variety.js

To persist to an alternate MongoDB database, you may specify the following parameters:

  • resultsDatabase - The database to store Variety results in. Accepts either a database name or a host[:port]/database URL.
  • resultsCollection - Collection to store Variety results in. WARNING: This collection is dropped before results are inserted.
  • resultsUser - MongoDB username for results database
  • resultsPass - MongoDB password for results database

$ mongo test --quiet --eval "var collection = 'users', persistResults=true, resultsDatabase='db.example.com/variety'" variety.js

Reserved Keys

Variety expects keys to be well formed, with no '.'s in them (mongo 2.4 allows dots in certain cases). Variety also uses the pseudo key 'XX', and keys corresponding to the regex 'XX\d+XX.*', to represent arrays. If these conflict with keys in your database, you can change the string XX in these patterns to whatever you like using the arrayEscape parameter.

$ mongo test --quiet --eval "var collection = 'users', arrayEscape = 'YY'" variety.js
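To decide whether the default escape string collides with your data, you could scan a sample document for keys that look reserved. The following is a sketch only: findReservedKeys is a hypothetical helper, and it checks just the two patterns named above (the bare escape string, and the escape string wrapped around digits at the start of a key).

```javascript
// Hypothetical helper: returns paths of keys that collide with the reserved patterns.
function findReservedKeys(doc, arrayEscape = "XX") {
  // Escape regex metacharacters in case a custom escape string contains any.
  const escaped = arrayEscape.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const reserved = new RegExp(`^${escaped}\\d+${escaped}`);
  const conflicts = [];

  (function walk(value, prefix) {
    if (value === null || typeof value !== "object") return;
    for (const [key, child] of Object.entries(value)) {
      const path = prefix ? `${prefix}.${key}` : key;
      // Array indices are positions, not keys, so they are never conflicts.
      if (!Array.isArray(value) && (key === arrayEscape || reserved.test(key))) {
        conflicts.push(path);
      }
      walk(child, path);
    }
  })(doc, "");

  return conflicts;
}

const sample = { XX: 1, items: [{ XX12XXname: "a", ok: true }], fine: { YY: 2 } };

console.log(findReservedKeys(sample));       // [ 'XX', 'items.0.XX12XXname' ]
console.log(findReservedKeys(sample, "YY")); // [ 'fine.YY' ]
```

An empty result for a candidate string suggests it is safe to pass as arrayEscape, at least for the documents you sampled.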

Command Line Interface

Variety itself is command-line friendly, as shown in the examples above. But if you are an NPM and Node.js user, you might prefer the variety-cli project. It simplifies usage of Variety by removing the complexity of passing variables in the --eval argument and providing a path to the variety.js library.

Example of a simplified command-line usage:

variety test/users --outputFormat='json' --quiet

For more details see the documentation of variety-cli project.

"But my dad told me MongoDB is a schemaless database!"

First of all, your father is a great guy. Moving on...

A Mongo collection does not enforce a predefined schema like a relational database table. Still, documents in real-world collections nearly always have large sections for which the format of the data is the same. In other words, there is a schema to the majority of collections, it's just enforced by the application, rather than by the database system. And this schema is allowed to be a bit fuzzy, in the same way that a given table column might not be required in all rows, but to a much greater degree of flexibility. So we examine what percent of documents in the collection contain a key, and we get a feel for, among other things, how crucial that key is to the proper functioning of the application.

Dependencies

Absolutely none, except MongoDB. Written in 100% JavaScript. (mongod's "noscripting" must not be set to true, and 'strict mode' must be disabled.)

Development, Hacking

This project is NPM based and provides standard NPM functionality. As an additional (not required) dependency, Docker can be installed to test against different MongoDB versions.

To install all dev dependencies call as usual:

npm install

By default, tests expect MongoDB available on localhost:27017 and can be executed by calling:

npm test

If you have Docker installed and don't want to test against your own MongoDB instance, you can execute tests against dockerized MongoDB:

MONGODB_VERSION=3.2 npm run test:docker

The script downloads one of the official MongoDB images (based on the version you provide), starts the database, executes the test suite against it (inside the container) and stops the DB.

Reporting Issues / Contributing

Please report any bugs and feature requests on the GitHub issue tracker. I will read all reports!

I accept pull requests from forks. Very grateful to accept contributions from folks.

Core Maintainers

Special Thanks

Additional special thanks to Gaëtan Voyer-Perraul (@gatesvp) and Kristina Chodorow (@kchodorow) for answering other people's questions about how to do this on Stack Overflow, thereby providing me with the initial seed of code which grew into this tool.

Much thanks also, to Kyle Banker (@Hwaet) for writing an unusually good book on MongoDB, which has taught me everything I know about it so far.

Tools Which Use Variety (Open Source)

Know of one? Built one? Let us know!

Stay Safe

I have every reason to believe this tool will not corrupt your data or harm your computer. But if I were you, I would not use it in a production environment.

Released by Maypop Inc, © 2012–2023, under the MIT License (http://www.opensource.org/licenses/MIT).

variety's People

Contributors

acetolyne, davidwittman, freeeve, gimmi, jacob111, jamescropcho, jamesdelonay, jmargeta, omachala, predictive, rutsky, tggreene, timludwinski, todvora

variety's Issues

Critical Bug: Wrong percentage indications?

Sometimes Variety outputs results like this:

[66] oti@core-dev:~/Software/variety  [master]  $ /srv/mongodb/bin/mongo  --eval "var db_name = 'client'; var collection = 'rules'" variety.js
MongoDB shell version: 2.6.4
connecting to: test
Variety: A MongoDB Schema Analyzer
Version 1.4.1, released 14 Oct 2014
Using query of { }
Using limit of 84
Using maxDepth of 99
Using sort of { "_id" : -1 }
Using outputFormat of ascii
Using persistResults of false
removing leaf arrays in results collection, and getting percentages
+----------------------------------------------------------------------------------------------+
| key                                      | types          | occurrences | percents           |
| ---------------------------------------- | -------------- | ----------- | ------------------ |
| rules.XX.filter                          | Object         | 5907        | 7032.142857142857  |
| rules.XX.filter.account                  | String         | 5907        | 7032.142857142857  |
| rules.XX.modify                          | Object         | 5907        | 7032.142857142857  |
| rules.XX.name                            | String         | 5907        | 7032.142857142857  |
| rules.XX.modify.channel                  | String         | 5012        | 5966.666666666666  |
| rules.XX.modify.sub_channel              | String         | 5012        | 5966.666666666666  |
| rules.XX.modify.sub2_channel             | String         | 4976        | 5923.809523809524  |
| rules.XX.modify.top_channel              | String         | 4592        | 5466.666666666666  |
| rules.XX.filter.channel                  | String         | 4278        | 5092.857142857143  |
| rules.XX.filter.campaign                 | String         | 3816        | 4542.857142857143  |
| rules.XX.modify.paid_performance_channel | Boolean,String | 1987        | 2365.4761904761904 |
| rules.XX.modify.exclude                  | Boolean        | 1164        | 1385.7142857142858 |
| rules.XX.modify.sea_brand                | String,Boolean | 567         | 675                |
| rules.XX.modify.cpc                      | Number         | 476         | 566.6666666666667  |
| rules.XX.modify.cost                     | Number         | 281         | 334.5238095238095  |
| rules.XX.modify.campaignId               | String         | 168         | 200                |
| rules.XX.modify.crr                      | Number         | 159         | 189.28571428571428 |
| _id                                      | ObjectId       | 84          | 100                |
| author                                   | String         | 84          | 100                |
| rules                                    | Array          | 84          | 100                |
| date                                     | String         | 27          | 32.142857142857146 |
| rules.XX.modify.sea_brand_non_brand      | Boolean        | 4           | 4.761904761904762  |
+----------------------------------------------------------------------------------------------+

This happens for values inside arrays. Notice that rules.XX.modify appears 5907 times in 84 records. The reported percentage is exactly 5907/84*100=7032.14285714285.
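The arithmetic can be reproduced on a toy dataset. The sketch below contrasts counting every array element with counting each document at most once. It is purely illustrative: the function names are hypothetical, and it does not claim to show how Variety counts internally or how the problem was resolved.

```javascript
// Two documents, four array elements in total, each with a "name" key.
const ruleDocs = [
  { rules: [{ name: "a" }, { name: "b" }, { name: "c" }] },
  { rules: [{ name: "d" }] },
];

// Counts every array element that has the key: can exceed the number of documents.
function countPerElement(docs) {
  return docs.reduce(
    (total, doc) => total + doc.rules.filter((rule) => "name" in rule).length,
    0
  );
}

// Counts each document at most once, so the percentage stays between 0 and 100.
function countPerDocument(docs) {
  return docs.filter((doc) => doc.rules.some((rule) => "name" in rule)).length;
}

const percentOf = (count) => (count * 100) / ruleDocs.length;

console.log(percentOf(countPerElement(ruleDocs)));  // 200
console.log(percentOf(countPerDocument(ruleDocs))); // 100
```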

Is it possible to execute variety on a remote mongo db?

This is a sample command I use normally to connect to our remote test mongo db:
mongo rs.instance1/testDB -u username -p password

Here is the updated command to include Variety:
mongo rs.instance1/testDB -u username -p password --eval "var collection = 'collection'" path/to/variety.js

When I try to execute Variety on it, it doesn't seem to initialize the db_name variable, and I get this error:

Variety: A MongoDB Schema Analyzer
Version 1.2.5, released 21 February 2014
Wed Feb 26 15:17:41.959 TypeError: Cannot call method 'forEach' of undefined at variety.js:23

Object.keys is not a function variety.js:186

Hi All,

I'm trying to run variety.js on a database collection and getting the following error.

Mon Dec 29 12:50:05 TypeError: Object.keys is not a function variety.js:186
failed to load: variety.js

It seems to be an issue when it's converting interim results. The error producing line is here:

newEntry['value'] = {'types':Object.keys(entry['types'])};

I'd be happy to investigate this a bit more, but I'm not sure what the Object is referencing. Any help would be appreciated.

Thanks,
Marc

Getting fault errors

I keep getting the following error when running against a collection with about 14,000 documents. It creates 274 accountKeys and it looks like it is probably done processing because the last entry is an attribute that only occurs once.

Tue Dec 17 08:38:42.345 mongo got signal 11 (Segmentation fault: 11), stack trace: 

Tue Dec 17 08:38:42.508 0x10c208740 0x10c0f9ac3 0x7fff88f505aa 0x10c3f5a6d 0x10c2e5047 0x10c3b2649 0x10c3b24b1 0x10c1c51ab 0x10c1c506f 0x10c1010e6 0x10c101e0f 0x7fff8d5b05fd 0x5 
 0   mongo                               0x000000010c208740 _ZN5mongo15printStackTraceERSo + 64
 1   mongo                               0x000000010c0f9ac3 _Z12quitAbruptlyi + 323
 2   libsystem_platform.dylib            0x00007fff88f505aa _sigtramp + 26
 3   mongo                               0x000000010c3f5a6d _ZN2v88internal6Object23GetPropertyWithReceiverEPS1_PNS0_6StringEP18PropertyAttributes + 125
 4   mongo                               0x000000010c2e5047 _ZN2v88internal15DeoptimizerDataD1Ev + 55
 5   mongo                               0x000000010c3b2649 _ZN2v88internal7Isolate6DeinitEv + 105
 6   mongo                               0x000000010c3b24b1 _ZN2v88internal7Isolate8TearDownEv + 81
 7   mongo                               0x000000010c1c51ab _ZN5mongo7V8ScopeD2Ev + 267
 8   mongo                               0x000000010c1c506f _ZN5mongo7V8ScopeD0Ev + 15
 9   mongo                               0x000000010c1010e6 _Z5_mainiPPcS0_ + 24566
 10  mongo                               0x000000010c101e0f main + 95
 11  libdyld.dylib                       0x00007fff8d5b05fd start + 1
 12  ???                                 0x0000000000000005 0x0 + 5

Using a specific host and port ("failed to load my db")

I'm encountering an error trying to use this. Most likely I'm doing something wrong. I barely ever use the command line or "shell" for anything. I use the C# driver and MongoVue for all of my Mongo interactions. But this is compelling enough to want to take a look :)

On my local dev box I have Mongo installed on port 9123 and have a number of databases. Here is what I've typed at the commandline:

E:\mongo\binaries>mongo 127.0.0.1:9123 phillynjnet --eval "var collection = 'speaker'" variety.js

As far as I understand it, I'm running Mongo, connecting to the local server telling it to evaluate the phillynjnet database and the "speaker" collection. I'm getting the following output including the error:

MongoDB shell version: 2.0.2
connecting to: 127.0.0.1:9123/test
loading file: phillynjnet
Thu Apr 26 09:03:33 file [phillynjnet] doesn't exist
failed to load: phillynjnet

Can you please tell me what I'm doing wrong?

TypeError: Cannot call method 'forEach' of undefined

[ec2-user@devserver variety-master]$ mongo $HOST/$DATABASE -u $USER p $PASSWD --eval "var collection = 'world'" variety.js
MongoDB shell version: 2.4.3
connecting to: $HOST/$DATABASE
Variety: A MongoDB Schema Analyzer
Version 1.2.3, released 01 September 2013
Tue Sep 24 21:45:42.210 JavaScript execution failed: TypeError: Cannot call method 'forEach' of undefined at variety.js:L23
failed to load: variety.js

Variety results database - is it needed?

Hi,
I would like to ask, what is the original reason for creating Variety results database and __Keys collection. Currently results are printed either in readable form to stdout or formatted as a JSON and piped to another tool. In those situations, creating results database and collection seems like unexpected side effect to me.

Also some users are running Variety without admin or write permissions (requested in #40).

Does anyone use the collected data later? What is the typical use case?

Thank you,
Tomas

running script without admin database access

Running variety.js with the following command line:
mongo host/database -u user -p password --eval "var collection = 'deviceDetail', maxDepth = 1" scripts/variety-master/variety.js
gives the following output:
MongoDB shell version: 2.6.1
connecting to: host/database
Variety: A MongoDB Schema Analyzer
Version 1.3.0, released 30 May 2014
2014-09-03T13:42:25.987-0700 TypeError: Cannot call method 'forEach' of undefined at scripts/variety-master/variety.js:22
failed to load: scripts/variety-master/variety.js

Further checking leads to looking at the following line:
mongos> db.adminCommand('listDatabases')
{
"note" : "not authorized for command: listDatabases on database admin",
"ok" : 0,
"errmsg" : "unauthorized"
}

which is the beginning of line 22 in the source file.

Is there a way to use the script without access to the admin database? It is not open in all of our databases.

Update README to include instruction how to run the script when using authentication

I guess this is trivial for experienced users*, but nevertheless:

mongo <host_ip>/admin -u <USER_NAME> -p --eval "var db_name = 'databaseToUse'; var collection = 'collectionToScan'; other_option1; other_option2..." variety.js
* I am not really experienced with JS or with MongoDB, but reading the source code was helpful. Having this mentioned in the README would have spared me some minutes ;-)

XX in key names

In the analysis results I see key names with an added '.XX' that do not exist in the real collection.
I created the collection with these commands:

db.blog.insert({title:"Article 1", comments:[{author:"John", body:"it works"}]});
db.blog.insert({title:"Article 2", comments:[{author:"Tom", body:"thanks"}]});

I think result should contain keys _id, title, comments, comments.author, comments.body

The real result looks like this:

tomas@ac100:~/variety-master$ mongo test --eval "var collection = 'blog';" variety.js
MongoDB shell version: 2.2.4
connecting to: test
Variety: A MongoDB Schema Analyzer
Version 1.2.3, released 01 September 2013
Using query of { }
Using limit of 2
Using maxDepth of 99
creating results collection: blogKeys
removing leaf arrays in results collection, and getting percentages
{ "_id" : { "key" : "_id" }, "value" : { "type" : "ObjectId" }, "totalOccurrences" : 2, "percentContaining" : 100 }
{ "_id" : { "key" : "title" }, "value" : { "type" : "String" }, "totalOccurrences" : 2, "percentContaining" : 100 }
{ "_id" : { "key" : "comments" }, "value" : { "type" : "Array" }, "totalOccurrences" : 2, "percentContaining" : 100 }
{ "_id" : { "key" : "comments.XX.author" }, "value" : { "type" : "String" }, "totalOccurrences" : 2, "percentContaining" : 100 }
{ "_id" : { "key" : "comments.XX.body" }, "value" : { "type" : "String" }, "totalOccurrences" : 2, "percentContaining" : 100 }

The '.XX' is added on line 147 of variety.js and not removed.
Is it a bug or expected behaviour? Thank you.

Using limit produces wrong percentage

When you use limit = *, the percentage seems to be based on the number you enter for the limit, even when the query returns fewer documents than the limit. I believe the percentage should be based not on the number you ask for but on the number actually returned.
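The difference between the two bases is easy to see with numbers. This is a sketch of the behaviour the report asks for, not Variety's code; percentContaining is a hypothetical function and the figures are invented for illustration.

```javascript
// Hypothetical calculation: divide by the number of documents actually analyzed,
// which is the smaller of the requested limit and the number returned.
function percentContaining(occurrences, limit, returnedCount) {
  const base = Math.min(limit, returnedCount);
  return base === 0 ? 0 : (occurrences * 100) / base;
}

// A key found in 3 documents, with limit = 10 but only 4 documents returned:
console.log((3 * 100) / 10);               // 30 (based on the limit)
console.log(percentContaining(3, 10, 4));  // 75 (based on what was returned)
```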

Determine a strategy for which MongoDB version (or versions) to support

It may make life easier for us, if we only support the latest production release of MongoDB. I think we can actually get away with this, because Variety is intended to be run on development machines, where version restrictions are often much less harsh than production environments.

It could be soft, in that Variety could still agree to work, but with a deprecation warning explaining that only, say, version 2.2.2 (current as of writing) is officially supported.

It's not a "perfect world" fix, sure, but it would help make things run smoothly for Variety's little community, given our limited resources.

Any thoughts?

Array.isArray is not a function variety.js:121

Hello there,

I just pulled this interesting project and ran into this problem:

mongo mpl --eval "var collection = 'reciepes', limit = 1" variety.js
MongoDB shell version: 2.0.4
connecting to: mpl
Variety: A MongoDB Schema Analyzer
Version 1.4.1, released 14 Oct 2014
Using query of { }
Using limit of 1
Using maxDepth of 99
Using sort of { "_id" : -1 }
Using outputFormat of ascii
Sat Nov 22 22:51:24 TypeError: Array.isArray is not a function variety.js:121
failed to load: variety.js

Any ideas ?

Program should output the name of collection

When printing the output, nowhere does it show which collection the data is for. When you print the collections, you then have to rely on the filename to tell you which collection you are looking at. It would be beneficial to add the name of the collection to the program's output for easier documentation.

Where is the log file?

Hi, your instructions say that you should tail the log file to see percentage complete if it is taking a long time. I don't see any log files appearing in the variety.js folder... Where is the log file in question and what is its filename?

Display ASCII tables' scalars indented to the right

We currently display tables like:

+------------------------------------------------------------+
| key                | types        | occurrences | percents |
| ------------------ | ------------ | ----------- | -------- |
| _id                | ObjectId     | 127         | 100      |
| name               | String       | 15          | 100      |
| bio                | String       | 3           | 60       |
| birthday           | String       | 2           | 40       |
| pets               | String,Array | 2           | 40       |
| someBinData        | BinData-old  | 1           | 20       |
| someWeirdLegacyKey | String       | 1           | 6        |
+------------------------------------------------------------+

For the sake of quick reading/scanning, it would be a significant improvement to display tables like:

+------------------------------------------------------------+
| key                | types        | occurrences | percents |
| ------------------ | ------------ | ----------- | -------- |
| _id                | ObjectId     | 127         | 100      |
| name               | String       |  15         | 100      |
| bio                | String       |   3         |  60      |
| birthday           | String       |   2         |  40      |
| pets               | String,Array |   2         |  40      |
| someBinData        | BinData-old  |   1         |  20      |
| someWeirdLegacyKey | String       |   1         |   6      |
+------------------------------------------------------------+

-James
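
As a sketch of the idea, numeric cells can be padded on the left to the width of the widest value in their column. `padLeft` and `alignColumn` are hypothetical helper names, not existing Variety functions.

```javascript
// Pads a value with leading spaces until it is `width` characters long.
function padLeft(value, width) {
    var text = String(value);
    while (text.length < width) {
        text = ' ' + text;
    }
    return text;
}

// Right-aligns every value in a column to the width of the longest one.
function alignColumn(values) {
    var width = 0;
    values.forEach(function (v) {
        width = Math.max(width, String(v).length);
    });
    return values.map(function (v) {
        return padLeft(v, width);
    });
}

console.log(alignColumn([127, 15, 3])); // [ '127', ' 15', '  3' ]
```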

error while using query with $gte, $lt

Hello.

I ran this command,

/usr/bin/mongo log -eval "var collection='join', query={'date':{'$gte':'2013-09-17','$lt':'2013-09-19'}}, limit=10000, maxDepth=1" public/javascripts/variety.js

and got this result.

MongoDB shell version: 2.4.1
connecting to: log
Variety: A MongoDB Schema Analyzer
Version 1.2.3, released 01 September 2013
Using query of {"date":{"":"2013-09-19"}}
Using limit of 10000
Using maxDepth of 1
creating results collection: joinKeys
removing leaf arrays in results collection, and getting percentages

The query string of the result is different from my command string.

Please help me.
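
A likely explanation, offered as a guess from the log rather than a confirmed diagnosis: the --eval argument is wrapped in double quotes, so the Unix shell treats $gte and $lt as shell variables and replaces them with empty strings before mongo ever sees them. The sketch below only imitates that expansion in JavaScript to show how the reported query comes out. `expandInDoubleQuotes` is a hypothetical helper written for this illustration.

```javascript
// Imitates how a POSIX shell expands $name inside a double-quoted string:
// unset variables become empty strings.
function expandInDoubleQuotes(text, env) {
    return text.replace(/\$([A-Za-z_][A-Za-z0-9_]*)/g, function (match, name) {
        return Object.prototype.hasOwnProperty.call(env, name) ? env[name] : '';
    });
}

var typed = "{'date':{'$gte':'2013-09-17','$lt':'2013-09-19'}}";
var seen = expandInDoubleQuotes(typed, {}); // no shell variables named gte or lt

// Swap the quote style so JSON.parse can read it. Both keys are now '',
// and with duplicate keys the last value wins.
var parsed = JSON.parse(seen.replace(/'/g, '"'));
console.log(JSON.stringify(parsed)); // prints {"date":{"":"2013-09-19"}}
```

If that is the cause, escaping the dollar signs (\$gte, \$lt) or using single quotes as the outer quotes of the --eval argument should keep the operators intact.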

find a cool way to run it

I'm thinking of something perhaps less flexible like
variety <usual mongo shell parameters for connecting to a db>

variety could be a (unix) shell script... just sayin'

uncaught exception: map reduce failed

Tried running variety.js on a modestly-sized collection. After about 10min the script terminated:

Wed May  2 17:15:26 uncaught exception: map reduce failed:{
    "assertion" : "Invalid BSONObj size: 18709033 (0x297A1D01) first element: 0: { types: [ undefined, \"number\", \"array\" ] }",
    "assertionCode" : 10334,
    "errmsg" : "db assertion failure",
    "ok" : 0
}

Thoughts?

Multiple key types not showing in MongoDB 2.0.4 and 2.2.1

Followed the easy example, but the type for pets only shows as "String". Tried on two versions of MongoDB, 2.0.4 and 2.2.1. Below is the output from 2.0.4. Both versions are on a 64-bit Ubuntu 12.04 machine. Any ideas? Thanks!

$ mongo test --eval "var collection = 'users'" variety.js
MongoDB shell version: 2.0.4
connecting to: test
Variety: A MongoDB Schema Analyzer
Version 1.2.2, released 04 November 2012
Using limit of 5
Using maxDepth of 99
creating results collection: usersKeys
removing leaf arrays in results collection, and getting percentages
{ "_id" : { "key" : "_id" }, "value" : { "type" : "ObjectId" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "name" }, "value" : { "type" : "String" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "bio" }, "value" : { "type" : "String" }, "totalOccurrences" : 3, "percentContaining" : 60 }
{ "_id" : { "key" : "pets" }, "value" : { "type" : "String" }, "totalOccurrences" : 2, "percentContaining" : 40 }
{ "_id" : { "key" : "birthday" }, "value" : { "type" : "Date" }, "totalOccurrences" : 2, "percentContaining" : 40 }
{ "_id" : { "key" : "someBinData" }, "value" : { "type" : "BinData-old" }, "totalOccurrences" : 1, "percentContaining" : 20 }
{ "_id" : { "key" : "someWeirdLegacyKey" }, "value" : { "type" : "String" }, "totalOccurrences" : 1, "percentContaining" : 20 }

Update README to reflect our new and pretty ASCII output format

The README displays Variety terminal output in several places, and is still using the old, pre-ASCII, invalid-JSON-style. We should update the README when we have a moment so that the documentation is up-to-date and not confusing (as well as showing off our new and highly readable ASCII tables!).

db.adminCommand is not a function

MongoDB shell version: 1.6.3

Variety: A MongoDB Schema Analyzer
Version 1.2.2, released 04 November 2012
Fri Jan 25 17:32:19 TypeError: db.adminCommand is not a function variety.js:21

Can anyone help?

tks

Output into valid json

I think this would mean adding commas to the end of each line except the last, and enclosing the whole grouping in square brackets. In the meantime, I used this in Python:

$ mongo dbname --eval "var collection = 'collname'" /path/to/variety.js | awk '/{/' > fname

import json

fname = 'fname'  # file written by the shell command above
data = []
with open(fname, 'r') as f:
    for line in f:
        # each kept line is one JSON document describing a key
        l = json.loads(line)
        key = l['_id']['key']
        datatype = l['value']['type']
        occurence = l['totalOccurrences']
        coverage_pct = l['percentContaining']
        data.append(dict(key=key,datatype=datatype,occurence=occurence,coverage_pct=coverage_pct))

print(json.dumps(data))
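
The same idea could be sketched in JavaScript. Instead of splicing in commas by hand, it parses each result line and re-serializes everything as one array. `linesToJsonArray` is a hypothetical helper, and it assumes every line that starts with a brace is valid JSON on its own, as in the outputs quoted in these issues.

```javascript
// Turns Variety's line-per-key output into a single JSON array string.
// Lines that do not start with '{' (banner, "Using limit of ...") are skipped.
function linesToJsonArray(output) {
    var rows = output.split('\n')
        .filter(function (line) {
            return line.trim().charAt(0) === '{';
        })
        .map(function (line) {
            return JSON.parse(line);
        });
    return JSON.stringify(rows);
}
```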

Feedback requested: collection as array feature implemented

Guys, I love this project a lot and have been working hard to contribute. In my latest push to my repo I have added the ability to specify an array for the collection name. It still works with a single string as well. I have made a couple of fixes to ensure that JSON formatting still works. I may make some minor output formatting changes, but it seems to work great. Please test and let me know what you think. The syntax is now:

For a single collection:
mongo test --eval "var collection = 'users'" variety.js

For an array:
mongo test --eval "var collection = ['users', 'articles', 'collection3']" variety.js

In the next day or two I should also have a working copy that will do all collections in the database. I was thinking of using "var mode = recursive". Please test and give feedback; then I will do a pull request if the feature is liked. Thanks, have a wonderful day!!
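
One way to accept either form, as a sketch only and not necessarily how the contributor's branch does it, is to normalize the value to an array up front. `normalizeCollections` is a hypothetical helper name.

```javascript
// Accepts a single collection name or an array of names and always
// returns an array. [].concat avoids relying on Array.isArray, which
// older shells lack.
function normalizeCollections(collection) {
    return [].concat(collection);
}

console.log(normalizeCollections('users'));               // [ 'users' ]
console.log(normalizeCollections(['users', 'articles'])); // [ 'users', 'articles' ]
```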

[NON-ISSUE] Need a PHP Version?

James,

You commented on my Snipplr post for finding out the keys used in a MongoDB collection, as seen here: http://snipplr.com/view/59334/list-of-keys-used-in-mongodb-collection/

I feel your JavaScript library would be very valuable if it were converted into a PHP library.

Would you be interested in giving me the ability to commit, if I was to create the PHP class for doing what your JS class is doing?

Look forward to hearing from you,

David

@DL_JR on Twitter

no method 'toSource'

JavaScript execution failed: TypeError: Object #<Object> has no method 'toSource' at variety.js:L55

It is in the print("Using query of " + query.toSource()); line.

Commented out the line and the script worked perfectly.
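
Rather than dropping the line, a guarded version could keep the message on engines that lack toSource. This is a sketch; `describeQuery` is a hypothetical helper, and JSON.stringify is used here as a stand-in formatter.

```javascript
// Formats a query object for logging. toSource is a non-standard method
// that only some JavaScript engines provide, so check before calling it.
function describeQuery(query) {
    if (typeof query.toSource === 'function') {
        return query.toSource();
    }
    return JSON.stringify(query);
}

console.log('Using query of ' + describeQuery({}));
```

Inside the mongo shell, the built-in tojson helper would be another option.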

3.13m documents results in "Too much data for sort() with no index"

Sample output on a mongo collection of 3.13m records and an unknown number of unique keys:

[root@leonardo variety]# mongo **** --eval "var collection = '****', maxDepth = 3" variety.js
MongoDB shell version: 2.4.1
connecting to: ****
Variety: A MongoDB Schema Analyzer
Version 1.2.2, released 04 November 2012
Using limit of 3132369
Using maxDepth of 3
creating results collection: ****Keys
removing leaf arrays in results collection, and getting percentages
Sat Mar 30 15:13:22.341 JavaScript execution failed: error: {
        "$err" : "too much data for sort() with no index.  add an index or specify a smaller limit",
        "code" : 10128
} at src/mongo/shell/query.js:L128
failed to load: variety.js

maxDepth=2 works fine, but anything beyond that (namely 3 levels) results in this error.

Do you need any other information? If it can be easily resolved with a modification to mongodb's core config, I'd be willing to do that since this is just a dev server.

We appear to have regressed to "type" output rather than "types"

Just noticed this. Using Variety to analyze the README's sample collection, I get the following:

james@laptop:~/variety$  mongo test --eval "var collection = 'users'" variety.js
MongoDB shell version: 2.4.9
connecting to: test
Variety: A MongoDB Schema Analyzer
Version 1.2.6, released 28 March 2014
Using query of { }
Using limit of 5
Using maxDepth of 99
Using sort of { "_id" : -1 }
creating results collection: usersKeys
removing leaf arrays in results collection, and getting percentages
{ "_id" : { "key" : "_id" }, "value" : { "type" : "ObjectId" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "name" }, "value" : { "type" : "String" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "bio" }, "value" : { "type" : "String" }, "totalOccurrences" : 3, "percentContaining" : 60 }
{ "_id" : { "key" : "pets" }, "value" : { "type" : "String" }, "totalOccurrences" : 2, "percentContaining" : 40 }
{ "_id" : { "key" : "birthday" }, "value" : { "type" : "Date" }, "totalOccurrences" : 2, "percentContaining" : 40 }
{ "_id" : { "key" : "someBinData" }, "value" : { "type" : "BinData-old" }, "totalOccurrences" : 1, "percentContaining" : 20 }
{ "_id" : { "key" : "someWeirdLegacyKey" }, "value" : { "type" : "String" }, "totalOccurrences" : 1, "percentContaining" : 20 }

As the README states, we are expecting, for example, types of values for the pets key to include both String and Array.

Not sure how this happened, or if I'm making a mistake.

I will say that it doesn't appear to be due to my 1.2.6 version bump via JSHint validation, as I get the same results after reverting that commit.

James

So does Variety have a Java wrapper?

Tomáš,

I was thinking: our test suite is in Java. Does this mean we can very easily (or even already) have a Java wrapper for Variety?

If it's already there or very easy to get to from here, I would like to add this to our README.

Let me know.

Thank You,
James

Should we leverage Github Releases?

Hello all,

Tomáš (@tovdora) has suggested the following:

We can also consider Github release infrastructure. Plus we get download functionality for free with it - easier for users to download and use Variety. What do you think?

He's referring to Github Releases (note the capital R indicates a proper noun): https://github.com/blog/1547-release-your-software

I barely glanced at it, but it looks cool. Anybody used it? Any thoughts?

Tomáš, do you think it looks like a good move?

-James

Specify subtree instead of maxDepth

It'd be nice to specify, using a tree, which fields should be analysed:

subtree = {f1: 1, f2: 1, f3: {f31: 1, f32: 2}}

(subtree is just a random name; something else might be better)

Even more, it'd be nice to specify which fields to leave out!

mask = {f2: {f24: {f24_field_with_dynamic_data: 1}}}

This is because maxDepth just cuts off at a certain level but this might not be applicable to all use cases (incl. mine).

This feature would also allow for analysis of just specific subdocuments.
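
To make the proposal concrete, here is a rough sketch of how the two options might filter a document before analysis. `selectSubtree`, `applyMask`, and `isPlainObject` are hypothetical names. The semantics are assumptions: any truthy leaf in subtree keeps that field, and any truthy leaf in mask drops it.

```javascript
function isPlainObject(value) {
    return Object.prototype.toString.call(value) === '[object Object]';
}

function has(obj, key) {
    return Object.prototype.hasOwnProperty.call(obj, key);
}

// Keeps only the fields named in `subtree`, recursing into subdocuments.
function selectSubtree(doc, subtree) {
    var out = {};
    Object.keys(subtree).forEach(function (key) {
        if (!has(doc, key)) { return; }
        var spec = subtree[key];
        if (isPlainObject(spec) && isPlainObject(doc[key])) {
            out[key] = selectSubtree(doc[key], spec);
        } else if (spec) {
            out[key] = doc[key]; // truthy leaf: keep the whole value
        }
    });
    return out;
}

// Drops the fields named in `mask` and keeps everything else.
function applyMask(doc, mask) {
    var out = {};
    Object.keys(doc).forEach(function (key) {
        var rule = has(mask, key) ? mask[key] : undefined;
        if (isPlainObject(rule) && isPlainObject(doc[key])) {
            out[key] = applyMask(doc[key], rule);
        } else if (!rule || isPlainObject(rule)) {
            out[key] = doc[key]; // not masked, or nothing inside to mask
        }
    });
    return out;
}
```

Either function could run on each document before its keys are counted, which would also cover the "just specific subdocuments" case.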

uncaught exception: map reduce failed - could not create cursor

This is a different error from the first "uncaught exception: map reduce failed" bug, though it might be related. Mine is easily reproduced using the following source code.

Error
$ mongo garden --eval "var collection = 'user.actions'" variety.js
MongoDB shell version: 2.0.3
connecting to: garden
Variety: A MongoDB Schema Analyzer
Version 1.1, released 03 June 2012
Using limit of 37
Using maxDepth of 99
Fri Jul 20 06:54:45 uncaught exception: map reduce failed:{
"assertion" : "could not create cursor over garden.user.actions for query : {} sort : { _id: -1.0 }",
"assertionCode" : 15876,
"errmsg" : "db assertion failure",
"ok" : 0
}
failed to load: variety.js

Source code - logging.rb

require 'rubygems' 
require 'mongo' 

VIEW_PRODUCT = 0 
ADD_TO_CART = 1 
CHECKOUT = 2 
PURCHASE = 3 

@con = Mongo::Connection.new 
@db = @con['garden'] 

@db.drop_collection("user.actions") 

@db.create_collection("user.actions", :capped => true, :size => 1024) 

@actions = @db['user.actions'] 

40.times do |n| 
  doc = { 
    :username => "kbanker", 
    :action_code => rand(4), 
    :time => Time.now.utc, 
    :n => n 
  } 
  @actions.insert(doc) 
end 

execution
$ ruby logging.rb
$ mongo garden --eval "var collection = 'user.actions'" variety.js

Triggering Variety in Node.JS with an EventEmitter

Since Variety is command-line based, I am using Node.js's child_process.exec to run it through Node. If "exec(cmd...)" is on its own line, the tool works fine. However, if I try to trigger it with a Node.js EventEmitter, the tool runs multiple times until it errors.

child_process.js:945
    throw errnoException(process._errno, 'spawn');
          ^
Error: spawn EMFILE

Does anyone have any thoughts? The way I'm structuring my eventemitter is as follows:

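var events = require('events');
var exec = require('child_process').exec;
var eventEmitter = new events.EventEmitter();

// Shell command that runs Variety against the first configured collection.
var cmd = 'mongo ' + config.dbUrl + ' --eval "var collection = \'' + config.collect[0] + '\'" variety.js';
function key_gen() {
    exec(cmd, puts); // puts is the output callback (not shown here)
}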

eventEmitter.on('db_changes', key_gen);

var target = db[config.collect[0]]; //eg. varietyResults.test3
watcher.watch(target, function(event) { //Watch Mongo oplog for any changes to source
    eventEmitter.emit('db_changes');
});
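
One possible cause, not confirmed here: every oplog event spawns a new mongo process, so a burst of changes (or changes caused by Variety's own writes) can start processes faster than they finish, until the OS runs out of file descriptors (EMFILE). A sketch of one mitigation is to allow only one run at a time and collapse any triggers that arrive meanwhile into a single follow-up run. `makeCoalescingRunner` is a hypothetical helper; `task` stands in for a function that calls exec and invokes `done` from exec's callback.

```javascript
// Wraps a callback-style task so that at most one run is active.
// Triggers received while a run is active are merged into one rerun.
function makeCoalescingRunner(task) {
    var running = false;
    var pending = false;

    function start() {
        running = true;
        task(function done() {
            running = false;
            if (pending) {
                pending = false;
                start();
            }
        });
    }

    return function trigger() {
        if (running) {
            pending = true; // remember that something changed meanwhile
            return;
        }
        start();
    };
}
```

Under these assumptions, the handler would become something like eventEmitter.on('db_changes', makeCoalescingRunner(function (done) { exec(cmd, function () { done(); }); })).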
