Comments (14)
I'm going down the path of trying https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/ with a configurable number of cursors, each read in a separate goroutine. I need to feature-check this and only make it available for MongoDB 2.6+.
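Roughly, the reading side might look something like this with the mgo driver (gopkg.in/mgo.v2). This is only a sketch with illustrative names (parallelRead, scanResult, the handle callback), not gtm's actual implementation:

```go
package scan

import (
	"log"
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// scanResult mirrors the reply shape of parallelCollectionScan:
// { cursors: [ { cursor: { firstBatch: [...], ns: "db.coll", id: <int64> } } ], ok: 1 }
type scanResult struct {
	Cursors []struct {
		Cursor struct {
			FirstBatch []bson.Raw `bson:"firstBatch"`
			NS         string     `bson:"ns"`
			Id         int64      `bson:"id"`
		} `bson:"cursor"`
	} `bson:"cursors"`
}

// parallelRead asks the server for up to numCursors cursors over db.coll and
// drains each one in its own goroutine, calling handle for every document.
func parallelRead(session *mgo.Session, db, coll string, numCursors int, handle func(bson.M)) error {
	cmd := bson.D{
		{Name: "parallelCollectionScan", Value: coll},
		{Name: "numCursors", Value: numCursors},
	}
	var result scanResult
	// The server may return fewer cursors than requested.
	if err := session.DB(db).Run(cmd, &result); err != nil {
		return err
	}
	c := session.DB(db).C(coll)
	var wg sync.WaitGroup
	for _, cur := range result.Cursors {
		wg.Add(1)
		go func(firstBatch []bson.Raw, cursorID int64) {
			defer wg.Done()
			// NewIter continues reading from the server-side cursor id after
			// the firstBatch returned by the command is exhausted.
			iter := c.NewIter(nil, firstBatch, cursorID, nil)
			for {
				doc := bson.M{}
				if !iter.Next(&doc) {
					break
				}
				handle(doc)
			}
			if err := iter.Close(); err != nil {
				log.Println("cursor error:", err)
			}
		}(cur.Cursor.FirstBatch, cur.Cursor.Id)
	}
	wg.Wait()
	return nil
}
```

Each goroutine could also be given its own copied session (session.Copy()) so every cursor gets a dedicated connection rather than sharing one socket.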
from gtm.
@makhdumi thank you for pointing me to the mongo-hadoop-connector code. I've now added splitVector support to monstache and gtm because of that. I'll look into the oplog read lag at some point.
from gtm.
gtm now supports splitting collections and reading them in multiple goroutines, with separate connections, using range selectors.
from gtm.
Hi, the direct read implementation was changed recently in this commit, due to some feedback received in this issue with respect to a 10 million document collection. It went from multiple cursors, each with coordinated skips and limits, to what you see today: a single pass that pages forward with a gte query.
The report was closed on feedback that the situation got better for the large 10m collection after the change. It would be nice to go back to concurrent readers while keeping the improvements for the other fella.
I will be happy to look at a PR. Maybe you could base it off the code prior to the commit referenced above; at that point multi-cursor was optional via DirectReadersPerCol. As far as the chunking method goes, maybe you could make that configurable in the Options passed in.
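For reference, the gte-style read is essentially keyset pagination on _id. A simplified sketch of the pattern (illustrative names, not the actual monstache code):

```go
package paging

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// directRead pages through the whole collection sorted by _id, resuming each
// batch from the last _id seen instead of coordinating skip+limit across cursors.
func directRead(coll *mgo.Collection, batchSize int, handle func(bson.M)) error {
	var lastID interface{}
	for {
		sel := bson.M{}
		if lastID != nil {
			sel["_id"] = bson.M{"$gt": lastID} // resume just after the last document read
		}
		var docs []bson.M
		if err := coll.Find(sel).Sort("_id").Limit(batchSize).All(&docs); err != nil {
			return err
		}
		if len(docs) == 0 {
			return nil // collection exhausted
		}
		for _, d := range docs {
			handle(d)
		}
		lastID = docs[len(docs)-1]["_id"]
	}
}
```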
from gtm.
By the way, with regard to sharded vs. non-sharded: gtm should be connecting directly to the shards and not going through a mongos connection, so I'm not sure the query logic needs to take sharding into account. The example suggests connecting to the config server, getting the shards, using a MultiContext (a session for each shard), and setting up a listener on the config server oplog to handle new shards being added. Maybe I'm missing something?
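For context, "getting the shards" just means reading the config server's config.shards collection. A minimal sketch (addresses and names are illustrative, not gtm code):

```go
package main

import (
	"fmt"

	"gopkg.in/mgo.v2"
)

// shardDoc matches documents in the config server's config.shards collection.
type shardDoc struct {
	ID   string `bson:"_id"`  // e.g. "shard0000"
	Host string `bson:"host"` // e.g. "rs-shard0/host1:27017,host2:27017"
}

func main() {
	// Illustrative config server address.
	cfg, err := mgo.Dial("configsvr.example.com:27019")
	if err != nil {
		panic(err)
	}
	defer cfg.Close()

	var shards []shardDoc
	if err := cfg.DB("config").C("shards").Find(nil).All(&shards); err != nil {
		panic(err)
	}
	for _, s := range shards {
		// A separate session (and oplog listener) would be dialed per shard here.
		fmt.Printf("shard %s at %s\n", s.ID, s.Host)
	}
}
```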
from gtm.
The gte change from skip+limit makes sense. With multiple cursors, I was thinking of multiple range cursors using gte and lt.
Knowing the sharding info just helps with determining the ranges to query on (using the chunks collection) and what the shard key is.
Otherwise, if you don't know that the collection is sharded, you have to query off of _id, i.e. treat it as an unsharded collection? You also have to more or less guess good ranges: a simple method would be to just use the total collection size and the min/max _id (I use this for unsharded collections and it performs badly in our case, since the _ids are not at all evenly distributed). Another way would be to "binary search" for each range.
When I use the simple method that results in bad chunking, I get a ~35 Mbps average. But for sharded collections, when I use the ranges already available in the chunks collection, I get a pretty stable ~80-90 MBps average.
I didn't think of handling new chunks being added or rebalanced though, which is important (it just wasn't an issue in my specific situation). I think that makes it a lot trickier.
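Here is roughly what the chunks-based approach looks like with mgo. It is a sketch with illustrative names: it ignores chunk splits/rebalancing and assumes a single-field shard key, so it is not a drop-in implementation:

```go
package chunked

import (
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// chunk matches the min/max bounds stored per chunk in config.chunks.
type chunk struct {
	Min bson.M `bson:"min"` // e.g. {"userId": MinKey} for a {userId: 1} shard key
	Max bson.M `bson:"max"`
}

// readByChunks loads the pre-computed ranges for db.coll from the config
// server and reads each range concurrently with a $gte/$lt query on the shard key.
func readByChunks(configSess, dataSess *mgo.Session, db, coll, shardKey string, handle func(bson.M)) error {
	ns := db + "." + coll
	var chunks []chunk
	if err := configSess.DB("config").C("chunks").Find(bson.M{"ns": ns}).All(&chunks); err != nil {
		return err
	}
	var wg sync.WaitGroup
	for _, ck := range chunks {
		wg.Add(1)
		go func(ck chunk) {
			defer wg.Done()
			s := dataSess.Copy() // dedicated connection per range
			defer s.Close()
			sel := bson.M{shardKey: bson.M{"$gte": ck.Min[shardKey], "$lt": ck.Max[shardKey]}}
			iter := s.DB(db).C(coll).Find(sel).Iter()
			for {
				doc := bson.M{}
				if !iter.Next(&doc) {
					break
				}
				handle(doc)
			}
			iter.Close()
		}(ck)
	}
	wg.Wait()
	return nil
}
```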
from gtm.
I was totally unaware of parallelCollectionScan!
Knowing about this helps a lot with my own stuff too, thank you. I'm looking forward to the implementation in gtm.
from gtm.
The problem is it doesn't seem to work very well with WiredTiger; it looks like it always returns 1 cursor for that engine. And I'm actually seeing slightly better results on 1 million docs with the current code. I guess I need to try other engines to see whether it helps in any case.
https://jira.mongodb.org/browse/SERVER-17688
from gtm.
mmapv1 is working great though. Multiple cursors get returned and it is pretty fast.
from gtm.
@makhdumi the parallelCollectionScan stuff has been committed to master now. Please let me know how that goes. Of course, it only works for mmapv1 currently, but it should transparently improve when WiredTiger is enhanced.
from gtm.
Just added commit 453cc54, which puts each cursor on a separate connection.
This gives pretty good throughput for 10 million tiny docs. The syncing to ES is turned off in this run, so it measures only MongoDB read performance via mmapv1.
time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 100
INFO 2018/03/16 21:35:08 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:35:08 Parallel collection scan command returned 17/100 cursors requested for test.test
INFO 2018/03/16 21:35:08 Starting 17 go routines to read test.test
10000000
INFO 2018/03/16 21:35:58 Shutting down
real 0m52.016s
user 2m9.244s
sys 0m24.640s
With the code that doesn't utilize the parallel collection scan, it takes 3x as long.
time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 1
INFO 2018/03/16 21:40:39 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:40:39 Parallel collection scan command returned 1/1 cursors requested for test.test
INFO 2018/03/16 21:40:39 Reverting to single-threaded collection read
10000000
INFO 2018/03/16 21:43:21 Shutting down
real 2m43.619s
user 2m37.740s
sys 0m35.496s
from gtm.
You might want to give monstache another try. Recent changes have removed the storage of metadata in MongoDB for the purpose of later deletion. This may have been what was slowing it down for you.
from gtm.
That might have been it, thank you. I actually had to abandon monstache after I was seeing ~70 ms between each oplog read, but I didn't have time to dig further, so I'm not sure if it was my setup or monstache itself.
Thank you for the initial dump improvements as well. I hadn't seen the latest one.
from gtm.
Thanks. I'm looking into adding splitVector support. It seems parallelCollectionScan is going away when mmapv1 does.
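For reference, invoking splitVector from Go could look roughly like this. This is a hedged sketch, not gtm code: the exact options, and which database the internal command must be run against, should be verified.

```go
package split

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// splitPoints asks the server for split keys over db.coll using the internal
// splitVector command, targeting chunks of roughly maxChunkSizeMB megabytes.
// It returns the boundary documents, e.g. [{"_id": ...}, {"_id": ...}, ...],
// which can then be turned into $gte/$lt range queries.
func splitPoints(session *mgo.Session, db, coll string, maxChunkSizeMB int) ([]bson.M, error) {
	var result struct {
		SplitKeys []bson.M `bson:"splitKeys"`
	}
	cmd := bson.D{
		{Name: "splitVector", Value: db + "." + coll},
		{Name: "keyPattern", Value: bson.M{"_id": 1}},
		{Name: "maxChunkSize", Value: maxChunkSizeMB},
	}
	err := session.DB(db).Run(cmd, &result)
	return result.SplitKeys, err
}
```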
from gtm.