Conceptually, if the trees are independent, shouldn't we be able to build the trees in

Parallelize build?,about spotify/annoy

denkuzin commented on May 21, 2024 3

guys, is there any news on the multicore index building?

from annoy.

erikbern commented on May 21, 2024 2

roughly, but it's a bit more complicated._make_tree also modifies the datastructure. You also need a mutex around line 643-644 and line 702-703. Maybe that mutex could just be folded into _allocate_size and we could rename the function to something like _add_item ... probably makes more sense

then you obviously need to create pthreads and collect them afterwards... honestly I haven't done that in like 10 years, but iirc it's not too hard

from annoy.

erikbern commented on May 21, 2024 1

not afaik :(

from annoy.

os-gabe commented on May 21, 2024 1

I think what you are saying makes sense. Thanks for the tip on shared mutex - I hadn't seen that before. It was introduced in c++17 though which may be it's own problem.

I'll take another look and see what I can figure out.

from annoy.

erikbern commented on May 21, 2024

It wouldn't be too hard – you would need a mutex in a couple of places and also watch out for reallocations, but other than that it should be easy

from annoy.

ravimody commented on May 21, 2024

Cool thanks, I'll look into it (although my C++ is a little rusty :))

from annoy.

thomas4g commented on May 21, 2024

Has there been any further work on this?

from annoy.

ravimody commented on May 21, 2024

No work on it from my end.

from annoy.

thomas4g commented on May 21, 2024

Bummer. I'd love to help out, but I don't think I'm familiar enough with annoy yet (or frankly competent enough with C++) to spearhead anything. If anyone starts working on this and wants a hand, I'm happy to help out.

from annoy.

thomas4g commented on May 21, 2024

Actually, I take it back. It looks like a pretty simple change. If I understand it properly, it's line 496 of annoylib.h:

_roots.push_back(_make_tree(indices));

That can get parallelized, but with a mutex around _roots to avoid simultaneous push_back calls, right?

from annoy.

tjrileywisc commented on May 21, 2024

@erikbern Actually I was working on a branch to add this just this weekend. It doesn't quite work yet though (I can only get it to run the precision_test.cpp example, only a debug binary works and it crashes sometimes even then); should I submit a PR?

I didn't touch anything towards the bottom of annoylib.h (near those lines that you mentioned), that might be the problem...

from annoy.

erikbern commented on May 21, 2024

sure, feel free to submit a PR, just make sure to put "WIP" in the subject or something

you definitely need a mutex around those lines, that's probably the issue :)

from annoy.

thomas4g commented on May 21, 2024

@tjrileywisc what a coincidence! I also started working on this over the weekend. Mine also doesn't quite work... but I've pushed a copy to my fork: https://github.com/thomas4g/annoy/tree/parallelize_build

I hope to keep working on it tonight, but if you're further along let me know if you'd like any help! Feel free to ping me here or shoot me an email [email protected]

from annoy.

tjrileywisc commented on May 21, 2024

@thomas4g
Just took a peak at your fork - I'm actually using std::thread instead of pthread . I think Windows doesn't have support for pthread built in. std::thread should be available for gcc and MSVC.

from annoy.

thomas4g commented on May 21, 2024

@tjrileywisc ahh, I wanted to use that but got thrown off by the build not supporting my #include <mutex> by default. I didn't want to fiddle with build settings.

from annoy.

tjrileywisc commented on May 21, 2024

Just submitted a PR for my build_trees_threaded branch.

#246

from annoy.

os-gabe commented on May 21, 2024

I thought I might have a go at this. The actual threading machinery is all straightforward but I'm having some trouble seeing exactly which bits of _make_tree() modify common data and need to be guarded with a mutex. As per previous comments I wrapped some functionality into a thread safe _add_item() function but I'm still getting segfaults indicating data is getting modified out from under my threads somewhere else.

S _add_item() {
    std::lock_guard<std::mutex> guard(_mutex);
    _allocate_size(_n_nodes + 1);
    S item = _n_nodes++;
    return item;
}

Any help with this is appreciated.

from annoy.

erikbern commented on May 21, 2024

interesting – you're probably right that it should be fairly straightforward, but i can't think of other critical sections off the top of my head

from annoy.

os-gabe commented on May 21, 2024

Unless I'm misunderstanding the code this may be more complicated than I first thought. It appears that there are many places within _make_tree that touch the underlying data. For example _get would need a mutex since it could be attempting to access _nodes while another thread is calling _add_item. But worse, since it returns a pointer, anywhere that pointer is used would also need a mutex. I believe this is looking like you would wind up needing mutexes around quite a large portion of _make_tree which would mostly undo the benefits of parallelizing it. It's possible I'm still not understanding the code fully though.

from annoy.

erikbern commented on May 21, 2024

From what you say I think the main issue is that the underlying memory can be reallocated at any point in time, and that invalidates any pointers held by any other thread.

But those reallocations are actually pretty rare so there should be some way to fix.

I haven't dealt with concurrency code in C++ since maybe 2007 so my knowledge is a bit rusty but couldn't you use a shared lock for this? Almost all access will be nonexclusive (so near-zero overhead), but the few times when you need to reallocate the underlying storage, you would have to acquire an exclusive lock. Does that make sense?

from annoy.

erikbern commented on May 21, 2024

I meant a shared mutex. This looks like the right concurrency primitive: https://en.cppreference.com/w/cpp/thread/shared_mutex

So basically acquire a shared lock when writing individual vectors, acquire an exclusive lock when you have to resize the underlying data storage.

But I'm mostly speculating, could be wrong :)

from annoy.

chikubee commented on May 21, 2024

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

from annoy.

erikbern commented on May 21, 2024

no, this would have to be implemented by someone

from annoy.

chikubee commented on May 21, 2024

@erikbern That's sad. I am trying to build an index for over 1M vectors and it is crashing even with on_disk_build. The process took up more than 30 GB memory and crashed.

from annoy.

erikbern commented on May 21, 2024

i'm not sure if parallelization would have helped, though. do you know what's causing it to crash?

from annoy.

ravimody commented on May 21, 2024

Yeah agreed that parallelization probably would not help.

1M vectors isn't that much unless they are extremely high dimensional vectors. Did you try with a smaller number of vectors to see where it starts failing?

from annoy.

erikbern commented on May 21, 2024

good point – annoy isn't meant for super high dimensionality, so if that's what you're facing then you should probably run dimensionality reduction outside of annoy first!

from annoy.

os-gabe commented on May 21, 2024

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

Unfortunately I had to move on to other things and did not get parallel build working

from annoy.

Parallelize build? about annoy HOT 28 OPEN

Comments (28)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent