Coder Social home page Coder Social logo

Comments (8)

joshlk avatar joshlk commented on July 18, 2024 4

I've had a rethink...

I've had a look at how scikit-learn defines sample_weights:

The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset .

Which I think its different to what I said:

I think it would be equivalent to weighting the distances.

and how you described it:

can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples?

All of the above is possible - it's just about figuring out what to weight. Feel free to have a shot at it. I will also have a longer think about what is needed

from k-means-constrained.

hectoradrian961030 avatar hectoradrian961030 commented on July 18, 2024 1

@joshlk I think I have a similar need. In the problem I'm trying to solve, size_max is the sum of the weights of a cluster instead the size of a cluster. A point of X is the centroid of a polygon and the weight of that polygon (point of X) is the sum of its vertices. Do you think the algorithm can be easily modified to handle this?

from k-means-constrained.

joshlk avatar joshlk commented on July 18, 2024

Hi, thanks for your interest in the library πŸ™‚

It would be possible to support that. I think it would be equivalent to weighting the distances. The distances are generated here. You could change the euclidean_distances function to accept a weight parameter. I haven’t looked at how scikit-learn does it but it might be possible to backport the code from there.

I accept contributions to the library. If you would be willing to write a PR - I could review it

from k-means-constrained.

ischmidt20 avatar ischmidt20 commented on July 18, 2024

Just stumbled across this project, nice work @joshlk ! I also have the same use case (each point has a weight, constraints are on the sum of the weights, distances can be weighted or unweighted). Thankfully all of my weights are integers (and none are very big), so I think as a workaround for now I will literally just duplicate the sample points when feeding them into the model. I may also have a look at your codebase and see if I think of anything.

from k-means-constrained.

ischmidt20 avatar ischmidt20 commented on July 18, 2024

So after thinking about this a little more, I don't think having the weights factor into the bounds is feasible. This would be equivalent to adjusting the supply at each node in the min-cost network flow problem. However, even assuming the weights are integers, there is no requirement that all of the flow from one node would follow the same arc. This means that when solving the linear program, fractions of points could be assigned to different clusters (e.g. a point with weight of 10 could have 6 units of flow to one cluster and 4 to another). Ensuring that all of the flow goes in one direction would elevate the problem to an integer program and while it would probably still be solvable for most use cases, it would increase running time dramatically. Of course, the algorithm could simply return the fractional assignments, but I can't imagine why this would be useful, and it would require adjusting the API (what would .labels_ be?)

I think it would be equivalent to weighting the distances.

This should still be feasible, considering that all this would do is weight the costs of each arc in the network flow problem. Sample weights could be used in this way without the algorithm needing to change at all to achieve the same result.

For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.

The way this reads to me is that this is just the combination of the above. In unconstrained K-means, "adding a duplicate point" is only equivalent to weighting the distances, because two points at the same location will always end up in the same cluster. However, in the constrained problem, when you solve the min-cost network flow, the two copies could end up in different clusters, depending on the constraints. So this is also not really feasible.

So it seems as if any sample_weight parameter really only can weight the distances, as described. However, as the conversation here so far indicates, there are many different interpretations of what "sample weight" could mean, so it is important that if such a feature is added, the documentation is clear on the matter.

from k-means-constrained.

joshlk avatar joshlk commented on July 18, 2024

I’m happy to take contributions to the project if you would like to have a go adding this as a feature

from k-means-constrained.

joshlk avatar joshlk commented on July 18, 2024

I'm closing this ticket for now. Feel free to re-open the ticket and comment if your still looking at this and want more assistance

from k-means-constrained.

Frank-Triolo avatar Frank-Triolo commented on July 18, 2024

I am also looking for this feature for a project I am working on. I would love to contribute but I am not sure how - there are valid design concerns stated by @ischmidt20 that I'm not sure I know how to handle best. I do believe that I know how weights should be applied - by a combination of point 1 and point 3 from your rethink comment (should be equivalent to duplicating a point as well as the bounds being dependent on the sum of all weights, as this would work the same way with weights == 1). I'm not sure if simply weighting the edges would solve this problem, or if it could be resolved by simply adjusting the means to account for the weights after they have been assigned, but I will definitely take a look at it!

from k-means-constrained.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.