I've had a rethink...
I've had a look at how scikit-learn defines sample_weights:
The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
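That equivalence is easy to check for the centroid update. A minimal numpy sketch (assuming the cluster assignment is fixed, so only the center computation matters): the weighted mean with a weight of 2 equals the plain mean with that sample duplicated.

```python
import numpy as np

# Three 2-D samples; give the last one a weight of 2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 2.0]])
w = np.array([1.0, 1.0, 2.0])

# Weighted centroid, as scikit-learn computes cluster centers with sample_weight.
weighted_center = np.average(X, axis=0, weights=w)

# The same centroid from literally duplicating the weight-2 sample.
X_dup = np.repeat(X, w.astype(int), axis=0)
duplicated_center = X_dup.mean(axis=0)

print(weighted_center)                                  # [2.25 1.  ]
print(np.allclose(weighted_center, duplicated_center))  # True
```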
Which I think is different to what I said:
I think it would be equivalent to weighting the distances.
and how you described it:
can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples?
All of the above is possible - it's just about figuring out what to weight. Feel free to have a shot at it. I will also have a longer think about what is needed.
from k-means-constrained.
@joshlk I think I have a similar need. In the problem I'm trying to solve, size_max is the sum of the weights of a cluster instead of the size of a cluster. A point of X is the centroid of a polygon, and the weight of that polygon (point of X) is the sum of its vertices. Do you think the algorithm can be easily modified to handle this?
Hi, thanks for your interest in the library!
It would be possible to support that. I think it would be equivalent to weighting the distances. The distances are generated here. You could change the euclidean_distances function to accept a weight parameter. I haven't looked at how scikit-learn does it, but it might be possible to backport the code from there.
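One way such a weight parameter could look (a sketch only, not the library's actual API; the name `weighted_euclidean_distances` and the semantics of scaling each sample's row of the cost matrix by its weight are my assumptions):

```python
import numpy as np

def weighted_euclidean_distances(X, Y, sample_weight=None):
    """Pairwise Euclidean distances between rows of X and rows of Y,
    with row i of the result scaled by sample_weight[i].

    Hypothetical sketch: scaling a sample's distances (the arc costs in
    the min-cost flow problem) by its weight leaves the network
    structure, and hence the constraints, untouched.
    """
    # Standard pairwise Euclidean distance matrix, shape (n_samples, n_centers).
    diffs = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))
    if sample_weight is not None:
        D = D * np.asarray(sample_weight, dtype=float)[:, np.newaxis]
    return D

X = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(weighted_euclidean_distances(X, centers, sample_weight=[1.0, 2.0]))
# [[ 0.  3.]
#  [10.  8.]]
```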
I accept contributions to the library. If you would be willing to write a PR, I could review it.
Just stumbled across this project, nice work @joshlk ! I also have the same use case (each point has a weight, constraints are on the sum of the weights, distances can be weighted or unweighted). Thankfully all of my weights are integers (and none are very big), so I think as a workaround for now I will literally just duplicate the sample points when feeding them into the model. I may also have a look at your codebase and see if I think of anything.
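That integer-weight workaround is a one-liner with numpy. A sketch (the fit call on the expanded data is omitted; mapping labels back via the first duplicate of each row is my assumption about how you'd recover per-sample labels):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
weights = np.array([3, 1, 2])  # integer sample weights

# Expand: row i appears weights[i] times, so the size_min/size_max
# constraints effectively act on the sum of weights per cluster.
X_expanded = np.repeat(X, weights, axis=0)

# After fitting on X_expanded, recover one label per original sample by
# taking the label of the first duplicate of each row.
first_index = np.concatenate(([0], np.cumsum(weights)[:-1]))

print(X_expanded.shape)  # (6, 2)
print(first_index)       # [0 3 4]
```

Note the caveat raised below: under the constraints, duplicates of the same point are not guaranteed to land in the same cluster, so the recovered labels may be ambiguous.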
So after thinking about this a little more, I don't think having the weights factor into the bounds is feasible. This would be equivalent to adjusting the supply at each node in the min-cost network flow problem. However, even assuming the weights are integers, there is no requirement that all of the flow from one node follow the same arc. This means that when solving the linear program, fractions of points could be assigned to different clusters (e.g. a point with a weight of 10 could have 6 units of flow to one cluster and 4 to another). Ensuring that all of the flow goes in one direction would elevate the problem to an integer program, and while it would probably still be solvable for most use cases, it would increase running time dramatically. Of course, the algorithm could simply return the fractional assignments, but I can't imagine why this would be useful, and it would require adjusting the API (what would .labels_ be?)
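The fractional-assignment issue can be seen in a toy LP (a sketch using scipy; the numbers are invented): one point of weight 10, two clusters each with 6 units of remaining capacity, and arc costs of 1 and 2 per unit of flow. The LP relaxation happily splits the point 6/4.

```python
from scipy.optimize import linprog

# Flow variables: x1 = flow from the point to cluster 1, x2 = to cluster 2.
# Minimise 1*x1 + 2*x2
#   subject to x1 + x2 = 10      (the point's weight must be fully routed)
#   and        0 <= x1, x2 <= 6  (remaining capacity of each cluster)
res = linprog(c=[1.0, 2.0],
              A_eq=[[1.0, 1.0]], b_eq=[10.0],
              bounds=[(0, 6), (0, 6)],
              method="highs")

print(res.x)  # [6. 4.] -- the point is split across both clusters
```

Forbidding the split would require binary "all flow on one arc" decisions, i.e. an integer program.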
I think it would be equivalent to weighting the distances.
This should still be feasible, considering that all this would do is weight the costs of each arc in the network flow problem. Sample weights could be used in this way without the algorithm needing to change at all to achieve the same result.
For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
The way this reads to me is that this is just the combination of the above. In unconstrained K-means, "adding a duplicate point" is only equivalent to weighting the distances, because two points at the same location will always end up in the same cluster. However, in the constrained problem, when you solve the min-cost network flow, the two copies could end up in different clusters, depending on the constraints. So this is also not really feasible.
So it seems as if any sample_weight parameter can really only weight the distances, as described. However, as the conversation here so far indicates, there are many different interpretations of what "sample weight" could mean, so it is important that if such a feature is added, the documentation is clear on the matter.
I'm happy to take contributions to the project if you would like to have a go at adding this as a feature.
I'm closing this ticket for now. Feel free to re-open the ticket and comment if you're still looking at this and want more assistance.
I am also looking for this feature for a project I am working on. I would love to contribute, but I am not sure how: there are valid design concerns stated by @ischmidt20 that I'm not sure how best to handle. I do believe I know how weights should be applied: by a combination of point 1 and point 3 from your rethink comment (it should be equivalent to duplicating a point, with the bounds depending on the sum of all weights, since that behaves identically when all weights == 1). I'm not sure whether simply weighting the edges would solve this problem, or whether it could be resolved by adjusting the means to account for the weights after points have been assigned, but I will definitely take a look at it!