I've had a rethink...
I've had a look at how scikit-learn defines sample_weights:
The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
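That equivalence is easy to check for the centroid update. A minimal numpy sketch (assuming the cluster assignment is fixed, so only the center computation matters): the weighted mean with a weight of 2 equals the plain mean with that sample duplicated.

```python
import numpy as np

# Three 2-D samples; give the last one a weight of 2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 2.0]])
w = np.array([1.0, 1.0, 2.0])

# Weighted centroid, as scikit-learn computes cluster centers with sample_weight.
weighted_center = np.average(X, axis=0, weights=w)

# The same centroid from literally duplicating the weight-2 sample.
X_dup = np.repeat(X, w.astype(int), axis=0)
duplicated_center = X_dup.mean(axis=0)

print(weighted_center)                                  # [2.25 1.  ]
print(np.allclose(weighted_center, duplicated_center))  # True
```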
Which I think is different to what I said:
I think it would be equivalent to weighting the distances.
and how you described it:
can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples?
All of the above is possible - it's just about figuring out what to weight. Feel free to have a shot at it. I will also have a longer think about what is needed.
from k-means-constrained.
@joshlk I think I have a similar need. In the problem I'm trying to solve, size_max is the sum of the weights of a cluster instead of the size of a cluster. A point of X is the centroid of a polygon, and the weight of that polygon (point of X) is the sum of its vertices. Do you think the algorithm can be easily modified to handle this?
Hi, thanks for your interest in the library!
It would be possible to support that. I think it would be equivalent to weighting the distances. The distances are generated here. You could change the euclidean_distances function to accept a weight parameter. I haven't looked at how scikit-learn does it, but it might be possible to backport the code from there.
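One way such a weight parameter could look (a sketch only, not the library's actual API; the name `weighted_euclidean_distances` and the semantics of scaling each sample's row of the cost matrix by its weight are my assumptions):

```python
import numpy as np

def weighted_euclidean_distances(X, Y, sample_weight=None):
    """Pairwise Euclidean distances between rows of X and rows of Y,
    with row i of the result scaled by sample_weight[i].

    Hypothetical sketch: scaling a sample's distances (the arc costs in
    the min-cost flow problem) by its weight leaves the network
    structure, and hence the constraints, untouched.
    """
    # Standard pairwise Euclidean distance matrix, shape (n_samples, n_centers).
    diffs = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))
    if sample_weight is not None:
        D = D * np.asarray(sample_weight, dtype=float)[:, np.newaxis]
    return D

X = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(weighted_euclidean_distances(X, centers, sample_weight=[1.0, 2.0]))
# [[ 0.  3.]
#  [10.  8.]]
```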
I accept contributions to the library. If you would be willing to write a PR, I could review it.
Just stumbled across this project, nice work @joshlk ! I also have the same use case (each point has a weight, constraints are on the sum of the weights, distances can be weighted or unweighted). Thankfully all of my weights are integers (and none are very big), so I think as a workaround for now I will literally just duplicate the sample points when feeding them into the model. I may also have a look at your codebase and see if I think of anything.
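That integer-weight workaround is a one-liner with numpy. A sketch (the fit call on the expanded data is omitted; mapping labels back via the first duplicate of each row is my assumption about how you'd recover per-sample labels):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
weights = np.array([3, 1, 2])  # integer sample weights

# Expand: row i appears weights[i] times, so the size_min/size_max
# constraints effectively act on the sum of weights per cluster.
X_expanded = np.repeat(X, weights, axis=0)

# After fitting on X_expanded, recover one label per original sample by
# taking the label of the first duplicate of each row.
first_index = np.concatenate(([0], np.cumsum(weights)[:-1]))

print(X_expanded.shape)  # (6, 2)
print(first_index)       # [0 3 4]
```

Note the caveat raised below: under the constraints, duplicates of the same point are not guaranteed to land in the same cluster, so the recovered labels may be ambiguous.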
So after thinking about this a little more, I don't think having the weights factor into the bounds is feasible. This would be equivalent to adjusting the supply at each node in the min-cost network flow problem. However, even assuming the weights are integers, there is no requirement that all of the flow from one node follow the same arc. This means that when solving the linear program, fractions of points could be assigned to different clusters (e.g. a point with a weight of 10 could have 6 units of flow to one cluster and 4 to another). Ensuring that all of the flow goes in one direction would elevate the problem to an integer program, and while it would probably still be solvable for most use cases, it would increase running time dramatically. Of course, the algorithm could simply return the fractional assignments, but I can't imagine why this would be useful, and it would require adjusting the API (what would .labels_ be?)
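The fractional-assignment issue can be seen in a toy LP (a sketch using scipy; the numbers are invented): one point of weight 10, two clusters each with 6 units of remaining capacity, and arc costs of 1 and 2 per unit of flow. The LP relaxation happily splits the point 6/4.

```python
from scipy.optimize import linprog

# Flow variables: x1 = flow from the point to cluster 1, x2 = to cluster 2.
# Minimise 1*x1 + 2*x2
#   subject to x1 + x2 = 10      (the point's weight must be fully routed)
#   and        0 <= x1, x2 <= 6  (remaining capacity of each cluster)
res = linprog(c=[1.0, 2.0],
              A_eq=[[1.0, 1.0]], b_eq=[10.0],
              bounds=[(0, 6), (0, 6)],
              method="highs")

print(res.x)  # [6. 4.] -- the point is split across both clusters
```

Forbidding the split would require binary "all flow on one arc" decisions, i.e. an integer program.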
I think it would be equivalent to weighting the distances.
This should still be feasible, considering that all this would do is weight the costs of each arc in the network flow problem. Sample weights could be used in this way without the algorithm needing to change at all to achieve the same result.
For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
The way this reads to me is that this is just the combination of the above. In unconstrained K-means, "adding a duplicate point" is only equivalent to weighting the distances, because two points at the same location will always end up in the same cluster. However, in the constrained problem, when you solve the min-cost network flow, the two copies could end up in different clusters, depending on the constraints. So this is also not really feasible.
So it seems as if any sample_weight parameter can really only weight the distances, as described. However, as the conversation here so far indicates, there are many different interpretations of what "sample weight" could mean, so it is important that if such a feature is added, the documentation is clear on the matter.
I'm happy to take contributions to the project if you would like to have a go at adding this as a feature.
I'm closing this ticket for now. Feel free to re-open the ticket and comment if you're still looking at this and want more assistance.
I am also looking for this feature for a project I am working on. I would love to contribute, but I am not sure how: there are valid design concerns stated by @ischmidt20 that I'm not sure how best to handle. I do believe I know how weights should be applied: by a combination of point 1 and point 3 from your rethink comment (it should be equivalent to duplicating a point, with the bounds depending on the sum of all weights, since that behaves identically when all weights == 1). I'm not sure whether simply weighting the edges would solve this problem, or whether it could be resolved by adjusting the means to account for the weights after points have been assigned, but I will definitely take a look at it!