The K-means algorithm seeks out clusters of values in data.
In a simple (and perhaps silly) example let us consider the manufacturing and marketing of ketchup.
Note that this example (with only two variables) is nice because we can easily place each buyer on a 2D scatter plot.
When using the K-means algorithm, we first choose a value for K, which is this case, will be number of ketchup varieties.
The program labels each buyer with a lower case letter, (a-p), and each variety of ketchup as an upper case letter, (A-D)
To start, it randomly assigns each buyer to a variety. and randomly scatters buyers and varieties, across the plot. So the resultant plot is pretty chaotic.
On the plot buyers have the color of their current chosen variety, and just to make the presentation a bit more dramatic, we connect the buyer to the variety with a line of the same color.
In the first step the buyers may chose a more optimal variety (and most do) than the one randomly assigned. Just buyers f and j "decide" to stick with their original assignments. A great deal of order suddenly appears.
Following the swap, varieties A and D move (change their formulations) to the mean of their buyer positions to be optimally close to their buyers.
But then some buyers whose favorite variety moved away from them may decide to switch to another variety that is suddenly a better match to their preferences.
In this example, after six iterations of shuffle-and-swap the program reaches a stable configuration and no swaps occur after the last shuffle.
This should be an overall optimum distribution.
Interestingly, variety D lost all its buyers with the first swap but by the end of the process, it had the most buyers. On the other hand B's three buyers stuck with it. All four varieties moved around quite a bit.
So, in summary, to create 4 varieties of ketchup that bring the most happiness to the most folks, we have computed formulations that now have salt and sugar amounts shown in (A, B, C, D) in the scatter plot.
You can see running by clicking this animation. Use the browsers back button to return here.
You may use any version of python (2.7 or greater) to run these animations.
$ python kmeans.py seed=180
generates the animation above. The seed=180 presets the random number generator. The default screen size is 600x600 pixels.
$ python kmeans.py seed=180 pixels=300
will do the same thing at half the size.
You can watch the buyers switch to different varieties, and then how vendors adjust their formulation to better cater to their current set of customers. Click the mouse inside the graphic window to step through. If you want to exit early you can also click the window-close button in the upper right.
So, in nine steps, the program will
With every second click, buyers may switch to another variety and vendors may re-adjust their formulations.
The arguments on the command line are:
Try the following. Results will be random since we are not seeding the random number generator.
$ python kmeans.py varieties=3 buyers=26
You might notice that these command line arguments look much like arguments in Python function calls. This is all handled in the module sarg. I find it useful for ad-hoc programs.
Once the animated example is seen, the Python code should seem pretty clear and intuitive and you can probably imagine how you might go about coding it. My approach is to make two classes, Variety and Buyer. You can download the project code here. kmeans.zip
There are two object classes Variety and Buyer.
class Variety has the following attributes:
25 class Variety :
26 def __init__ (self, label) :
27 self.x,self.y = randomXY()
28 self.label = label
29 self.buyers = []
30
31 def attachBuyer (self, buyer) :
32 self.buyers.append(buyer)
33
34 def detachBuyer (self, buyer) :
35 if buyer in self.buyers :
36 self.buyers.remove(buyer)
37
38 def centerToBuyers(self) :
39 nBuyers = len(self.buyers)
40 if nBuyers :
41 xsum = ysum = 0.0
42 for buyer in self.buyers :
43 xsum += buyer.x
44 ysum += buyer.y
45 self.x = xsum/nBuyers
46 self.y = ysum/nBuyers
47
48 def distTo (self, buyer) :
49 xdist = abs(self.x-buyer.x)
50 ydist = abs(self.y-buyer.y)
51 return math.sqrt(xdist*xdist + ydist*ydist)
52
Instances of the Buyer class have attributes for a label, position and current variety. These latter two are initially assigned at random. (58). (To make this easy, the varieties are created before the buyers.)
Method choose(), (61), lets a buyer trade his current variety for another that is closer. It looks at every alternative to find the best Variety at the best Distance, and returns true if a swap took place.
54 class Buyer :
55 def __init__ (self, label, varieties) :
56 self.x,self.y = randomXY()
57 self.label = label
58 self.variety = varieties[random.randint(0,len(varieties)-1)]
59 self.variety.attachBuyer(self)
60
61 def choose(self, varieties) :
62 best Variety = self.variety
63 bestDist = best Variety.distTo(self)
64 for tryout in varieties :
65 if tryout != self.variety :
66 tryDist = tryout.distTo(self)
67 if tryDist < bestDist :
68 best Variety = tryout
69 bestDist = tryDist
70 if best Variety != self.variety :
71 self.variety.detachBuyer(self)
72 best Variety.attachBuyer(self)
73 self.variety = best Variety
74 return True # changed variety
75 return False
76
Function kmeans() is the main engine which drives the algorithm.
81 def kmeans(seed) :
82 labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
83 babels= labels.lower()
84
85 varieties = [Variety(labels[i]) for i in range(nVarieties)]
86 buyers = [Buyer(babels[i], varieties) for i in range(nBuyers)]
87 fig = 0
88 display(varieties, buyers, "Initial Placement")
89 while True :
90 changed = False
91 fig += 1
92 for b in buyers :
93 changed = b.choose(varieties) or changed
94 display(varieties, buyers, "Fig %d Buyers Choose Closest"%fig)
95
96 if not changed : break
97
98 fig += 1
99 for v in varieties :
100 v.centerToBuyers()
101 display(varieties, buyers, "Fig %d Varieties Move to Center"%fig)
102 display(varieties, buyers, "Fig %d Done. Stability found"%fig)
103
The function display() erases and fills the graphic window with each step. When done it waits for the user to click the mouse (131) in the window before returning. The last click exits the program. You
getcolor() maps the a label to a color from the colors list.
77 def getColor(label) :
78 slot = ord(label.lower())-ord('a')
79 return colors[slot%len(colors)] #nifty wrap-around
80
104 def display(varieties, buyers, banner) :
105 for obj in drawn : obj.undraw()
106
107 txtBanner = Text(Point(pixels//2,20), banner)
108 txtBanner.draw(win)
109 drawn.append(txtBanner)
110
111 for b in buyers :
112 v = b.variety
113 color = getColor(v.label)
114 p1 = Point(b.x,b.y)
115 p2 = Point(v.x,v.y)
116
117 line = Line(p1,p2)
118 line.setFill(color)
119 line.draw(win)
120 drawn.append(line)
121
122 txtLabel = Text(p1, b.label)
123 txtLabel.draw(win)
124 drawn.append(txtLabel)
125
126 for v in varieties :
127 color = getColor(v.label)
128 txtVar = Text(Point(v.x,v.y),v.label)
129 txtVar.draw(win)
130 drawn.append(txtVar)
131 win.getMouse()
132
At the bottom of kmeans.py we launch the function kmeans with a seed to generate random numbers.
For debugging purposes as well as for reproducing interesting runs, it is necessary to be able to use a known seed to exactly reproduce a computational run. If no seed= is provided the one chosen is printed so you can reuse it. (line 138)
133 if __name__ == "__main__" :
134 seed = sarg.Int("seed", 0)
135 if seed : random.seed(seed)
136 else :
137 seed = random.randint(101,999)
138 print("Seed is %s" % seed)
139 random.seed(seed)
140 kmeans(seed)
141
Again, the code files can be found in kmeans.zip
If you have comments or suggestions You can email me at mail me
Copyright © 2014-2021 Chris Meyers and Fred Obermann
…