K-Means is an algorithm for the well-known clustering problem. It starts with 'k' initial centroids and assigns each document to its closest centroid, which forms the clusters. The centroid of each cluster is then recalculated and the procedure is repeated. The initial centroids influence the final clusters: running the algorithm on the same documents with different initial centroids commonly produces different clusters. Usually, the farther apart the initial centroids are, the better the results.
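The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code this program runs: the names `k_means` and `euclidean` are hypothetical, documents are assumed to be tuples of numbers, and an empty cluster is assumed to keep its previous centroid.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, centroids, iterations, distance=euclidean):
    """Run a fixed number of assign/update rounds; return (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each vector joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        # Assumption of this sketch: an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


docs = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = k_means(docs, centroids=[docs[0], docs[2]], iterations=3)
```

Starting instead from two centroids that lie close together, such as `docs[0]` and `docs[1]`, is an easy way to see how the initial choice can change the intermediate and final clusters.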
How does this program work?
1. Insert vectors
- Manually
- From file
  - txt
  - xml
To enter vectors manually, fill in the fields with vectors. At least three vectors are required.
To load vectors from a file, only the txt and xml formats are accepted. For convenience, you can download an example file to see the expected template.
A txt file should contain all the vectors, one per line.
An xml file should contain not only all the vectors, but also the number of clusters, the number of execution iterations, the distance method (Euclidean or Manhattan) and the cluster assignment (Random or Specified). If the assignment is Specified, the vectors to be used as centroids must also be given.
Each vector corresponds to a unique document. For example, you can give '1, 4, 5, 7'. Duplicates are deleted!
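As an illustration of the input described above, the following sketch reads one vector per line and drops repeated vectors. It assumes comma-separated numbers, as in the example '1, 4, 5, 7', and assumes that "duplicates" means identical vectors. The function name `parse_vectors` is hypothetical, and the program's real template may differ.

```python
def parse_vectors(text):
    """Parse one comma-separated vector per line, skipping blanks and duplicates."""
    seen = set()
    vectors = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        vector = tuple(float(part) for part in line.split(","))
        if vector not in seen:  # keep only the first occurrence of each vector
            seen.add(vector)
            vectors.append(vector)
    return vectors


sample = "1, 4, 5, 7\n2, 0, 3, 1\n1, 4, 5, 7\n"
vectors = parse_vectors(sample)
```

With this sample, the third line repeats the first, so only two vectors remain.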
After filling in everything you need, click the '2. Insert parameters' tab.
2. Insert parameters
This step is skipped if you uploaded an xml file!
At this point, you must give all the parameters of the algorithm, except for the documents (vectors) entered in the previous step.
The 'Cluster assignment' field can take the value 'Random' if you want the centroids to be chosen randomly from the documents. The number of centroids will equal the value of the 'Number of clusters' field, which ranges from 2 to 10.
Otherwise, 'Cluster assignment' takes the value 'Specified', and you should check at least two of the documents in the 'Clusters' field to use as centroids.
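The two assignment modes can be sketched as follows. This is an assumption-laden illustration rather than the program's own logic: the function `initial_centroids` and its parameters `k`, `chosen` and `seed` are hypothetical names, and the limits checked are the ones stated above (2 to 10 clusters, at least two specified centroids).

```python
import random


def initial_centroids(vectors, assignment, k=None, chosen=None, seed=None):
    """Return the starting centroids for 'Random' or 'Specified' assignment."""
    if assignment == "Random":
        if k is None or not 2 <= k <= 10:
            raise ValueError("number of clusters must be between 2 and 10")
        if k > len(vectors):
            raise ValueError("cannot pick more centroids than documents")
        # Draw k distinct documents; a seed makes the draw repeatable.
        return random.Random(seed).sample(vectors, k)
    if assignment == "Specified":
        if chosen is None or len(chosen) < 2:
            raise ValueError("at least two documents must be chosen as centroids")
        if any(c not in vectors for c in chosen):
            raise ValueError("every centroid must be one of the documents")
        return list(chosen)
    raise ValueError("assignment must be 'Random' or 'Specified'")


docs = [(1, 1), (1, 2), (8, 8), (9, 8)]
random_pick = initial_centroids(docs, "Random", k=2, seed=0)
specified_pick = initial_centroids(docs, "Specified", chosen=[docs[0], docs[3]])
```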
The 'Execution iterations' field sets how many iterations the algorithm executes and ranges from 1 to 10.
The 'Distance method' field selects the method used to calculate the distance between vectors.
If $A(x_1, y_1, z_1)$ and $B(x_2, y_2, z_2)$, then the Euclidean distance is:
$$\mathrm{Dist}(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
If $A(x_1, y_1, z_1)$ and $B(x_2, y_2, z_2)$, then the Manhattan distance is:
$$\mathrm{Dist}(A, B) = |x_2 - x_1| + |y_2 - y_1| + |z_2 - z_1|$$
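Both distance methods can be written directly from the formulas above. The sketch below generalizes them to vectors of any equal length. The function names are illustrative and are not taken from the program.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(q - p) for p, q in zip(a, b))


A = (1, 2, 3)
B = (4, 6, 3)
d_euclidean = euclidean_distance(A, B)  # sqrt(9 + 16 + 0) = 5.0
d_manhattan = manhattan_distance(A, B)  # 3 + 4 + 0 = 7
```

For the same pair of vectors the Manhattan distance is never smaller than the Euclidean distance, so the two methods can assign a document to different centroids.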
After filling in everything you need, click the '3. Results' tab.
3. Results
This tab shows the results of the clustering.