I have a set of 2,000 to 4,000 short strings (3-6 characters each) that I want to cluster. Since the data are strings, previous answers to "How does clustering (especially String clustering) work?" suggest that Levenshtein distance is a good choice of distance function. Also, since I do not know the number of clusters in advance, hierarchical clustering seems to be the way to go rather than k-means.
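To make the distance step concrete, here is a minimal sketch in R. It assumes base R's `adist()` (from the `utils` package), which computes generalized Levenshtein distances; the six sample strings are made up for illustration and are not my real data.

```r
# Hypothetical sample; the real set has 2,000-4,000 strings of 3-6 characters.
strings <- c("abc", "abd", "abcd", "xyz", "xyzz", "wxyz")

# adist() returns the full matrix of pairwise (generalized) Levenshtein
# distances between the elements of the character vector.
d <- adist(strings)

# Label rows and columns so the matrix is easier to read.
rownames(d) <- colnames(d) <- strings
print(d)
```

The result is a symmetric matrix with zeros on the diagonal. With 4,000 strings it has 16 million entries, which should still fit in memory as a numeric matrix.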
Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)? For both, one can easily find a Levenshtein distance implementation. The clustering part seems harder. For example, "Clustering text in MATLAB" calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?
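For reference, this is roughly the pipeline I have in mind, sketched in R under some assumptions: `custom_dist` is a hypothetical placeholder for whatever pairwise function is used (here it simply wraps `adist()`), the sample strings are made up, and both the average linkage and the cut height of 2 are arbitrary choices rather than recommendations.

```r
# Hypothetical sample strings standing in for the real data.
strings <- c("abc", "abd", "abcd", "xyz", "xyzz", "wxyz")

# Placeholder custom distance: any function of two strings that returns
# a single non-negative number could be substituted here.
custom_dist <- function(a, b) as.numeric(adist(a, b))

# Build the full pairwise distance matrix by calling custom_dist on every pair.
n <- length(strings)
d <- matrix(0, n, n, dimnames = list(strings, strings))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- custom_dist(strings[i], strings[j])
  }
}

# hclust() expects a "dist" object, so convert the square matrix first.
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an assumed height of 2 to obtain cluster labels
# without fixing the number of clusters in advance.
clusters <- cutree(hc, h = 2)
print(clusters)
```

`plot(hc)` draws the dendrogram, which may help in choosing a cut height, and `cutree(hc, k = ...)` can be used instead if a target number of clusters becomes known.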