I have a distance matrix from a RandomForest, and I want to use hierarchical clustering with Ward's linkage to look for clusters via scikit-learn's `AgglomerativeClustering`. I know that Ward linkage only works with Euclidean distances, and the RandomForest distance matrix consists of (squared) Euclidean distances. I also know that I can use `affinity = 'precomputed'` and input my distance matrix, but then I cannot use Ward's linkage (according to scikit-learn's documentation). Should I then just input my square distance matrix (n x n) and use `affinity = 'euclidean'` and `linkage = 'ward'` instead, since, as far as I can tell, I am not violating any mathematical assumptions?
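
To make that option concrete, here is a minimal sketch of what such a call does. The toy points `X` are a hypothetical stand-in for my data (the real matrix comes from the RandomForest), and the names `labels_from_rows` and `D_between_rows` are only for illustration. As far as I understand, `fit` treats each row of the n x n matrix as a feature vector, so Ward then operates on Euclidean distances between rows of the matrix rather than on the entries of the matrix itself. Newer scikit-learn versions renamed `affinity` to `metric`, so the sketch relies on the default, which is Euclidean for Ward.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated groups of 10 points each.
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
D = pairwise_distances(X, metric="euclidean")  # n x n distance matrix

# Ward with the default Euclidean metric: each ROW of D is read as a feature vector.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels_from_rows = model.fit_predict(D)

# Distances that Ward effectively uses here: distances between rows of D.
D_between_rows = pairwise_distances(D, metric="euclidean")
rows_match_original = np.allclose(D_between_rows, D)
print("row distances equal the original distances:", rows_match_original)
```

On this toy data the two matrices differ, so the clustering is computed on a transformed geometry, even if the resulting labels may still look reasonable.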
I also read that I could perform a PCA on the distance matrix (after double-centering) and then use the result in k-means, since k-means doesn't handle distance matrices directly. Normally, it takes a samples x features matrix (data matrix) as input. Would this post-PCA matrix be a better input for `AgglomerativeClustering` with Ward than the distance matrix?
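
A minimal sketch of the double-centering route (this is classical multidimensional scaling, as I understand it), assuming the matrix `D_sq` holds squared Euclidean distances. The toy data and the names `B`, `coords`, and `D_sq_recovered` are hypothetical, and the eigenvalue threshold is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
# Hypothetical toy data standing in for the real distance matrix.
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
D_sq = pairwise_distances(X, metric="sqeuclidean")  # assumed: squared distances

# Double-centering: B = -1/2 * J @ D_sq @ J, with J = I - (1/n) * ones.
n = D_sq.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D_sq @ J

# Eigendecomposition of the symmetric matrix B; keep clearly positive eigenvalues.
eigvals, eigvecs = np.linalg.eigh(B)
keep = eigvals > 1e-8 * eigvals.max()
coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # samples x features embedding

# Ward on the recovered coordinates (default Euclidean metric).
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(coords)

# Check: squared distances between embedded points reproduce D_sq.
D_sq_recovered = pairwise_distances(coords, metric="sqeuclidean")
print(np.allclose(D_sq_recovered, D_sq, atol=1e-6))
```

If the distances are truly Euclidean, `B` should have no meaningfully negative eigenvalues, and the embedding reproduces the original distances, so Ward on `coords` works with the same geometry as the distance matrix.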