Executive Summary : | This proposal aims to develop (dis)similarity measures in a metric space for numerical, categorical, and mixed datasets, using information entropy to capture the disorder and ensemble properties of the data distribution along each feature. The salient feature of the proposal is that the statistical significance of each attribute is derived from the number of possible microstates for that feature. Entropy would further be used to compute a weight for each attribute, reflecting the contribution of the different features. The proposed measure would be free of user-defined parameters and independent of the distribution of the data points. 1. In general, the characteristic length of a system indicates its scale in the Euclidean feature space. The characteristic length of a feature measures the spread (inhomogeneity) of its all-pair differences. A large characteristic length indicates that the all-pair absolute differences are widely distributed, which makes it a good measure of the weight of that feature. On this basis, a weighted metric would be proposed for numerical data to improve the performance of clustering methods. 2. The characteristic length and Boltzmann entropy would be employed to capture intra-attribute statistical information along features and thereby determine the significance of attributes for clustering categorical data. 3. Similarly, both intra- and inter-attribute data distributions would be captured by entropy to devise a (dis)similarity measure for mixed datasets. |
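The summary does not state its formulas, so the following is only an illustrative sketch of how a microstate-based attribute weight could be computed for categorical data. It assumes the textbook Boltzmann entropy S = ln W (with the Boltzmann constant set to 1), where W = n! / (n_1! n_2! ... n_k!) counts the arrangements of n records over the k observed categories of one attribute. The function names, the normalisation of entropies into weights, and the weighted mismatch distance are hypothetical choices made for this example, not the proposed measure itself.

```python
import math
from collections import Counter


def boltzmann_entropy(column):
    """Return ln W for one categorical column, with W = n! / prod(n_k!).

    Assumption: k_B = 1 and microstates are counted by the multinomial
    coefficient of the category frequencies.
    """
    counts = Counter(column).values()
    n = sum(counts)
    # lgamma(m + 1) == ln(m!), which avoids overflow for large n.
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)


def entropy_weights(rows):
    """Normalise per-attribute entropies so the weights sum to 1.

    Assumption: an attribute's weight is proportional to its entropy.
    Falls back to uniform weights if every attribute is constant.
    """
    columns = list(zip(*rows))
    entropies = [boltzmann_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        return [1.0 / len(columns)] * len(columns)
    return [s / total for s in entropies]


def weighted_mismatch(a, b, weights):
    """Hypothetical dissimilarity: sum of weights of mismatching attributes."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


if __name__ == "__main__":
    # Toy data: the first attribute is constant, the second is evenly split.
    rows = [("x", "a"), ("x", "a"), ("x", "b"), ("x", "b")]
    weights = entropy_weights(rows)
    print(weights)
    print(weighted_mismatch(rows[0], rows[2], weights))
```

In this sketch a constant attribute has a single microstate (W = 1), so its entropy and weight are zero and it contributes nothing to the distance. No parameter has to be chosen by the user, which matches the parameter-free goal stated in the summary.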