Research

Title :	Study and development of Information Entropy-based distance measures for Categorical and Continuous Data in a Metric Space for Clustering
Area of Research :	Mathematical Sciences
Principal Investigator :	Dr. Sraban Kumar Mohanty, Pandit Dwarka Prasad Mishra Indian Institute Of Information Technology, Design & Manufacturing, Jabalpur, Madhya Pradesh
Contact info :	sraban@gmail.com
Timeline Start Year :	2023
Timeline End Year :	2026
Total Budget (INR):	6,60,000

Details

Executive Summary :

This proposal aims to develop (dis)similarity measures in a metric space for numerical, categorical, and mixed datasets, using information entropy to capture the disorder and ensemble properties of the data distribution along each feature. The salient feature of the proposal is that it captures the statistical significance of each attribute of the dataset from the number of possible microstates for that feature. Entropy would also be used to compute a weight for each attribute, reflecting the contribution of different features. The proposed measures would be free of user-defined parameters and independent of the distribution of the data points.

1. In general, the characteristic length of a system indicates its scale in the Euclidean feature space. The characteristic length of a feature measures the spread (inhomogeneity) of its all-pair differences. A large characteristic length indicates that the all-pair absolute differences are widely distributed, which makes it a good measure of the weight for that feature. On this basis, a weighted metric would be proposed for numerical data to improve the performance of clustering methods.
2. The characteristic length and Boltzmann entropy would be employed to capture the intra-attribute statistical information along features, in order to discover the significance of attributes for clustering categorical data.
3. Similarly, both intra-attribute and inter-attribute data distributions would be captured by entropy to devise a (dis)similarity measure for mixed datasets.
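To make the categorical case described above more concrete, the following is a minimal sketch and not the proposal's actual method, since the summary does not state its formulas. It assumes the Boltzmann entropy of a feature is ln W (with the Boltzmann constant set to 1), where W = N! / (n_1! n_2! ... n_k!) counts the arrangements of N records consistent with the observed category counts. It further assumes, purely for illustration, that feature weights are the normalized entropies and that the dissimilarity is a weighted mismatch (Hamming) distance. The function names are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def boltzmann_entropy(values):
    """Return ln W for one categorical feature (assumption: k_B = 1).

    W = N! / prod(n_c!) is the multinomial count of arrangements that
    reproduce the observed category counts n_c.
    """
    counts = Counter(values)
    n = len(values)
    # lgamma(m + 1) equals ln(m!), which avoids overflow for large N.
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts.values())


def entropy_weights(rows):
    """Illustrative weighting (an assumption): normalized per-feature entropy."""
    columns = list(zip(*rows))
    entropies = [boltzmann_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every feature is constant; fall back to uniform weights.
        return [1.0 / len(columns)] * len(columns)
    return [e / total for e in entropies]


def weighted_mismatch(x, y, weights):
    """Weighted Hamming dissimilarity: sum the weights of differing features."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Hypothetical toy data: feature 0 is constant, feature 1 is evenly split.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "y")]
weights = entropy_weights(rows)
distance = weighted_mismatch(rows[0], rows[2], weights)
```

With non-negative weights this dissimilarity is symmetric and satisfies the triangle inequality. If any weight is zero (as for the constant feature here), it is only a pseudometric, because two distinct records can be at distance zero.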

Organizations involved

Implementing Agency :	Pandit Dwarka Prasad Mishra Indian Institute Of Information Technology, Design & Manufacturing, Jabalpur, Madhya Pradesh
Funding Agency :	Anusandhan National Research Foundation (ANRF)/Science and Engineering Research Board (SERB)
Source:	Science and Engineering Research Board (SERB), DST 2022-23