Authors: Xinyu Chen and Trilce Estrada
Affiliation: University of New Mexico, United States
Keywords: Scalable Clustering, Privacy Preserving, Big Data.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Big Data; Business Analytics; Data Analytics; Data Engineering; Data Management and Quality; Data Management for Analytics; Data Structures and Data Management Algorithms; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract:
Clustering high-dimensional data is often a crucial step in many applications. However, the so-called "curse of dimensionality" is a challenge for most clustering algorithms. In high-dimensional spaces, distances between points become less meaningful and the data become sparse. This sparsity means that more data points are needed to characterize similarities, and therefore more distance comparisons must be computed. Many approaches have been proposed for dimensionality reduction, such as subspace clustering, random projection clustering, and feature selection techniques. However, such approaches become infeasible when data is geographically distributed or cannot be shared openly across sites. To address these location and privacy issues and to mitigate the cost of distance computation, we propose an index-based clustering algorithm that generates a spatial key for each data point across all dimensions without requiring explicit knowledge of the other data points. It then performs a conceptual Map-Reduce procedure in the index space to form the final clustering assignment. Our results show that the algorithm is linear and can be parallelized and executed independently across points and dimensions. We present a Numba implementation and a preliminary study of the algorithm's capabilities and limitations.
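The abstract does not specify how the spatial key is built or how the Map-Reduce step merges keys, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes that each dimension is quantized into a fixed number of equal-width cells over bounds agreed on in advance, and that points sharing a key form one group. The names `spatial_key`, `cluster_by_key`, `lo`, `hi`, and `bins` are hypothetical.

```python
from collections import defaultdict


def spatial_key(point, lo, hi, bins):
    """Map one point to a tuple of per-dimension cell indices.

    Assumption: equal-width cells over agreed bounds [lo, hi]. Only the
    point itself is needed, so no other data points are inspected.
    """
    key = []
    for x, a, b in zip(point, lo, hi):
        cell = int((x - a) / (b - a) * bins)
        key.append(min(max(cell, 0), bins - 1))  # clamp to a valid cell
    return tuple(key)


def cluster_by_key(points, lo, hi, bins):
    """Map: compute a key per point. Reduce: group point indices by key."""
    groups = defaultdict(list)
    for i, p in enumerate(points):
        groups[spatial_key(p, lo, hi, bins)].append(i)
    return dict(groups)


# Toy data: two tight groups in the unit square (illustrative values only).
points = [(0.10, 0.20), (0.15, 0.22), (0.90, 0.80), (0.88, 0.85)]
clusters = cluster_by_key(points, lo=(0.0, 0.0), hi=(1.0, 1.0), bins=4)
print(clusters)
```

Each key depends only on its own point and the shared bounds, which is why the mapping step involves no pairwise distance computation and can run independently per point and per dimension. This simplified sketch does not merge neighboring cells, so a cluster that straddles a cell boundary would be split.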