SC²S Colloquium - November 18, 2015

From Sccswiki
Revision as of 15:41, 26 October 2015 by Bakhtiar (talk | contribs)
Jump to navigation Jump to search
Date: November 18, 2015
Room: 02.07.023
Time: 3:00 pm, s.t.

Denys Sobchyshak: Online Graph Clustering with Sparse Grids Density Estimation

During the past century fast-paced developments in technology have substantially increased information flow and the need of its understanding. Within such data centric environment an ability to make quick strategic decisions based on generalized understanding of immense amounts of rapidly changing data became of great importance and a major deciding factor between success and missed opportunity. Today such problems are tackled by using efficient data mining techniques, which allow an extraction of useful information out of cluttered and not seldom obfuscated data sets and place it into clusters or densities of certain shape. While the task itself is not novel, methods and approaches of solving it have gone a long way and with recent developments have opened new opportunities. We want to focus on performing graph stream clustering in mini-batches of constantly flowing data, with a goal of capturing topological information and adapting to its change, as more recent data takes over older one, known as concept drift. We propose to consider dimensionality reduction techniques as means to transform graph topological information into its feature space, which can be done by applying certain weighting scheme based on dissimilarities between graph nodes and its corresponding Laplacian matrix. Furthermore, we utilize density estimation to identify regions of high and low node coupling, which subsequently serve to perform clustering. While one of the most popular ways to tackle unsupervised learning problem of density estimation is by applying kernel methods, we argue that, because their evaluation depends on all of the data points at any given time, sparse grids approach stands as a better option, since not only it depends on grid points and thus can be used at any point of stream processing, it also poses an efficient way of decreasing the grid size, requiring fewer points stored to perform estimation. We demonstrate how our algorithm works on real-world data and discuss some of the unexpected findings, when performing clustering on a feature space of high-dimensionality.