SC²S Colloquium - June 24, 2015

From Sccswiki
Jump to navigation Jump to search
Date: June 24, 2015
Room: 02.07.023
Time: 3:00 pm, s.t.

Denys Sobchyshak: Mini-batch network clustering via sparse grids density estimation

During the past century fast-paced developments in technology have substantially increased information flow and the need of its un- derstanding. Within such data centric environment an ability to make quick strategic decisions based on generalized understanding of im- mense amount of rapidly changing data became of great importance and a major deciding factor between success and missed opportunity. One of most optimal ways to tackle such problems today is by using data analysis methods in conjunction with their efficient parallel im- plementations which allow an extraction of useful information out of cluttered and not seldom obfuscated data sets in the smallest time possible. The task itself is not novel, however, methods and approaches of efficiently tackling it have gone a long way and with recent devel- opments in graph clustering have opened new opportunities. In this thesis we want to focus on performing network (a.k.a big sized graph) clustering in mini-batches of constantly flowing data. The goal is to develop the method and then show its performance against data sets with known clustering as well as entirely new ones. Presented approach is not completely novel but is rather a combi- nation of recent developments. Most of network clustering algorithms focus on processing data in one go whether in parallel or using dis- tributed systems. While such approaches might work well in a general case, utilizing knowledge about the domain can significantly reduce computational resources required to perform the processing. We assume that the data sets used against our method are actually com- posed of small to big communities within the network and propose to utilize mini-batch approach.

One of the most common ways to tackle unsupervised learning problems of discovering differently sized clusters is via density estima- tion, which is construction of an approximation p of density function p based on provided data. Kernel density estimation has grown to be most popular one in a non-parametric setting, however it faces problems when dealing with huge data sets (a.k.a. big data). Partic- ularly, evaluation of the approximation requires evaluating all kernels centered at all data points, which is rather costly. Thus, we suggest utilizing a sparse grids based density estimator. As was already mentioned, it is assumed that knowledge discov- ery is performed against graph based data, thus to perform density estimation on it one requires certain mapping from weighted graph G = (V, E, f ) comprising a set V of vertices, E of edges and a func- tion f , a mapping of edges to their weights, onto vectorial data. There have been several developments in the area but we will restrict ourselves to a multidimentional scaling (MDS) ap- proach, which transforms structural data in a form of graph Laplacian to an N dimensional space while preserving vertex distances and their correlations. When making a decision one is not seldom interested in both present and past state of things, which, adapted to our context, means that to judge upon current clustering of data one might be interested in its evolution process and how the data changed to give rise to cur- rent clustering. In this master thesis we do not focus on extending original algorithm with means to persist or visualize the history of clustering evolution, however, the topic in itself is rather interesting and is encouraged to be considered for further research.