Granules: Scalable, distributed stream processing



Overview

Granules supports the processing of data streams over a distributed collection of processing elements. Such streams can be generated in settings involving observational and monitoring equipment, simulations, and computational workflows. In Granules, these computations can be long-running, execute over multiple rounds, and retain state across successive rounds. Granules allows a collection of related computations to be expressed as directed graphs that can contain cycles, and it orchestrates the completion of such distributed processing. The processing encapsulated within these computations can be arbitrary and can be written in C, C++, C#, Java, R, or Python.
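To make the idea of stateful, multi-round computations in a cyclic graph concrete, here is a minimal single-process sketch in Python. It is not the Granules API: the class `Halve`, the function `run_graph`, the node names, and the threshold rule are all hypothetical and chosen only for illustration. A value loops between a "halve" node and a "gate" node until it drops to the threshold, and the "halve" node keeps a round counter as retained state.

```python
from collections import deque


class Halve:
    """Hypothetical computation that retains state (a round counter) across rounds."""

    def __init__(self):
        self.rounds = 0

    def execute(self, value):
        self.rounds += 1
        return value / 2


def run_graph(start_value, threshold=1.0):
    """Drive a tiny graph with a cycle: halve -> gate -> halve ... -> sink."""
    halve = Halve()
    queue = deque([("halve", start_value)])
    while queue:
        node, value = queue.popleft()
        if node == "halve":
            queue.append(("gate", halve.execute(value)))
        elif node == "gate":
            # The edge back to "halve" is what makes the graph cyclic.
            target = "halve" if value > threshold else "sink"
            queue.append((target, value))
        else:  # "sink": the graph has completed
            return value, halve.rounds


final_value, rounds_executed = run_graph(8.0)
```

In a real deployment the nodes would live on different machines and exchange values as streams; the queue here only stands in for that transport.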

There is no central component in Granules, and the system can scale out by assimilating one node at a time, thus harnessing new machines as they become available. The system can orchestrate such stream processing computations within traditional clusters, collections of desktops, or IaaS VM-based settings. To maximize resource utilization, Granules interleaves hundreds of computations on the same resource.
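The interleaving of many computations on one resource can be pictured with a small cooperative scheduler. The sketch below is an assumption-laden illustration in Python, not a description of how Granules schedules work internally: `computation`, `interleave`, and the round-robin policy are hypothetical. Each computation is a generator that yields after every round, so a single thread can rotate through 200 of them.

```python
from collections import deque


def computation(comp_id, rounds, log):
    """Hypothetical computation: records each round, then yields the resource."""
    for round_no in range(rounds):
        log.append((comp_id, round_no))
        yield


def interleave(computations):
    """Round-robin over the computations until all of them have finished."""
    ready = deque(computations)
    while ready:
        comp = ready.popleft()
        try:
            next(comp)          # run one round
            ready.append(comp)  # requeue for its next round
        except StopIteration:
            pass                # this computation has completed


log = []
interleave([computation(i, 3, log) for i in range(200)])
```

Because every computation gives up the resource after one round, no single long-running computation blocks the others.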

Granules manages the lifecycle and finite state machine associated with computations. Each computation specifies a scheduling strategy that determines when it is scheduled for execution: when data streams are available, at periodic intervals, a fixed number of times, or some combination thereof. Granules also incorporates support for variants of the MapReduce paradigm that make it amenable to scientific applications.
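One way to picture a combined scheduling strategy is as a small predicate over the three criteria named above. The following Python sketch is illustrative only: `SchedulingStrategy`, its fields, and `should_run` are hypothetical names, and the way the criteria are combined (the run limit always wins, otherwise either trigger suffices) is an assumption rather than documented Granules behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulingStrategy:
    """Hypothetical strategy combining data-driven, periodic, and count-based rules."""

    data_driven: bool = False
    period_ms: Optional[int] = None  # None means "not periodic"
    max_runs: Optional[int] = None   # None means "no limit on executions"

    def should_run(self, now_ms, last_run_ms, runs_so_far, data_available):
        # Assumed rule: a fixed execution count caps every other trigger.
        if self.max_runs is not None and runs_so_far >= self.max_runs:
            return False
        if self.data_driven and data_available:
            return True
        if self.period_ms is not None and now_ms - last_run_ms >= self.period_ms:
            return True
        return False


strategy = SchedulingStrategy(data_driven=True, period_ms=1000, max_runs=3)
runs_on_data = strategy.should_run(500, 0, 0, data_available=True)
runs_on_timer = strategy.should_run(1000, 0, 0, data_available=False)
runs_when_idle = strategy.should_run(500, 0, 0, data_available=False)
runs_past_limit = strategy.should_run(2000, 0, 3, data_available=True)
```

Here the computation runs when data arrives or when a second has elapsed, but never more than three times in total.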

When developing computations to process streams, developers are freed from coding for networking, disk I/O, and operation in a distributed environment. Granules abstracts away the complexities of such I/O and the vagaries of execution in distributed settings. This allows a domain scientist to focus on the problem at hand and not on the artifacts related to deployments in large-scale distributed systems.

A broad class of compute- and data-intensive applications can benefit from the capabilities available in Granules. Application domains in which Granules is currently deployed include brain-computer interfaces, epidemiological modeling, handwriting recognition, data clustering algorithms, and bioinformatics (mRNA sequencing).

Salient features in Granules include support for:
	[1]	Real-time processing of data streams over a distributed collection of processing elements
	[2]	Iterative, periodic, and data-driven computations
	[3]	Creation of cyclic execution graphs that span multiple machines and that can themselves be recursive, iterative, or periodic
	[4]	Interleaving hundreds of computations on a single resource
	[5]	Assimilating new machines as they become available
	[6]	Scientific extensions to the basic MapReduce framework