Lambda Joins
Database joins are one of the key technologies that make data processing practical. As more data is
distributed over the Internet, the ability to join data held at two sites in different parts of the world is
becoming critical. There are two fundamental problems: finding efficient protocols to move data over
long distances, and finding efficient algorithms to merge two data streams.
For a demo at SC 2002, Project DataSpace, in collaboration with researchers from Chicago, Ottawa
and Amsterdam, moved a stream of data from a computer cluster at SARA in Amsterdam to a three-node
computer cluster at StarLight in Chicago at over 2.8 Gbps. At the same time, a second stream of data
was moved from a cluster at CANARIE in Ottawa to the same cluster at StarLight in Chicago at over 2 Gbps.
The two streams of data were then merged on the StarLight cluster at over 500 Mbps per node.
Both the algorithm for joining the lambda streams and SABUL, the high-performance data transport
protocol used to stream the data, were developed by the NCDM/Laboratory for Advanced Computing at the
University of Illinois at Chicago.
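To put the reported rates side by side, the short Python sketch below adds up the figures quoted above. The constant and function names are illustrative, the rates are the lower bounds stated in the text ("over 2.8 Gbps", "over 2 Gbps", "over 500 Mbps per node"), and the sketch assumes that all three StarLight nodes took part in the merge, which the text does not state explicitly.

```python
# Lower bounds reported for the SC 2002 demo, in Mbps (integers avoid rounding noise).
AMSTERDAM_TO_CHICAGO_MBPS = 2800   # SARA -> StarLight, "over 2.8 Gbps"
OTTAWA_TO_CHICAGO_MBPS = 2000      # CANARIE -> StarLight, "over 2 Gbps"
MERGE_PER_NODE_MBPS = 500          # "over 500 Mbps per node"
STARLIGHT_NODES = 3                # "three-node computer cluster"


def aggregate_inbound_mbps() -> int:
    """Combined rate of the two streams arriving at StarLight (lower bound)."""
    return AMSTERDAM_TO_CHICAGO_MBPS + OTTAWA_TO_CHICAGO_MBPS


def aggregate_merge_mbps(nodes: int = STARLIGHT_NODES) -> int:
    """Cluster-wide merge rate, ASSUMING every node merges at the per-node rate."""
    return MERGE_PER_NODE_MBPS * nodes


if __name__ == "__main__":
    print(f"inbound: over {aggregate_inbound_mbps() / 1000} Gbps")
    print(f"merge:   over {aggregate_merge_mbps() / 1000} Gbps across the cluster")
```

Because every figure is a lower bound, the results are lower bounds as well and say nothing about the peak rates reached during the demo.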
Many network engineers use lambda and lightpath interchangeably to describe a low-layer,
end-to-end dedicated communications channel of effectively guaranteed bandwidth. Using protocols such
as SABUL, it is now possible to use lambdas to move large datasets over long distances as fast as the
data can be pulled from disk. Using lambda joins, it is now possible to merge two such streams and
look for patterns in data, even if the data is distributed worldwide.
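The text does not describe the join algorithm used in the demo. As a rough illustration of what merging two streams can look like, the Python sketch below performs a generic sort-merge inner join. It assumes that each stream yields (key, payload) records already sorted by key, and the names merge_join and _next_group are hypothetical. It is not the NCDM algorithm and does not involve SABUL or any network transport.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Iterable, Iterator, Tuple

_DONE = object()  # sentinel marking the end of a stream


def _next_group(groups):
    """Return (key, [payloads]) for the next run of equal keys, or _DONE."""
    item = next(groups, _DONE)
    if item is _DONE:
        return _DONE
    key, records = item
    # Materialize the run now: groupby invalidates it once the parent advances.
    return key, [payload for _, payload in records]


def merge_join(
    left: Iterable[Tuple[Any, Any]], right: Iterable[Tuple[Any, Any]]
) -> Iterator[Tuple[Any, Any, Any]]:
    """Inner-join two key-sorted streams, yielding (key, left_payload, right_payload)."""
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    l, r = _next_group(lgroups), _next_group(rgroups)
    while l is not _DONE and r is not _DONE:
        if l[0] < r[0]:
            l = _next_group(lgroups)      # left key has no match yet; advance left
        elif l[0] > r[0]:
            r = _next_group(rgroups)      # right key has no match yet; advance right
        else:
            for lp in l[1]:               # emit every pairing for the shared key
                for rp in r[1]:
                    yield l[0], lp, rp
            l, r = _next_group(lgroups), _next_group(rgroups)


if __name__ == "__main__":
    amsterdam = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]   # made-up sample records
    ottawa = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
    for row in merge_join(amsterdam, ottawa):
        print(row)
```

A merge join of this kind reads each stream once and buffers only the records for the current key, which is why it suits data arriving as a stream. Unsorted streams would need a different approach, such as a hash-based join.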
Project DataSpace won the SC 2002 High Performance Bandwidth Challenge Award for Innovative, High
Speed, Data Correlation -- Best Use of Emerging Infrastructure.
Collaborators
University of Illinois at Chicago, USA; CANARIE, Canada; SARA, the Netherlands
Contact
Robert Grossman
National Center for Data Mining (NCDM)
grossman@uic.edu
http://www.dataspaceweb.net