Citigroup Moves to a Large-Scale Grid for High-Performance Computing

Citigroup is in the middle of a major shift in its approach to high-performance computing, moving from clustered servers to a large-scale grid, according to John van Uden, the bank's SVP of capital markets and banking technology. "We have about 10,000 CPUs globally in individual clusters, all doing high-performance computing," he says. "We drew the line two years ago and said we weren't going to continue that route."

The trouble with clusters, van Uden says, is the lack of reuse. "If you imagine 10,000 CPUs and more than 50 data centers at Citigroup, they're not all accessible to everyone," he explains. "You have a server/hub mentality because those clusters are independently owned." The grid project is driven by the need to reduce costs while still performing the same calculations, van Uden adds.

Citigroup built its first grid -- based on software from Platform Computing (Markham, Ontario) and HP servers -- two years ago to handle value-at-risk (VaR) back-testing, according to van Uden. The project was deemed successful and expanded to encompass other applications in the capital markets group, primarily for evaluating the risk of complex products such as collateralized debt obligations, he adds. So far, Citigroup has 11 projects "in flight" on the grid at various stages, from proofs of concept to 24/7 production use, van Uden relates. While 2,000 cores operate on the grid today, Citi hopes to have 7,000 CPUs running on it by the end of 2007, he notes. The grid has two main sites, in Texas and London.

Moving from distributed to shared computing has been a learning experience for Citigroup's IT and business users. "The emotional and political angles of trying to do this far outweigh the technical impact of trying to do it," van Uden notes. "Pitching 1,000 to 4,000 boxes together is not difficult, as long as you have the space and the power. But trying to get a group of 25 applications to stop server hogging is demanding."

On the positive side, with monitoring tools in place, van Uden and Andrew Dolan, head of grid computing, find the grid is actually easier to manage than the clusters. "The difficult part is getting the tools in place -- scaling the existing tools from what we used to use to manage our server estate to manage 500 servers at one time," says Dolan. "But now that we have those tools in place, we'll forget we're managing 500 servers. It should be like we're managing two -- a grid infrastructure and the actual compute nodes." --P.C.