Cluster computing, with large data, for the classroom

This week’s Perspectives is a two-parter: an interview and companion screencast on the topic of cluster computing in the classroom. The interview is with Kyril Faenov, the General Manager of the Windows HPC (high performance computing) unit, and the screencast is with Rich Ciapala, a program manager for Microsoft HPC++ Labs.

The project demonstrated in the screencast, and discussed in the interview, is called CompFin Lab. It’s a system that lets professors, in turn, give their students the ability to run computationally expensive financial models on large quantities of data. From the student’s perspective, you go to a SharePoint server, select a computational model, pick a basket of stocks, and run the model. Behind the scenes the task is partitioned and sprayed across a cluster of computers, then the results are gathered and presented in an Excel spreadsheet.

From the professor’s point of view, some .NET programming is required. But a framework abstracts the mechanics of dealing with the cluster, so the professor can focus on the logic of the model itself.
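The CompFin Lab framework itself is .NET and its actual API isn’t shown here, but the partition/spray/gather pattern it abstracts can be sketched in a few lines of Python. Everything in this sketch is an assumption for illustration: the `run_model` placeholder, the stock basket, and a local thread pool standing in for cluster nodes.

```python
# A sketch of the partition/gather pattern a cluster framework
# abstracts away. A local thread pool stands in for the cluster;
# run_model is a hypothetical placeholder for an expensive
# financial model.
from concurrent.futures import ThreadPoolExecutor

def run_model(symbols):
    # Placeholder "model": score each symbol somehow.
    return {s: float(len(s)) for s in symbols}

def partition(items, n):
    # Split the basket into n roughly equal chunks, one per node.
    return [items[i::n] for i in range(n)]

def scatter_gather(symbols, nodes=4):
    results = {}
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        # Scatter: one chunk per worker. Gather: merge the pieces.
        for part in pool.map(run_model, partition(symbols, nodes)):
            results.update(part)
    return results

basket = ["MSFT", "GOOG", "AMZN", "YHOO"]
print(scatter_gather(basket))
```

The point of the abstraction is that the professor writes only `run_model`; the partitioning, dispatch, and gathering are the framework’s job.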

There are a couple of key points about the evolution of high-performance computing that I want to highlight here. First, there’s what Kyril calls “the gravitational pull of data.” Increasingly, people and organizations are building vast repositories of data that other people and organizations will want to analyze in computationally expensive ways. It’s great to have access to a compute cluster in the cloud that can do the heavy lifting, but when datasets get really big you get bottlenecked trying to send the data to where the code runs. At a certain point you’d rather send the code to where the data lives.
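A quick back-of-the-envelope run makes that pull concrete. The numbers below (a 1 TB dataset, a 100 Mbit/s link) are assumptions for illustration, not figures from the interview:

```python
# How long does it take just to ship the data to where the code runs?
# Both numbers are illustrative assumptions.
dataset_bytes = 10**12               # a 1 TB dataset
link_bits_per_sec = 100 * 10**6      # a 100 Mbit/s connection

transfer_secs = dataset_bytes * 8 / link_bits_per_sec
print(f"{transfer_secs / 3600:.1f} hours")  # prints "22.2 hours"
```

Nearly a full day before a single computation starts, which is why shipping the (much smaller) code to the data wins as datasets grow.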

A second and related point is that in our current model for large-scale cloud-based computing, there are only a handful of what I call intergalactic clusters — namely, those operated by Google, Yahoo, Amazon, and Microsoft. These are one-of-a-kind behemoths. You can’t replicate one of them locally and apply it to your terabytes of data. So as Kyril and his team build out their cloud-based HPC services, they’re working to ensure the services can be replicated locally.

Maybe the most optimal thing is for you to stand up a 1000-node cluster with each node having a terabyte of disk. We want to enable that. We want to be able to tell our customers: Here’s how we run these large-scale, data-driven HPC applications, and here’s how, within a day or two, you can stand up one of these yourself.

The idea is that if you build one of those for your own terabyte trove of astronomical or climatological data, you can run your own computations against that data, and you can also share that capability with other people and organizations who want to run their code against your data.


5 thoughts on “Cluster computing, with large data, for the classroom”

  1. Hey Jon, have been really enjoying this new podcast, but I have one minor technical criticism: you should make sure the mp3s that are enclosed in the rss feed have good metadata. Right now they have bad titles and no artist or album info. Until you wrote this post, I hadn’t even noticed that this most recent episode had gone out, since the only thing that showed up in my iTunes was a track with the title “hpc”. ID3 info is a great metadata opportunity (there are 60+ fields, so you can get really rich with it and include full show notes and anything else you can think of) and a vital one, since the mp3 very quickly moves beyond the original context in which it reaches the user. Obviously, I’m preaching to the converted here on this…

  2. “you should make sure the mp3s that are enclosed in the rss feed have good metadata.”

    Busted. Thanks for the reminder. I was being lazy about it because I’d started to think that most people probably do what I do, which is to screen stuff in the podcatcher where the RSS metadata provides the context. (Though I’ve just noticed there’s a bit more to add there as well).

    I wonder what percentage of folks never screen stuff in the podcatcher and rely mainly on the player? In any event, it’s nonzero, so I’ll update the existing files and put in title/date/description ID3 tags going forward.
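    (For the curious, the on-disk layout of those tags is simple enough to sketch. The frame IDs below are real ID3v2.3 ones, TIT2/TPE1/TALB for title/artist/album; the tag values and filenames are placeholders, and a real tagger handles the full frame set, encodings, and existing-tag replacement.)

```python
# Minimal sketch of building an ID3v2.3 tag with title/artist/album
# text frames, to be prepended to an MP3. Illustration only; the
# tag values and filenames are placeholders.
import struct

def text_frame(frame_id, text):
    # v2.3 text frame: 4-char id, 4-byte big-endian size, 2 flag
    # bytes, then an encoding byte (0x00 = Latin-1) plus the text.
    body = b"\x00" + text.encode("latin-1")
    return (frame_id.encode("ascii") + struct.pack(">I", len(body))
            + b"\x00\x00" + body)

def syncsafe(n):
    # The tag size is stored 7 bits per byte ("sync-safe").
    return bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F,
                  (n >> 7) & 0x7F, n & 0x7F])

def id3v2_tag(title, artist, album):
    frames = (text_frame("TIT2", title)     # title
              + text_frame("TPE1", artist)  # artist
              + text_frame("TALB", album))  # album
    # Header: "ID3", version 2.3, no flags, sync-safe body size.
    return b"ID3\x03\x00\x00" + syncsafe(len(frames)) + frames

tag = id3v2_tag("Cluster computing for the classroom",
                "Jon Udell", "Perspectives")
# Prepend to the audio, e.g.:
# with open("hpc.mp3", "rb") as f: audio = f.read()
# with open("hpc-tagged.mp3", "wb") as f: f.write(tag + audio)
```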

    Thanks again for the nudge.
