A very interesting article on the difficulties of managing a large-scale supercomputer system. Ironically, it leads me to believe that they do not in fact belong in the office, period.
I don't believe today's technology is ready to support supercomputers in the office. Maybe in 20 to 50 more years, with more processor shrinkage, better efficiency, or breakthroughs in computing power. Then I believe supercomputer power will be truly portable, and no, your SLI or Crossfire high-end video system doesn't count.
Decent approach... what he misses is that the biggest challenge facing supercomputing is system interconnects. InfiniBand is the flavor of the month, and it is still way behind where we should be in interconnecting supercomputers. Low-latency, high-performance interconnects are the bottleneck these days. On the Linux thing... Linux clusters perform great when they stay up (dirty little secret) and when they're given a single task like Linpack (for a Top500 submission). Give one a truly diverse workload of many users with varying needs and see how well it performs. Lastly, MPI is the Message Passing Interface, an API that lets the nodes of a parallel system communicate.
On the office thing, you might want to replace your building's furnace with the output of your supercomputing complex. The heat is awe-inspiring.
In what way is this proposed 'Office Supercomputer' cluster unlike a network of office desktops - a distributed processing system?
It seems one solution would be to have a sort of 'master computer' coordinating the work of any available processor/system on the network. I seem to recall that, once upon a time, some movie production company bought high-end workstations for their office workers and, after hours, used that network to crunch CGI for their movies.
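A rough sketch of that 'master computer' idea, for what it's worth. This is only an illustration under loud assumptions: a local thread pool stands in for the office machines, and the names `split`, `worker_task`, and `master` are made up for the example. A real setup would ship the chunks over the network instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, chunk_size):
    """Cut the full workload into independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def worker_task(chunk):
    """Stand-in for the real computation one workstation would do."""
    return sum(x * x for x in chunk)


def master(items, n_workers=4, chunk_size=1000):
    """Hand chunks to whichever worker is free, then combine the partial results."""
    chunks = split(items, chunk_size)
    # Assumption: threads on one machine play the role of networked workstations.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_task, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(master(list(range(10)), n_workers=2, chunk_size=3))
```

The important property is that each chunk can be computed without talking to the others, which is what makes this kind of workload (rendering frames, folding work units) a good fit for a loosely coupled office network.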
I realize that a typical Cat-5 network probably couldn't handle true 'supercomputer' throughput - mainly because the definition of 'supercomputer' has changed as technology has progressed. The quad-core uber-gaming system currently available from Dell, a small box that sits on your desktop, has more power than many early 'supercomputers'. However, the technology exists to run a fiber-optic network - which ought to be able to handle massive bandwidth.
So, I wonder, why not use the experience from the various 'folding' and other distributed processing ventures to redefine the idea of 'supercomputer' again? Not as a single monstrous processing unit, not as a dedicated array of specialized processor-units, but as a network of general-purpose processors that all work together. The resulting super-computer-network would be eminently scalable. After all, not every office needs the same level of 'super' in their computing.
Even more interesting, what if any one of the computers in the network, when faced with a difficult task, could automatically borrow extra clock-cycles from the rest of the network? Suddenly, you have a sort of supercomputer-on-demand - because most offices don't need a supercomputer all the time.
Anyway, just some random thoughts from a relative newbie.
This is following in Fireheart's train of thought. Could it be possible in an office environment full of multicore workstations to allocate an entire processor from each system and then have a server set up just for management and distribution? Maybe using something like a Beowulf cluster? The way it works in my head (with my limited knowledge) is that each workstation would not suffer much heat increase and only marginal power consumption increases.
In an office with 100+ workstations, the processing power of just one core from each would provide a considerable amount of computational power. Probably not on par with normal supercomputers, but maybe a workable solution for smaller demands.
Anyhow, just my two cents. Probably already been tried and failed.
Supercomputing platforms leverage network interconnects of performance never seen or imagined on CAT5. The links on IBM's Federation Switch, used for their supercomputing interconnects, run at 2 GB/sec each and are configured in pairs for a total of 4 GBytes (not Gbits) per second of bandwidth, fully non-blocking to all other nodes in the cluster. You can argue that 10Gb Ethernet can handle that if you use multiples. But there is so much more to the equation: latency. These interconnects are very low latency. Ethernet latency is well above 40 microseconds where Federation runs at between 7 and 14 microseconds. MPI, which is used to glue these separate systems together, requires these low-latency connections. Using something like Ethernet moves you out of high performance and closer to the "SETI at Home" crowd (not really, just making a point that Ethernet isn't remotely in the same class of interconnect).
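To put numbers on the "use multiples" argument, here is a back-of-the-envelope calculation. It assumes raw line rate with no protocol overhead (real Ethernet throughput would be lower), and the function name `links_needed` is just for illustration.

```python
import math


def links_needed(target_gbytes_per_s, link_gbits_per_s):
    """Links required to match a target bandwidth, at raw line rate (no overhead)."""
    link_gbytes_per_s = link_gbits_per_s / 8  # 8 bits per byte
    return math.ceil(target_gbytes_per_s / link_gbytes_per_s)


federation_pair = 2 * 2.0  # two 2 GB/s links, as quoted above

ten_gig_links = links_needed(federation_pair, 10)  # 10Gb Ethernet
one_gig_links = links_needed(federation_pair, 1)   # Gigabit Ethernet

# Latency gap using the figures quoted above (40 us vs 7 to 14 us).
latency_ratio_best = 40 / 7
latency_ratio_worst = 40 / 14

if __name__ == "__main__":
    print(ten_gig_links, one_gig_links)
    print(round(latency_ratio_worst, 1), round(latency_ratio_best, 1))
```

So bandwidth can be matched on paper by bonding a handful of 10Gb links per node, but no amount of bonding closes the latency gap, which is the point of the post.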
I think the comms over CAT5 may be OK, but CAT5E, which works at Gigabit speeds, would be better. I've worked on many telecom systems that use CAT5 Ethernet as the interconnect between the clusters. Some present-day telecoms equipment uses Gigabit, with the odd 10 Gig link.
One of the problems is co-ordinating all the processing. For this you could use a server or a very high-end workstation with 2 or more 10 Gig connections, multiple cores, and a large amount of memory (16 gig for starters, but I'm guessing). The 10 Gig links could use link aggregation for resilience.
The connections to each core could be over multiple VLANs, with separate VLANs for cluster co-ordination so control messages can get past loads of data waiting to be processed. There would need to be 2 of each type of VLAN, each going over a separate link from the switch to the controlling cluster. You could use HSRP or BFD to switch over to the VLANs on the other 10 Gig link when one of the links to the controlling cluster goes down. I would not worry about dual connections to each worker cluster. If you have many worker clusters, losing one of them is not a problem. Losing half of them because one of the links from the controlling cluster goes down is a problem.
Multi-threading and inter-process communication are already available under Linux as far as I know. One of the systems I worked on before used an OS that was POSIX compliant. POSIX sets the API for multi-threading, and for inter-process communication using message queues. On the project I worked on, message queues were much more efficient than stream-based communication. This may not be the case on some systems, however. This is relatively easy to test, so test it and go with the one that works best. The software to make this communication work over Ethernet has been written many times and may already exist in the public domain. Maybe you could use SCTP for this.
For the worker clusters you could use the native Linux if it is on the machine, or you could use VMware and install Linux as another OS. The worker clusters could then all use Linux. The worker clusters would need to check whether someone is using the machine and, if so, limit the CPU time they use, keeping it to 50% max on a dual-core machine. The amount of memory used would also need to be limited, so that it does not affect the user of the machine too much. As the worker clusters will not be using any GUIs, I would hope the memory requirements would not be too high.
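The "back off when someone is at the keyboard" rule could look something like the sketch below. It is only a sketch: detecting user activity is platform-specific, so it is passed in as a plain boolean, and the duty-cycle approach (work a bit, sleep a bit) is my assumption about how you might enforce the cap, not the only way. The function names are made up.

```python
def allowed_cpu_fraction(user_active, active_cap=0.5, idle_cap=1.0):
    """CPU share the worker may use: capped at 50% while a user is on the machine."""
    return active_cap if user_active else idle_cap


def sleep_after_work(work_seconds, cpu_fraction):
    """Duty-cycle throttle: how long to sleep after a work burst so that
    average CPU use comes out at roughly cpu_fraction."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return work_seconds * (1 - cpu_fraction) / cpu_fraction


if __name__ == "__main__":
    # User present: after 0.1 s of work, rest 0.1 s (about 50% CPU).
    print(sleep_after_work(0.1, allowed_cpu_fraction(True)))
    # Machine idle: no rest needed.
    print(sleep_after_work(0.1, allowed_cpu_fraction(False)))
```

A worker loop would call `time.sleep()` with that value between bursts. On Linux you could also let the OS enforce the limit (nice levels or cgroups) rather than doing it in the worker itself.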
If the worker clusters needed disk storage, then this opens up privacy problems, and if possible this would be best avoided. Maybe an area to store temp data would be possible, but some extra memory on the worker clusters may provide a solution to this.
The controlling cluster would have to try to determine which jobs are going to use a set of processes and try to make sure that each such job had all its processes running on one worker cluster. This would cut down on traffic over the net, and inter-process communication between local processes is much quicker. This would be a difficult task, and maybe the way to approach it would be to split large jobs up into smaller, self-contained jobs.
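One simple way to approach the placement problem is a greedy best-fit: biggest jobs first, each put whole onto the worker with the least spare capacity that still fits it. This is a minimal sketch with hypothetical job and worker names, and it assumes the process count of each job is known up front, which is the hard part in practice.

```python
def place_jobs(jobs, workers):
    """Keep every job's processes together on one worker.

    jobs:    dict of job name -> number of processes
    workers: dict of worker name -> free process slots
    Returns (placement, unplaced): job -> worker, plus jobs that fit nowhere.
    """
    free = dict(workers)  # copy so the caller's dict is untouched
    placement = {}
    unplaced = []
    # Largest jobs first, since they have the fewest places to go.
    for job, procs in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [w for w, slots in free.items() if slots >= procs]
        if not candidates:
            unplaced.append(job)  # candidate for splitting into smaller jobs
            continue
        best = min(candidates, key=lambda w: free[w])  # tightest fit
        placement[job] = best
        free[best] -= procs
    return placement, unplaced


if __name__ == "__main__":
    jobs = {"render": 4, "sim": 2, "report": 1, "huge": 8}
    workers = {"ws1": 4, "ws2": 2, "ws3": 2}
    print(place_jobs(jobs, workers))
```

Anything that lands in `unplaced` is too big for any single worker, which is where splitting large jobs into smaller self-contained ones would come in.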
This leaves a machine to control access and maintain user sessions. There is not much point going to this amount of trouble and having a "mini super" computer that no one can use. This machine would have to do a basic analysis of each job, so that jobs with obvious errors are rejected at this stage, not after they have used loads of CPU and network bandwidth. This machine may provide a GUI front end to the "mini super" computer. This may need a dedicated high-spec machine similar to the controlling cluster, or maybe it could run on the controlling cluster. If it ran on the controlling cluster, I would use the QoS settings to mark the VLANs used for control as the highest priority, the VLANs carrying data to the worker clusters as the next highest, and the VLANs for user sessions as the lowest priority.
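To show what that three-level priority ordering means in practice, here is a toy strict-priority queue. Real QoS would be configured on the switches, not in application code, so treat this purely as an illustration of the ordering; the class name and traffic labels are invented for the example.

```python
import heapq
import itertools

# Lower number = served first, matching the ordering described above.
PRIORITY = {"control": 0, "data": 1, "session": 2}


class PriorityDispatcher:
    """Strict-priority queue: control beats data, data beats user sessions."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a class

    def submit(self, traffic_class, message):
        entry = (PRIORITY[traffic_class], next(self._counter), message)
        heapq.heappush(self._heap, entry)

    def next_message(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    q = PriorityDispatcher()
    q.submit("session", "login")
    q.submit("data", "chunk-1")
    q.submit("control", "heartbeat")
    q.submit("data", "chunk-2")
    while len(q):
        print(q.next_message())
```

One caveat with strict priority: if control and data traffic never let up, user sessions can starve, so a real configuration would probably reserve a minimum share for the lowest class.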
On the office thing, you might want to replace your building's furnace with the output of your supercomputing complex. The heat is awe-inspiring.
The University of Minnesota's Supercomputing Institute used to heat their building with their Cray Supercomputers. This was enough to keep the building warm even in the sub-zero weather of a Minnesota winter.
Supercomputing platforms leverage network interconnects of performance never seen or imagined on CAT5. ... Ethernet isn't remotely in the same class of interconnect.
Thanks, RISCguru, I didn't really expect CAT5 Ethernet to be up to the job of handling super-computing traffic and your information confirms it. In fact, I didn't expect any of the currently common infrastructure would serve the needs of this sort of super-computing scheme. Although, 1G Ethernet could probably handle a lower intensity of 'super'... After all, how 'super' does this 'office-level' supercomputer need to be? Consider scalability.
I had envisioned a Fiber-optic LAN with optimized switches and certainly a completely different protocol from TCP/IP. The super-computing traffic would need to run on a streamlined model - there's no time to go through the OSI layers. My understanding was that Fiber could handle multiple streams of data on different bandwidths, so there's no reason why the same LAN couldn't carry Ethernet connectivity as well.
So I'm thinking in terms of a step or two up from current infrastructure equipment and a more industry-wide approach to supercomputer design and creation. Not just contributions from processor and software engineers, but network communication engineers, as well.
Lokigreybush's concept of abstracting and dedicating a whole core from each workstation isn't so bad, but I was thinking of a more flexible model. I thought that every computer in the super-computing-network would contribute according to availability. Sure, it may be a Communistic system, but if "Joe in Accounting" is taking a long coffee-break, his computer can still contribute to the processing power of the company net. After all, the 'drones' down in the R&D lab need all of the clock-cycles they can get!
Err, and while I might wonder at the value of "SETI At Home", that model of distributed computing is the basic foundation for my idea - Actually, I was thinking of Folding@Home. Same concept, but technologically enhanced and concentrated.
The main 'problem' that I see with this idea is in the distributed nature of the system. Instead of a single 'hardened' power, cooling, communication and control system supplying a central unit, these services need to be supplied to every unit in the network. Of course, that need exists anyway... a super-computing-network just needs a bit more of everything.
Anyway, the central idea here is that the processors are already there, on everyone's desktop. The controlling server may already be there as well and, if not, it shouldn't be that expensive. The network itself is already there, in concept. So, all that's missing is the controlling software - and for higher levels of 'super', the more powerful network infrastructure.
Of course, all of this ignores the question of whether current desktop systems can work at super-computing speed. Can the PCI bus handle the needed bandwidth at all? Or does super-computing require the sort of processor-to-processor connection provided by the System bus? Is even the System bus of the typical desktop machine robust enough to handle 'super' throughput?
What is the 'minimum hardware configuration' that can handle 'super' level computing? A 'server-farm'? A 'blade-server'? Is a giant, industrial cabinet the size of a refrigerator and stuffed with a million dollars worth of circuitry really necessary to be considered 'super'?
Thanks, RISCguru, I didn't really expect CAT5 Ethernet to be up to the job of handling super-computing traffic and your information confirms it. In fact, I didn't expect any of the currently common infrastructure would serve the needs of this sort of super-computing scheme. Although, 1G Ethernet could probably handle a lower intensity of 'super'... After all, how 'super' does this 'office-level' supercomputer need to be? Consider scalability.
I had envisioned a Fiber-optic LAN with optimized switches and certainly a completely different protocol from TCP/IP. The super-computing traffic would need to run on a streamlined model - there's no time to go through the OSI layers. My understanding was that Fiber could handle multiple streams of data on different bandwidths, so there's no reason why the same LAN couldn't carry Ethernet connectivity as well.
What business applications would need the processing power of a supercomputer?
Maybe for experimental engineering firms or research and development labs, maybe for trying to find a cure for cancer or something, but not for spreadsheets and word processing.
Distributed computing, using the existing network infrastructure, would be far more efficient. Taking advantage of idle computers already connected to the network is what I would consider to be a better plan.
On the office thing, you might want to replace your building's furnace with the output of your supercomputing complex. The heat is awe-inspiring.
The University of Minnesota's Supercomputing Institute used to heat their building with their Cray Supercomputers. This was enough to keep the building warm even in the sub-zero weather of a Minnesota winter.
While I'm not certain, I'm doubtful the whole building was heated by the Cray. They certainly made heat, but not that much. Modern systems are _much_ hotter.
P.S. I know, I am in charge of the systems at the University of Minnesota's Supercomputing Institute.
Ethernet latency is well above 40 microseconds where Federation runs at between 7 and 14 microseconds.
The latencies on modern interconnects are even lower, on the order of 1 microsecond. But, those numbers are meaningless by themselves as they are always measured with a zero byte payload, simple ping-pong. What matters is how the latency scales with the data payload and multiple messages moving across the fabric.
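A simple way to see why the zero-byte number is not the whole story is the usual first-order model: time = latency + payload / bandwidth. The two fabrics below are made-up figures purely for illustration (not measurements of any real product), and the model ignores contention from multiple messages on the fabric, which matters just as much.

```python
def transfer_time_us(payload_bytes, latency_us, bandwidth_gbytes_per_s):
    """First-order message cost in microseconds: fixed latency plus wire time.

    1 GB/s is 1000 bytes per microsecond, hence the factor below.
    """
    bytes_per_us = bandwidth_gbytes_per_s * 1000.0
    return latency_us + payload_bytes / bytes_per_us


# Hypothetical fabrics: A wins the ping-pong test, B has the fatter pipe.
FABRIC_A = {"latency_us": 1.0, "bandwidth_gbytes_per_s": 1.0}
FABRIC_B = {"latency_us": 5.0, "bandwidth_gbytes_per_s": 4.0}

if __name__ == "__main__":
    for size in (0, 1_000, 10_000, 1_000_000):
        a = transfer_time_us(size, **FABRIC_A)
        b = transfer_time_us(size, **FABRIC_B)
        print(f"{size:>9} bytes  A: {a:8.1f} us  B: {b:8.1f} us")
```

With these invented numbers, fabric A looks five times better on the zero-byte benchmark, yet fabric B is faster for anything much over about 5 KB. That is why a single latency figure says little about how an application with real message sizes will behave.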
I would still love to be a labrat in a supercomputer environment. Just maintaining that thing would be awesome!!!
It's not as glamorous as it sounds.
But if you want a job, you need a deep understanding of Linux, networking, specialized libraries, queuing systems, schedulers, maintaining environment variables, LDAP, system provisioning, log management, and a whole lot more.