A particular HPC customer had very large (1-2 TB) datasets for their computational jobs. Traditional solutions using a monolithic supercomputer introduced a single point of failure. Distributed batch solutions introduced unacceptable startup delays while the dataset was copied over the network to the target system. Network-attached storage avoided the startup delays, but reduced the effective speed of the application itself.
Instead of using local storage or a NAS, LTI developed a custom solution to track and manage user data on SAN-attached storage. Because the data resides on a SAN, speeds comparable to local storage can be achieved. Additionally, because LTI-developed software manages the data, each dataset is automatically attached to whichever batch node is assigned to work on it.
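The attach step described above can be sketched as a small dataset registry: a table mapping dataset names to SAN block devices, from which a batch prolog derives the mount command for the compute node. This is a minimal illustrative sketch, not the actual LTI implementation; the dataset names, device paths, and mount options are all assumptions.

```python
# Hypothetical sketch of software-managed SAN dataset attachment.
# All names and paths below are illustrative assumptions, not the
# customer's real configuration.

# Registry mapping dataset names to SAN device paths (illustrative).
DATASET_TABLE = {
    "run_01": "/dev/mapper/san_lun_07",
    "run_02": "/dev/mapper/san_lun_12",
}

def attach_plan(dataset, mountroot="/scratch"):
    """Return the mount command a batch prolog would run to attach
    the dataset's SAN device to the assigned compute node."""
    device = DATASET_TABLE[dataset]  # raises KeyError for unknown datasets
    return f"mount -o ro {device} {mountroot}/{dataset}"

print(attach_plan("run_01"))
```

Because the block device is already on the SAN fabric, the mount is nearly instantaneous: no bulk copy precedes job start, which is the property that removed the startup delay described above.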
This solution combines the skills of team members with experience in Linux, Perl programming, batch systems, and storage into a single, unified application interface.
By replacing a large SMP computer with a cluster of smaller systems, LTI was able to reduce total system downtime by 66% while tripling the number of simultaneous datasets available for user jobs. Overall time to completion of a batch run was reduced by 2%, delivering an additional 800 hours of effective processing time per year. Lastly, data management tasks have been simplified, allowing the customer to delegate this work to less expensive staff.