Source: The Platform | Nicole Hemsoth | October 19, 2015

While much of the attention around the new crop of supercomputers tends to focus on the hardware story, which is difficult to downplay given the high performance and densities expected as soon as early next year, the lesser-told story (in part, perhaps, because it is application specific) may be far more important.

It’s about the codes set to run on these machines, which, as many are already aware, will leave a lot of that sexy hardware performance on the table if they are not brought up to date. All the high core and thread counts and memory increases are useless without applications designed to scale, after all. Accordingly, even though it doesn’t make the news as often as the latest specs for the next generation of massive-scale systems, there is a great deal of momentum at the centers where these next big supers will reside.

Because many legacy codes have not yet been modernized to exploit thread-level parallelism and to work around memory access challenges, this is a long process, but it is one that is already in full swing at centers like Oak Ridge National Laboratory (where teams are preparing for the Summit supercomputer) and, more recently, at Sandia National Laboratories, which has several codes it will want to push to the approximately 40-petaflops Intel Knights Landing-based “Trinity” supercomputer, to be housed at Los Alamos National Laboratory.

In the process of doing code modernization footwork to get a relatively obscure but representative legacy code (the Laplace mesh smoothing algorithm used on a hex mesh) to scale to meet the capabilities of the upcoming Knights Landing architecture, William Roshan Quadros from Sandia was able to condense the process into a few steps. While still not simple or practical for all complex HPC codes, this method does provide a framework for approaching the problem of finding code hotspots and then making the critical decision to refactor or rewrite that code.

Using a seven-node testbed outfitted with Knights Corner processors, Quadros set about developing the procedure for scaling legacy code by starting at the profiling stage. It is here, using tuning and optimization tools like TAU to scan for hotspots, that the refactor-or-rewrite decision is made. In general, as one might suspect, if there are too many hotspots, the decision to rewrite the code becomes quite apparent. His team, for instance, found that one area of the code was gobbling 30 percent of the meshing runtime and focused on that exclusively instead of writing an entirely new mesh generator.
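Tools like TAU automate this kind of measurement, but the underlying idea of hotspot profiling can be shown with plain wall-clock timers around candidate phases. A minimal sketch (the struct and phase names are invented for illustration, not part of the Sandia workflow or the TAU API):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <map>
#include <string>

// Accumulate elapsed wall-clock seconds per named phase so that the
// fraction of total runtime spent in each candidate hotspot can be
// reported afterward.
struct PhaseTimer {
    std::map<std::string, double> seconds;

    void run(const std::string& name, const std::function<void()>& fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        seconds[name] += dt.count();
    }

    void report() const {
        double total = 0.0;
        for (const auto& kv : seconds) total += kv.second;
        for (const auto& kv : seconds)
            std::printf("%-16s %6.1f%% of runtime\n",
                        kv.first.c_str(), 100.0 * kv.second / total);
    }
};
```

A phase that dominates the report, like the smoothing loop consuming 30 percent of the meshing runtime above, becomes the natural refactoring target; many small, evenly spread hotspots point toward a rewrite instead.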

Naturally, the results will be different depending on the code in question, and there are a number of tools like TAU that can pinpoint the hotspots and define the strategy. However, as Quadros notes in his detailed code modernization case study on the Sandia testbed cluster, this is just to make the critical refactor-or-rewrite decision. The next steps are defining and then implementing the programming models, which are described in depth in the full paper.

One of the most notable aspects of the Sandia work is how the Kokkos library was instrumental in bringing the code up to speed. Quadros says that data parallelism using Kokkos achieved a node-level performance speedup of 20X on a Knights Landing device. Kokkos was authored at Sandia and provides a programming model that allows for performance portability across a number of many-core architectures (Knights Landing being just one).

A range of profiling and debugging tools, as well as optimization frameworks, are useful in the code modernization process, but the paper's conclusion is clear: “The results recommend use of a high-level performance portable library such as Kokkos, which can handle multiple advanced architecture specific memory access pattern performance constraints without having to modify the user code.”
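The portability Kokkos offers comes from writing a loop body once and letting the library map it, at compile time, onto whichever backend the target machine provides (OpenMP threads, CUDA, and so on), along with architecture-appropriate memory layouts. A minimal sketch of the idiom follows (illustrative only, not the Sandia smoothing kernel; building it requires the Kokkos library and a configured backend):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;

        // A View is an array whose memory layout is chosen per
        // backend rather than hard-coded by the user.
        Kokkos::View<double*> x("x", n);

        // One parallel loop body; the backend decides how it is
        // mapped onto threads, vector lanes, or GPU blocks.
        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 0.5 * i;
        });

        double sum = 0.0;
        Kokkos::parallel_reduce("sum", n,
            KOKKOS_LAMBDA(const int i, double& local) {
                local += x(i);
            }, sum);
    }
    Kokkos::finalize();
    return 0;
}
```

The same source compiles unchanged for a Knights Landing node or a GPU-accelerated one, which is the “without having to modify the user code” property the quote above refers to.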

As one might expect, code modernization efforts and tooling have been a significant priority in the last couple of years. With announcements around the architecture of next-generation pre-exascale systems out in the open, there have been a number of new papers and presentations on the topic, including some we will see at SC15 in Austin next month.

Among the birds of a feather (BoF) sessions that will be delving into code modernization approaches next month at SC15:

Software always lives longer than expected. Hardware changes over this lifetime are hard to ignore. Current hardware presents software written in the 1990s with tremendous on-node parallelism such that an MPI-only model is insufficient. Modifying these large, complex, MPI-only applications to run well on current hardware requires extensive and invasive changes. Further, new programming models for exploiting the on-node parallelism typically assume a start-from-scratch-and-application-wide approach, making them difficult to use. In this BoF a panel of experts will discuss migration paths that will allow legacy applications to perform better on current and future hardware. (Presented by a team from Sandia and Los Alamos).

This BoF session aims to bring together researchers, developers, vendors and other enthusiasts interested in user-level threading and tasking models to understand the current state of the art and the requirements of the broader community. The idea is to use this BoF as a mechanism to kick off a standardization effort for lightweight user-level threads and tasks. If things go as planned, this BoF series will continue in future years to provide information on the standardization process to the community and to attract more participants. (Presented by a team from Argonne National Lab).

Software engineering (SWE) for computational science and engineering (CSE) is challenging, with ever-more sophisticated, higher fidelity simulation of ever-larger and more complex problems involving larger data volumes, more domains and more researchers. Targeting high-end computers multiplies these challenges. We invest a great deal in creating these codes, but we rarely talk about that experience. Instead we focus on the results. The goal is to raise awareness of SWE for CSE on supercomputers as a major challenge, and to begin the development of an international “community of practice” to continue these important discussions outside of annual workshops and other “traditional” venues.

This BoF continues the history of community building among those developing HPC applications for systems incorporating the Intel Xeon Phi many-integrated core (MIC) processor. The next-generation Intel MIC processor code-named Knights Landing introduces innovative features which expand the parameter space for optimizations. The BoF will address these challenges together with general aspects such as threading, vectorization, and memory tuning. The BoF will start with Lightning Talks that share key insights and best practices, followed by a moderated discussion among all attendees. It will close with an invitation to an ongoing discussion through the Intel Xeon Phi Users Group (IXPUG).