What is Charm++?

Charm++ is a mature, highly scalable parallel programming system. Though written in a C++ skeleton, it is compatible with Fortran, C, and C++. Code written in MPI can call Charm++, and Charm++ can call MPI, OpenMP, CUDA, and more.


Work Parallelization

The problem is broken down into logical units, which are automatically mapped to processors.

Load Balancing

Load imbalance arises in many HPC applications, and also occurs on mixed-hardware clusters. Rather than make every program solve this on its own, Charm++ provides automatic load balancing for all applications.
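As a rough illustration of what a load balancer does with measured loads, here is a minimal sketch of a greedy strategy: the heaviest work units are placed first, each on the currently least-loaded processor. The function name, unit names, and load values are hypothetical, and this is not Charm++'s actual implementation.

```python
import heapq

def greedy_balance(loads, num_procs):
    """Assign each work unit to the currently least-loaded processor.

    loads: mapping of (hypothetical) unit names to measured loads.
    Returns (assignment, per-processor load totals).
    """
    # Min-heap of (total load, processor id); all processors start empty.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest units first so lighter units can fill the gaps.
    for unit, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)
        assignment[unit] = proc
        heapq.heappush(heap, (total + load, proc))
    totals = [0.0] * num_procs
    for unit, proc in assignment.items():
        totals[proc] += loads[unit]
    return assignment, totals

# Seven illustrative work units with uneven loads, spread over three processors.
loads = {"u0": 9.0, "u1": 7.0, "u2": 6.0, "u3": 5.0, "u4": 4.0, "u5": 3.0, "u6": 2.0}
assignment, totals = greedy_balance(loads, 3)
print(totals)
```

A greedy pass like this is a heuristic: it usually lands close to an even split but does not guarantee the optimal one.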

Automatic Communication & Computation Overlap

Charm++ exploits logical decomposition to enable dynamic overlap of communication and computation as the application executes.

Checkpointing & Resilience

Applications written in Charm++ can automatically checkpoint and restart with no extra code or special OS support. They can also run through node failures!

Power/Temperature Management

Charm++ can adapt execution to limit power consumption, reduce hotspots to improve reliability, and conserve energy to reduce cluster TCO.


Projections: an extensive suite for understanding the performance of your applications.

LiveViz: get live visualization data from any application.

CharmDebug: debug your Charm++ code interactively.


Collision Detection: a highly scalable collision detection library written in Charm++.

Sorting: a highly efficient Charm++ library that can be used to sort billions of keys.


In partnership with the University of Illinois, Charmworks is the exclusive commercial licensor for the Charm++ parallel programming system and its associated tools. Licenses are offered for a wide range of needs.

Developer: Developer licenses cover the full compilation toolchain for Charm++ and mixed Charm++/MPI applications, including tools for debugging, performance analysis, and visualization, as well as runtime licenses for correctness and scaling validation.
Runtime: Runtime licenses enable usage of Charm++ codes on production-scale clusters and supercomputers.
Embedded Library: Embedded library licenses cover particular components built in Charm++ called from non-Charm++ applications.
Application Distribution: Application distribution licenses are useful for ISVs that wish to incorporate Charm++ into their products.

Contact us today for more information on pricing and licensing options!



Our staff is available to teach courses in parallel computing, focusing on scalable algorithm design and efficient implementation using the Charm++ system. These courses can range in length from short introductions of a few hours to a week-long hands-on tutorial. We tailor the coverage and presentation of each course to your group’s knowledge and experience.


We can provide the following consulting services:

  • Problem analysis and solution method development
  • Application Development: architecture, design, testing, debugging, verification, and validation
  • Performance Engineering: analysis, tuning, restructuring, and optimization
  • Integration with existing applications and work-flows
  • Cluster hardware purchasing assistance
  • Cloud and utility computing deployment


We offer many remote support options to help ensure your deployment runs without problems:

  • Installation matched to your environment
  • Integration with your scheduler & resource manager
  • Rapid solutions for any bugs encountered

Contact us today for more information about our services.



Explore how Charm++ has been used in other applications.

Domain: Classical MD

Converted From: PVM

Scale: 500k CPU cores

NAMD, recipient of a 2002 Gordon Bell Award, is a parallel molecular dynamics application designed for high-performance simulation of large biomolecular systems. NAMD is a result of many years of collaboration between Prof. Kale, Prof. Robert D. Skeel, and Prof. Klaus J. Schulten at the Theoretical and Computational Biophysics Group (TCBG) of Beckman Institute.

Charm++, developed by Prof. Kale and co-workers, simplifies parallel programming and provides automatic load balancing, which was crucial to the performance of NAMD. NAMD is used by tens of thousands of biophysical researchers, with production versions installed on most supercomputing platforms. NAMD scales to hundreds of cores for small simulations and beyond 300,000 cores for the largest simulations.

The dynamic components of NAMD are implemented in the Charm++ parallel language. A Charm++ program is composed of collections of C++ objects, which communicate by remotely invoking methods on other objects. This supports the multi-partition decompositions in NAMD. In addition, data-driven execution adaptively overlaps communication and computation. Finally, NAMD benefits from Charm++'s load balancing framework to achieve unsurpassed parallel performance. See the PPL NAMD research page for more details.
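To make the idea of objects communicating through remote method invocation more concrete, here is a toy, single-process sketch of message-driven execution. The Scheduler and Cell classes are hypothetical stand-ins and are not the Charm++ API; the point is only that a "call" is a queued message, so the sender never blocks waiting for the receiver.

```python
from collections import deque

class Scheduler:
    """Toy message-driven scheduler: a method call is a queued message."""
    def __init__(self):
        self.queue = deque()
        self.objects = {}

    def register(self, name, obj):
        self.objects[name] = obj

    def send(self, target, method, *args):
        # Asynchronous: enqueue the invocation and return immediately.
        self.queue.append((target, method, args))

    def run(self):
        # Deliver messages one at a time until none remain.
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(self.objects[target], method)(*args)

class Cell:
    """Hypothetical work unit that accumulates values sent by its neighbor."""
    def __init__(self, sched, neighbor):
        self.sched = sched
        self.neighbor = neighbor
        self.total = 0

    def start(self, value):
        # Send our value to the neighbor; we do not wait for a reply.
        self.sched.send(self.neighbor, "receive", value)

    def receive(self, value):
        self.total += value

sched = Scheduler()
sched.register("left", Cell(sched, "right"))
sched.register("right", Cell(sched, "left"))
sched.send("left", "start", 3)
sched.send("right", "start", 5)
sched.run()
print(sched.objects["left"].total, sched.objects["right"].total)
```

Because each object only reacts to whichever message arrives next, a runtime built this way is free to run other objects while a given message is still in flight, which is the basis for overlapping communication with computation.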

Domain: N-body gravity & SPH

Converted From: MPI

Scale: 500k CPU cores

ChaNGa (Charm++ N-Body Gravity Simulator) is a cosmological simulator for studying the formation of galaxies and other large scale structures in the Universe. It is the result of an interdisciplinary collaboration between Prof. Kale, Prof. Thomas Quinn of the University of Washington, and Prof. Orion Lawlor of the University of Alaska Fairbanks. ChaNGa is a production code with the features required for accurate simulation, including canonical, comoving coordinates with a symplectic integrator to efficiently handle cosmological dynamics, individual and adaptive time steps, periodic boundary conditions using Ewald summation, and Smooth Particle Hydrodynamics (SPH) for adiabatic gas.

ChaNGa implements the well-known Barnes-Hut algorithm, which has N log N computational complexity, organizing the particles involved in the simulation into a tree based on Oct, Orthogonal Recursive Bisection (ORB), or SFC decompositions. In order to compute the pair interaction between particles and collections of particles on different processors, parts of the tree needed for the computation are imported from the remote processors which own them. ChaNGa uses the Charm++ Salsa parallel visualization and analysis tool. Visualization in the context of cosmology involves a large amount of data, possibly spread over multiple processors.
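The tree-based approximation at the heart of Barnes-Hut can be sketched in a few lines. The following is a simplified, serial, 2D illustration (a quadtree rather than an oct-tree) that computes the gravitational potential with G = 1 instead of forces; it is not ChaNGa's code. The Node class, the opening parameter theta, and the random test bodies are assumptions made for the example, and it assumes no two bodies share exactly the same position.

```python
import math
import random

class Node:
    """Square quadtree cell storing total mass and mass-weighted position sums."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = 0.0
        self.mx = self.my = 0.0   # sums of m*x and m*y, used for the center of mass
        self.body = None          # (x, y, m) when this cell is a leaf holding one body
        self.children = None

    def _child(self, x, y):
        # Children are ordered (-x,-y), (-x,+y), (+x,-y), (+x,+y).
        return self.children[2 * (x >= self.cx) + (y >= self.cy)]

    def insert(self, x, y, m):
        if self.mass == 0.0 and self.children is None:
            self.body = (x, y, m)             # empty leaf: store the body here
        else:
            if self.children is None:
                # Occupied leaf: split it and push the existing body down.
                h = self.half / 2
                self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                                 for dx in (-1, 1) for dy in (-1, 1)]
                old, self.body = self.body, None
                self._child(old[0], old[1]).insert(*old)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def potential(self, x, y, theta):
        """Approximate sum of -m/r at (x, y) over all bodies in this cell."""
        if self.mass == 0.0:
            return 0.0
        if self.body is not None:
            bx, by, m = self.body
            r = math.hypot(x - bx, y - by)
            return -m / r if r > 0 else 0.0   # skip a body located at the query point
        r = math.hypot(x - self.mx / self.mass, y - self.my / self.mass)
        # Cell looks small from here: treat it as one point mass at its center of mass.
        if r > 0 and (2 * self.half) / r < theta:
            return -self.mass / r
        return sum(c.potential(x, y, theta) for c in self.children)

random.seed(0)
bodies = [(random.random(), random.random(), random.uniform(0.5, 1.5))
          for _ in range(200)]
root = Node(0.5, 0.5, 0.5)          # unit square centered at (0.5, 0.5)
for b in bodies:
    root.insert(*b)

px, py = 0.3, 0.7
direct = sum(-m / math.hypot(px - x, py - y) for x, y, m in bodies)
approx = root.potential(px, py, theta=0.5)
print(direct, approx)
```

Setting theta to 0 forces every cell to be opened and reproduces the direct sum; larger values trade accuracy for fewer interactions. In a distributed setting, the cells visited during this traversal are the "parts of the tree" that may have to be fetched from the processors that own them.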

ChaNGa has been scaled to 32K cores, and has been ported to GPU clusters. ChaNGa is under active development, with an eye toward efficient utilization and scaling of current and future supercomputing systems.

Domain: Agent-based epidemiology

Converted From: MPI

Scale: 500k CPU cores

The study of contagion effects in extremely large social networks, such as the spread of disease pathogens through a population, is critical to many areas of our world.

Scaling Agent-based Simulation of Contagion Diffusion over Dynamic Networks on Petascale Machines

Applications that model dynamical systems involve large scale, irregular graph processing. These applications are difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. This work is a collaborative project between PPL and Virginia Tech to create a Charm++ version of EpiSimdemics.

EpiSimdemics implements a graph-based system that captures dynamics among co-evolving entities, while simulating contagious diffusion in extremely large and realistic social contact networks. EpiSimdemics relies on individual-based models, thus allowing studies in great detail. The implementation of EpiSimdemics in Charm++ enables future research by social, biological and computational scientists at unprecedented data and system scales. We have presented new methods for application-specific decomposition of graph data and predictive dynamic load migration, and demonstrated the effectiveness of these methods on Cray XE6/XK7 and IBM Blue Gene/Q.

Domain: PDES

Converted From: MPI

Scale: 500k CPU cores

See paper


Discrete event simulations (DES) are central to exploration of "what-if" scenarios in many domains including networks, storage devices, and chip design. Accurate simulation of dynamically varying behavior of large components in these domains requires the DES engines to be scalable and adaptive in order to complete simulations in a reasonable time. This paper takes a step towards development of such a simulation engine by redesigning ROSS, a parallel DES engine in MPI, in Charm++, a parallel programming framework based on the concept of message-driven migratable objects managed by an adaptive runtime system.

In the paper, we first show that the programming model of Charm++ is highly suitable for implementing a PDES engine such as ROSS. Next, the design and implementation of the Charm++ version of ROSS is described and its benefits are discussed. Finally, we demonstrate the performance benefits of the Charm++ version of ROSS over its MPI counterpart on IBM's Blue Gene/Q supercomputers. We obtain up to 40% higher event rate for the PHOLD benchmark on two million processes, and improve the strong-scaling of the dragonfly network model to 524,288 processes with up to 5x speed up at lower process counts.

Domain: Electronic Structure

Converted From: MPI

Scale: 128k CPU cores

Many important problems in material science, chemistry, solid-state physics, and biophysics require a modeling approach based on fundamental quantum mechanical principles. A particular approach that has proven to be relatively efficient and useful is Car-Parrinello ab initio molecular dynamics (CPAIMD). Parallelization of this approach beyond a few hundred processors is challenging, due to the complex dependencies among various subcomputations, which lead to complex communication optimization and load balancing problems. We are parallelizing CPAIMD using Charm++. The computation is modeled using a large number of virtual processors, which are mapped flexibly to available processors with assistance from the Charm++ runtime system.

This project began as an NSF-funded collaboration involving us (PPL: Laxmikant Kale) and Drs. Roberto Car, Michael Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom and Josep Torrellas. It then shifted to a collaborative development effort to scale both OpenAtom and NAMD under the LCF ORNL grant "Scalable Atomistic Modeling Tools with Chemical Reactivity for Life Sciences", as a continuing collaboration with PPL: Kale on computer science, Martyna and Tuckerman on the QM side, Klaus Schulten on the MD side, and Jack Dongarra on performance optimization for ORNL LCF. Currently, the OpenAtom project is a collaboration of Kale with Glenn Martyna and Sohrab Ismail-Beigi.

Domain: Relativistic MHD

Scale: 100k CPU cores

SpECTRE is a Charm++ application used for research on relativistic astrophysics. The design of SpECTRE relies on arrays of Charm++ objects to represent its Discontinuous Galerkin elements and distribute them over processors and nodes. By taking advantage of asynchronous execution and adaptive overlap, SpECTRE has been scaled to run on the full 22,000 nodes of the Blue Waters supercomputer with excellent efficiency.

See paper


We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE's goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code's scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine's full capacity of 22,380 nodes using 671,400 threads.

Domain: Astrophysics/Cosmology

Converted From: MPI

Scale: 64k CPU cores

Cello is a Charm++ framework for multi-physics adaptive mesh refinement (AMR) simulations. The next generation of the Enzo cosmological hydrodynamics code, Enzo-P, is built on Cello. The design of Enzo-P and Cello relies on fully-distributed Charm++ object array data structures. Some of the largest AMR simulations in the world have been run with Cello, and Cello has achieved almost perfect scaling on 64,000 cores of the NCSA Blue Waters supercomputer.

Domain: Quantum Chemistry

Converted From: OpenMP

Scale: 50k CPU cores

See paper


We present a hybrid OpenMP/Charm++ framework for solving the O(N) Self-Consistent-Field eigenvalue problem with parallelism in the strong scaling regime, P ≫ N. This result is achieved with a nested approach to Spectral Projection and the Sparse Approximate Matrix Multiply [Bock and Challacombe, SIAM J. Sci. Comput. 35 C72, 2013], which involves an N-Body approach to occlusion and culling of negligible products in the case of matrices with decay. Employing classic technologies associated with the N-Body programming model, including over-decomposition, recursive task parallelism, orderings that preserve locality and persistence-based load balancing, we obtain scaling better than P ∼ 500 N for small water clusters ([H2O]N, N = 30, 90, 150) and find support for an increasingly strong scalability with increasing system size, N.

Domain: Systems Hydrology

Scale: 1000 CPU cores

ADHydro is a large-scale, high-resolution, multi-physics watershed simulation created by the CI-WATER watershed modeling team. ADHydro was specifically developed for high performance computing environments rather than single computers, allowing larger-scale and higher-resolution simulation domains. ADHydro was parallelized in October of 2014 using the Charm++ run-time environment, and runs have been completed on the UWyo Advanced Research Computing Cluster using 512 or more cores.

Domain: Textile & rigid body dynamics

Converted From: TBB

Scale: 768 CPU cores

See paper


This paper presents a scalable implementation of the Asynchronous Contact Mechanics (ACM) algorithm, a reliable method to simulate flexible material subject to complex collisions and contact geometries. As an example, we apply ACM to cloth simulation for animation. The parallelization of ACM is challenging due to its highly irregular communication pattern, its need for dynamic load balancing, and its extremely fine-grained computations.

We utilize CHARM++, an adaptive parallel runtime system, to address these challenges and show good strong scaling of ACM to 384 cores for problems with fewer than 100k vertices. By comparison, the previously published shared memory implementation only scales well to about 30 cores for the same examples. We demonstrate the scalability of our implementation through a number of examples which, to the best of our knowledge, are only feasible with the ACM algorithm. In particular, for a simulation of 3 seconds of a cylindrical rod twisting within a cloth sheet, the simulation time is reduced by 12× from 9 hours on 30 cores to 46 minutes using our implementation on 384 cores of a Cray XC30.
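The quoted 12× figure can be checked with a short calculation using only the numbers above (9 hours on 30 cores, 46 minutes on 384 cores). The helper name is hypothetical, and the "relative efficiency" here compares two different implementations, so it should be read as a rough indicator rather than a strict strong-scaling efficiency.

```python
def speedup_and_efficiency(t_base, p_base, t_new, p_new):
    """Return (speedup, relative efficiency) between two runs.

    Relative efficiency is the speedup divided by the increase in core count.
    """
    speedup = t_base / t_new
    efficiency = speedup / (p_new / p_base)
    return speedup, efficiency

# Times in minutes: 9 hours on 30 cores versus 46 minutes on 384 cores.
speedup, efficiency = speedup_and_efficiency(9 * 60, 30, 46, 384)
print(f"speedup: {speedup:.1f}x, relative efficiency: {efficiency:.0%}")
```

The core count grows by a factor of 12.8 while the run time shrinks by a factor of about 11.7, which is where the rounded 12× comes from.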

Domain: Velocimetry reconstruction

Scale: 512 CPU cores

See paper


Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. This problem may be solved by using parallel processing.

In this paper, a parallel-processing framework has been developed based on frame decomposition and is programmed using the asynchronous object-oriented Charm++ paradigm. This framework can be a key step in achieving a scalable Lagrangian measurement system for particle-tracking velocimetry and may lead to real-time measurement capabilities.

The parallel tracking algorithm was evaluated with three data sets, including the particle image velocimetry standard 3D images data set #352, a uniform data set for optimal parallel performance, and a computational-fluid-dynamics-generated non-uniform data set, to test trajectory reconstruction accuracy, consistency with the sequential version, and scalability to more than 500 processors. The algorithm showed strong scaling up to 512 processors, and no inherent limits of scalability were seen. Ultimately, up to a 200-fold speedup was observed compared to the serial algorithm when 256 processors were used. The parallel algorithm is adaptable and could be easily modified to use any sequential tracking algorithm that takes frames of 3D particle location data as input and outputs particle trajectories.