One of the major benefits of going down the pureScale route is resilience: it is possible to configure a system so that there is no single point of failure, and the loss of any given component will not result in an application outage. Automatic Online Member Recovery allows DB2 to detect the loss of a member (if an LPAR or server crashes, for example) and automatically restart the failed member on one of the surviving LPARs so that recovery action can be taken. In the meantime, client connections are automatically re-routed to the surviving members, with workload balancing ensuring that each of them shoulders an approximately equal share of the increased load. When the failed server/LPAR becomes available again, DB2 automatically detects this and restarts the failed member on its original host. The workload balancing feature then re-distributes incoming work so that all members again receive approximately the same load.
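The re-routing behaviour described above can be illustrated with a small toy model. This is a sketch only, not DB2 code: the `ToyCluster` class, the member names and the connection count are all hypothetical, and it assumes connections are simply split evenly across the live members, whereas real workload balancing takes actual member load into account.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyCluster:
    """Illustrative model only; none of this is DB2 code or a DB2 API."""
    members: Dict[str, bool]   # member name -> True if the member is up
    connections: int           # total client connections to spread out

    def fail(self, name: str) -> None:
        # Simulate the loss of a member (e.g. its LPAR crashes).
        self.members[name] = False

    def recover(self, name: str) -> None:
        # Simulate the member being restarted on its original host.
        self.members[name] = True

    def distribution(self) -> Dict[str, int]:
        # Assumption: connections are split as evenly as possible
        # across whichever members are currently up.
        live = sorted(name for name, up in self.members.items() if up)
        if not live:
            raise RuntimeError("no surviving members")
        base, extra = divmod(self.connections, len(live))
        return {name: base + (1 if i < extra else 0)
                for i, name in enumerate(live)}


cluster = ToyCluster({"m0": True, "m1": True, "m2": True, "m3": True},
                     connections=120)
print(cluster.distribution())   # 30 connections on each of four members
cluster.fail("m2")
print(cluster.distribution())   # 40 on each of the three survivors
cluster.recover("m2")
print(cluster.distribution())   # back to 30 on each member
```

The point of the model is simply that the total workload is unchanged throughout: the survivors each carry more while a member is down, and the share drops back once it returns.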
A similar situation exists in the event of a primary CF failure: DB2 will detect that it is no longer getting a heartbeat from the CF and temporarily suspend all work until the secondary CF has been brought completely up to date. Once that’s done, the secondary CF takes over as the new primary (in simplex mode) and work is allowed to continue. In practice, this means a “blip” in transaction response times while the secondary CF takes over. (Note that if the secondary CF fails, DB2 simply continues in simplex mode until the failed CF can be restarted.)
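The sequence of events for a CF failure can be sketched as a simple state machine. Again this is a hypothetical illustration rather than anything DB2 exposes: the `CfPair` class, the CF names and the event strings are invented for the example, and the return to duplex mode on restart is an assumption drawn from the description above.

```python
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    DUPLEX = "duplex"     # primary and secondary CF both available
    SIMPLEX = "simplex"   # only one CF available


class CfPair:
    """Toy model of a primary/secondary CF pair; not DB2 code."""

    def __init__(self, primary: str = "cf1", secondary: str = "cf2") -> None:
        self.primary = primary
        self.secondary: Optional[str] = secondary
        self.events: List[str] = []   # ordered record of what happened

    @property
    def mode(self) -> Mode:
        return Mode.DUPLEX if self.secondary is not None else Mode.SIMPLEX

    def fail(self, cf: str) -> None:
        if cf == self.secondary:
            # Losing the secondary does not suspend work.
            self.secondary = None
            self.events.append("continue in simplex")
        elif cf == self.primary:
            if self.secondary is None:
                raise RuntimeError("no CF left to take over")
            self.events.append("suspend work")
            self.events.append(f"bring {self.secondary} up to date")
            # The secondary becomes the new primary, in simplex mode.
            self.primary, self.secondary = self.secondary, None
            self.events.append(f"promote {self.primary}")
            self.events.append("resume work")
        else:
            raise ValueError(f"unknown CF: {cf}")

    def restart(self, cf: str) -> None:
        # Assumption: a restarted CF rejoins as the secondary.
        if self.secondary is None and cf != self.primary:
            self.secondary = cf
            self.events.append(f"{cf} rejoins as secondary")


pair = CfPair()
pair.fail("cf1")                 # crash the primary
print(pair.primary, pair.mode)   # cf2 is now primary, running simplex
print(pair.events)
```

The "blip" in response times corresponds to the window between the suspend and resume events; a secondary failure never enters that window.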
Below is a chart from some internal testing, showing a steady transaction rate (using an 80/20 read/write ratio with 100% data sharing) until the primary CF is intentionally crashed. As you can see, throughput drops to zero for around 10 seconds until the secondary CF takes over, after which work resumes with no transactions lost.