Quick recap
The team discussed the challenges and potential solutions for achieving high uptime and reliability for the Mojaloop software, with a focus on infrastructure design and failover strategies. They also debated the importance of clear documentation and the role of manufacturers in improving reliability statistics. Lastly, they explored the complexities of probability calculations, the need for highly available software components, and the invariants of the Mojaloop software.
Summary
Pull Requests, Work Streams, and API Issues
James reported that two of his three pull requests had been merged, with the last one pending. He also shared that the 'coordinated vulnerability disclosure policy' and 'cyber security architecture' pull requests had been published after some revisions. James agreed to address Sam's point about the API gateway issue from the early days of Mojaloop, to avoid similar situations in the future.
Mojaloop Deployment and Availability Calculation
James and Paul discussed deploying Mojaloop on the infrastructure necessary for five nines (99.999%) of uptime. Paul questioned what counts as appropriate infrastructure and how the availability figure is calculated, given redundancy across multiple nodes. They agreed this was a challenging probabilistic calculation involving the reliability of both hardware and software. James reiterated his intent to state that Mojaloop makes use of suitable architectural patterns across many layers of the platform to enable this level of uptime.
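As a rough illustration of why the figure is hard to pin down, the sketch below combines availabilities under the simplifying assumption that failures are independent. The per-node figure of 99%, the replica count, and the three-layer stack are hypothetical and were not discussed in the meeting. Real deployments see correlated failures, so results like these are best read as optimistic upper bounds.

```python
from math import prod
from typing import Sequence


def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant replicas when any one replica suffices.

    Assumes replicas fail independently, which is rarely true in practice.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


def serial_availability(layer_availabilities: Sequence[float]) -> float:
    """Availability of a chain of layers that must all be up at the same time."""
    return prod(layer_availabilities)


# Hypothetical figures, chosen only for illustration.
single_node = 0.99
three_nodes = parallel_availability(single_node, 3)

# Two redundant layers plus one non-redundant dependency at 99.99%.
stack = serial_availability([three_nodes, three_nodes, 0.9999])

print(f"three redundant nodes: {three_nodes:.6f}")
print(f"whole stack:           {stack:.6f}")
```

Redundancy raises the availability of a single layer quickly, but the weakest non-redundant dependency in the chain caps the result for the whole stack.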
Uptime Issues and Probability Calculations
James and Paul discussed issues related to uptime and the display of certain information. Paul pointed out a discrepancy in the display he was seeing, which James initially couldn't explain but later attributed to a cached display in his browser. The group also delved into the complexities of probability calculations, specifically regarding the combination of probabilities and the potential for multiple outcomes. Paul shared his statistical background and highlighted the subjective nature of the calculations.
Statistical Data Presentation and Architectural Decisions
The team discussed the challenges of presenting statistical data, with James suggesting that the focus should be on the general solution and architecture, rather than specific numerical values. Paul agreed and proposed emphasizing the redundancy built into the system, which can be scaled up. They debated the use of the term "uptime" versus "reliability", with James suggesting the latter, and Paul advocating for the former as it is more commonly used in the industry. The team also touched upon the architectural decisions of the deployer and the need for flexibility to adapt to changing conditions.
Hardware Redundancy and Site Reliability Discussion
James and Paul discussed the complexity and significance of hardware redundancy and site reliability in their systems, with a focus on the challenges of calculating and improving these aspects. They highlighted the value of resources like Google's reliability engineering documents and the statistics published by Backblaze on the reliability of hard disks and solid-state drives. They also emphasized the role of manufacturers in responding to feedback from cloud providers to improve their statistics and reliability.
Failover Approach in Engineering Design
Paul, James, and Paul Makin discussed the engineering approach of designing for potential failures, a strategy known as 'failover'. They agreed this approach, which originated in hardware engineering, has now been adopted for software and infrastructure. James committed to updating a document to reflect this approach, using industry-standard availability as an example rather than specific numerical calculations. The team also briefly touched on a point raised by Michael in a previous Slack message, but did not elaborate further.
Payment System High Availability and Reliability
Karim discussed the importance of high availability in the payment system, emphasizing the use of dual auto failover links to ensure 99.5% uptime. He explained that in the event of one link failing, the other is immediately available to prevent unplanned outages. Additionally, he highlighted the need for a reliable mechanism to handle planned outages for software updates, suggesting a design that minimizes the planned outage time and allows for updates without system downtime.
Mojaloop Uptime Goal and Design Discussion
Karim, James, Vijay, Sam, and Paul discussed the ambitious goal of achieving 99.999% uptime for the Mojaloop software. They recognized that while this figure was theoretically possible, it would depend on various factors and could be compromised in practice. James suggested updating the wording to reflect this and to focus more on design. Paul emphasized the importance of discussing design invariants instead of specific uptime numbers. The team agreed on the need to support infrastructure designs that could achieve 5 nines of uptime, but they debated whether to include a specific uptime percentage in their statements.
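To put the percentages in context, this minimal sketch converts an availability target into a yearly downtime budget. It assumes a 365-day year and counts all downtime, planned or unplanned, against the budget. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for target in (99.5, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, while 99.5% leaves roughly 44 hours.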
Software Component Availability and Upgrades
Karim emphasized the importance of software components' availability and the ability to upgrade them without causing prolonged downtime. He advocated for an active-active or active-passive model for the software components to ensure there is no single point of failure. James agreed with Karim's points, and an exchange between David and James shifted the discussion toward improving the language of the invariants document.
Mojaloop Software Invariants and High Availability
James, David, and Karim discussed the invariants of the Mojaloop software and its components. They debated the importance of high availability and uptime, and how to achieve them. The team agreed that the software should be deployed using appropriate infrastructure and processes, with a focus on preventing single points of failure. They also discussed the need for clearer documentation on the use of Kubernetes and other tools and technologies for high availability. The invariants of the software, which should remain true regardless of design or implementation changes, were emphasized as a key principle.
Minimizing Downtime During System Upgrades
The team, led by James, discussed the challenges and potential solutions for minimizing downtime during system upgrades. Karim highlighted a recent issue with the national instant payment system in Pakistan, where a planned upgrade caused a two-hour outage. Paul Makin pointed out that version upgrades inherently require some downtime, but suggested that this could be minimized by using active-secondary site configurations. James proposed focusing on targets and risk management for future planned upgrades, and suggested moving the discussion of UUID version 7 to next week's meeting.
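For a sense of scale, the sketch below works out what a single two-hour outage does to a year's availability figure. It assumes a 365-day year and no other downtime in that year. The function name is illustrative and the calculation is not taken from the meeting.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a 365-day year


def achieved_availability_percent(outage_minutes: float,
                                  period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Availability over a period, given the total outage time within it."""
    return 100.0 * (1.0 - outage_minutes / period_minutes)


two_hour_outage = 120.0  # minutes
availability = achieved_availability_percent(two_hour_outage)

# Yearly allowance under a 99.999% target, for comparison.
five_nines_budget = MINUTES_PER_YEAR * (1.0 - 99.999 / 100.0)
years_of_budget = two_hour_outage / five_nines_budget

print(f"availability for the year: {availability:.3f}%")
print(f"years of five-nines budget consumed: {years_of_budget:.1f}")
```

Under these assumptions, one two-hour outage leaves the year at about 99.977% and uses up roughly 23 years of a five-nines allowance, which is why planned upgrades need a design that avoids taking the system down.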
Next steps
• James will update the wording in the invariants section to focus on design and architectural decisions that contribute to high availability, rather than specific uptime numbers.
• James will rephrase the paragraph on planned upgrades to target zero downtime and to discuss the risk and potential impact of any upgrade.
• Paul will prepare an agenda for the next meeting to discuss UUID version 7.
Please note that this content was originally generated by AI but has been reviewed for accuracy by James Bush.