Mojaloop Performance, Architecture PoC

Thursday October 22, 2020
12:30 PM UTC

Session Leads: Pedro Barreto (@pedrosousabarreto), Miguel de Barros (@mdebarros), Bryan Schneider (@BryanSchneider)

Session Description:

A discussion of performance improvements: the outcome of a Proof of Concept stream of work undertaken to optimize architecture and scalability.

Session Slides: https://github.com/mojaloop/documentation-artifacts/blob/master/presentations/October%202020%20Community%20Event/presentations/PI11-October-2020-performance.pdf

Session Recording: https://www.youtube.com/watch?v=Jdb1UGysXIY&list=PLhqcvNJ3bJt7U8ab05ayGv16-6n5jZOH2&index=33

Please post questions, suggestions and thoughts about this session in this thread.

Post by clicking the “Reply” button below.

I’m not sure that I understand the propriety of quoting the overall capacity of a system as a multiple of the capacity of individual DFSPs. If the constraint is the capacity of an individual DFSP, then an overall capacity of the kind quoted can only be obtained if all DFSPs are running at the same speed. Unused capacity from DFSPs which are running below capacity is, if I have understood correctly, not convertible for use by DFSPs which are running at capacity.

Equality of throughput between DFSPs is unlikely to be the case in the real world, particularly where systems include both MMS and traditional banks, given that MMS throughputs are often an order of magnitude greater than traditional bank throughputs. Would it not be more accurate simply to quote the per-DFSP throughput limit, together with an inefficiency overhead associated with adding further DFSPs, if it really is the case that adding further DFSPs running below the per-DFSP capacity impairs the performance of those DFSPs which are running at capacity?

@Michael, think of it as “reserved capacity” made available under a service level agreement to a scheme participant. You might get better performance if other DFSPs are not executing at the service limits. We did not study that. But we also did not study the “peak factor”, which can be substantial: 3x to 8x for MM, though not as bad for batch-level traffic from banks, as that can be throttled.

To establish that the CSU is capable of higher performance, we would need to study the non-linearity of the maximum transaction time. If it remains mostly below 5 seconds, you can crank up the volume.

But we set a specific throughput target combined with a specific sub-second response time to show what “steady state” SLA could be offered to participants, and then how many participants could be offered this SLA simultaneously.

So what the team measured has systemic value to a scheme design. What you are asking for is a reflection of the lumpiness of demand. We ought to study that to see what impact exceeding a per-DFSP SLA has on overall system performance and whether we need to implement throttling or quality-of-service strategies.
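To make “throttling” concrete, one illustrative option would be a per-DFSP token bucket in front of the prepare handler. A minimal sketch in TypeScript; the class, rates, and function names here are hypothetical, not from the PoC:

```ts
// Minimal per-DFSP token-bucket throttle. Rates are illustrative only;
// the PoC did not specify a throttling design.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly ratePerSec: number, // sustained SLA rate, e.g. 200 TPS
    private readonly burst: number       // short-term burst allowance
  ) {
    this.tokens = burst;
    this.lastRefill = Date.now();
  }

  tryAcquire(): boolean {
    const now = Date.now();
    // Refill tokens in proportion to elapsed time, capped at the burst size.
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // reject or queue the transfer prepare
  }
}

const buckets = new Map<string, TokenBucket>();

function admitTransfer(payerFspId: string): boolean {
  let bucket = buckets.get(payerFspId);
  if (!bucket) {
    bucket = new TokenBucket(200, 400); // hypothetical per-DFSP SLA: 200 TPS, 2x burst
    buckets.set(payerFspId, bucket);
  }
  return bucket.tryAcquire();
}
```

Whether rejected prepares are bounced back to the DFSP or queued is exactly the kind of quality-of-service question the study would need to answer.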

1 Like

That’s an astute observation @MichaelRichards. My understanding of the test environment at the moment is that it assumes similar loads per-DFSP, but perhaps it’s worth benchmarking a scenario where there are uneven loads between DFSPs, and see how that affects the overall performance.

Here is the Chat from the live presentation today:

05:52:20 From Lewis Daly : @miguel have you guys looked into rewriting specific components into other languages (maybe Rust, Golang or something) to eke out even more performance?
05:52:54 From Steve Haley to Panelists : As a short term, that will serve most markets - only a few participants per market will be concerned by the 200 FTP/s
05:53:44 From Pedro Barreto to Panelists : Not at the moment, Lewis; because we don’t have many calculation/CPU-intensive tasks, we might not get the usual benefit of going to those languages/stacks
05:55:01 From Steve Haley to Panelists : (comment referred to 3. Virtual positions)
05:56:54 From Miller Abel : Yes… MPESA Kenya might need 8 VFSPs to handle loads, but nearly every other operator in that market would be fine with the single FSP perf.
05:57:04 From Adrian Hope-Bailie to Panelists : Tiger Beetle is written in Zig
05:57:47 From Miller Abel : Rebalancing is complex, but worthy of study to see if this is worth the effort. Otherwise, FSPs have to do this on their own by onboarding multiple “branches” each with their own liquidity for settlement.
05:58:27 From Steve Haley to Panelists : @paula - can we make this your landing page? :slight_smile:
05:59:13 From Miller Abel : @Adrian Zig???
06:09:41 From Lewis Daly : Does this cost include networking costs?
06:09:51 From Lewis Daly : Amazon tends to hide those from you
06:09:57 From Miguel de Barros to Panelists : “@miguel have you guys looked into rewriting specific components into other languages (maybe Rust, Golang or something) to eke out even more performance?” As per Pedro’s comment, we are mostly bound by network and IOPS; the CPU processing is relatively light. Not to say there wouldn’t be some performance improvement by switching languages, but the underlying Kafka libraries are already written in C++. I do not think a change would be substantial.
06:10:05 From Miguel de Barros to Panelists : @lewis ^^
1 Like

Thanks @millerabel - I think I missed a lot of this since the responses were sent to panelists only :laughing:

…and you can see a hint of this in the data. The minimum transaction time for the 32-FSP config is about half the CSU value. So we are not using the maximum throughput capacity, in favor of a level SLA to participants.

I think we have a pretty good understanding of what we expect the results to look like. The performance of each participant handler is mostly isolated, apart from the shared data-stores (i.e. Redis), assuming no network/IOPS limits are being hit. As long as you benchmark against the scenario of all FSPs, you would expect each FSP to maintain the same maximum processing capability, assuming the injection of transfers is maintained for each sub-set of FSPs.

It would be worth trying that scenario to validate this understanding.
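One way to frame such a run: keep the aggregate injection rate constant but skew it across FSPs, then compare per-FSP latency and throughput against the even-load baseline. A hypothetical scenario description in TypeScript (the real test-harness configuration format may differ):

```ts
// Hypothetical uneven-load benchmark scenario; the actual test-harness
// configuration format may differ.
interface FspLoad {
  fspId: string;
  injectionRateTps: number; // transfers injected per second for this payer FSP
}

const evenBaseline: FspLoad[] = Array.from({ length: 8 }, (_, i) => ({
  fspId: `payerfsp-${i + 1}`,
  injectionRateTps: 100, // 8 x 100 = 800 TPS aggregate
}));

const skewedScenario: FspLoad[] = [
  { fspId: 'payerfsp-1', injectionRateTps: 200 }, // at the per-DFSP limit
  { fspId: 'payerfsp-2', injectionRateTps: 200 },
  { fspId: 'payerfsp-3', injectionRateTps: 100 },
  { fspId: 'payerfsp-4', injectionRateTps: 100 },
  ...Array.from({ length: 4 }, (_, i) => ({
    fspId: `payerfsp-${i + 5}`,
    injectionRateTps: 50, // well below the limit
  })),
]; // same 800 TPS aggregate, unevenly distributed
```

If the isolation assumption holds, the FSPs at 200 TPS should see the same latency profile in both scenarios.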


Please refer here for the tight, fast C-like language @Adrian refers to using in Tiger Beetle.

Virtual positions, an option we have not yet experimented with, would allow the scale to be multiplied for single DFSPs. Then we can apply the current math: the current per-DFSP limit multiplied by the number of virtual positions per DFSP.

A system that has 4 virtual positions per DFSP will be able to handle four times the load.
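As a trivial sketch of that arithmetic, assuming (untested) that virtual positions scale linearly with no rebalancing overhead:

```ts
// Capacity arithmetic under the (untested) assumption that virtual
// positions scale linearly, using the figures from this thread.
const perPositionLimitTps = 200;    // the per-DFSP limit measured in the PoC
const virtualPositionsPerDfsp = 4;
const dfspCount = 8;

const perDfspCapacityTps = perPositionLimitTps * virtualPositionsPerDfsp; // 800
const systemCapacityTps = perDfspCapacityTps * dfspCount;                 // 6400

console.log({ perDfspCapacityTps, systemCapacityTps });
```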

1 Like

Sorry, I didn’t notice I was sending to panelists only :frowning:

1 Like

Presumably, less the overhead of maintaining the virtuality?

The overhead is minimal: the system is already reactive and event-driven, so the cost of checking every now and then that the position is still OK is negligible. It is just not as simple a flow as it is today. Still, all async.

Regarding resource usage, we can handle all DFSPs from a single participant container/service instance when not under load, and scale that horizontally up to one container/service per virtual position when under load. Quite a flexible and efficient use of resources.
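As a rough illustration of that elasticity: if virtual positions map one-to-one onto Kafka partitions (an assumption about deployment, not something the PoC specifies), ordinary consumer-group rebalancing gives exactly this scaling. The sketch below uses kafkajs and hypothetical topic/group names; the PoC itself may use a different client:

```ts
// Illustrative only: one Kafka partition per virtual position, so the
// consumer group scales from 1 container handling all partitions when
// idle, up to N containers (one per virtual position) under load.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'position-handler', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'position-handlers' });

async function run(): Promise<void> {
  await consumer.connect();
  // Topic partitioned by virtual position (hypothetical topic name).
  await consumer.subscribe({ topic: 'transfer-positions' });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // Each partition maps to one virtual position, so position state
      // for that bin is only ever touched by one consumer at a time.
      processPositionEvent(partition, message.value);
    },
  });
}

function processPositionEvent(positionBin: number, payload: Buffer | null): void {
  // Placeholder for the liquidity/position update logic.
}

run().catch(console.error);
```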

1 Like

I still think it’s about more than the lumpiness of demand. If I’ve understood this discussion correctly, there is at present a hard limit on the throughput for a payer DFSP, which I take to be a consequence of the two pinch points in the current implementation and the PoC: the duplicate check and the liquidity check. You’re certainly right that this can be alleviated by allowing a build-up in demand, provided that this does not result in unacceptable delay; but I imagine that system planners would have factored that in already, and would be planning for maximum demand. That’s certainly how we looked at it in M-Pesa.

The problem is, as far as I see, about the implied transferability of spare capacity. If I’ve got a DFSP who’s running at a max load of 200 TPS in a 4-DFSP system, and we re-scale the system to accommodate 8 DFSPs, then the rated capacity of the system has ~doubled according to our measurement, but the 200-TPS DFSP hasn’t got any more capacity available. So all I’m saying is: it’s easier to understand the capacity of this system if I quote it per DFSP, since that is in fact the limiting factor; and this is true even if we improve the throughput, in this architecture.
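To put rough numbers on that point, using only the figures above:

```ts
// Rated system capacity doubles with re-scaling, but the usable
// capacity of the busy DFSP does not change.
const perDfspLimitTps = 200;

const ratedCapacity4Dfsps = 4 * perDfspLimitTps; // 800 TPS "system" rating
const ratedCapacity8Dfsps = 8 * perDfspLimitTps; // 1600 TPS after re-scaling

const busyDfspCapacityBefore = perDfspLimitTps;  // 200 TPS
const busyDfspCapacityAfter = perDfspLimitTps;   // still 200 TPS
```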

Sorry, Pedro: I didn’t mean to suggest that this wasn’t a resource-efficient system. I’m sure it is. All I was trying to say was that the performance limit is per DFSP, so that’s how we should quote it…

You’re correct on the capacity calculation.
But in the future, if we implement the virtual positions, it will be per virtual position instead of per DFSP.

1 Like

You’ve understood it correctly, but let’s not overstate the MPesa case. A hub must enter into an SLA that it can honor. We can’t have MPesa-like volume consume so many system resources that all other DFSPs are starved. So by designing for a unit-DFSP SLA, the hub ensures equitable admission of transfers from even the smallest DFSPs. The vDFSP concept is, at this point, just an observation of a way to fit high-volume outliers into the unit SLA. It is not the only way, or necessarily the best way.

We haven’t characterized the per-DFSP performance degradation when unit DFSP performance design limits (aka the hub DFSP SLA) are approached and exceeded. And we haven’t looked closely at the notification performance architecture or the query architecture to see what other bottlenecks might exist.

Load balancing is possible using the vDFSP concept, but it is not simple to implement. Traffic must be reliably assigned to position bins and the transfer completed within that bin. The number of bins should be dynamic and responsive to presented load approaching the unit DFSP design limit. An algorithm for dynamically growing and shrinking the number of bins, online, while maintaining optimal access times is known as Linear Hashing.

See Litwin from 1980 or so…
http://users.ece.northwestern.edu/~peters/references/Linearhash80.pdf
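For a feel of how that works, here is a compact, illustrative sketch of Litwin-style linear hashing over position bins (simplified split policy; keys stand in for transfers being assigned to bins):

```ts
// Compact linear-hashing sketch (after Litwin, 1980): the number of
// bins grows one split at a time, online, without rehashing everything.
class LinearHashTable {
  private level = 0;
  private splitPtr = 0;
  private readonly initialBins: number;
  private bins: string[][];

  constructor(initialBins = 4, private readonly maxLoad = 4) {
    this.initialBins = initialBins;
    this.bins = Array.from({ length: initialBins }, () => []);
  }

  private hash(key: string): number {
    let h = 2166136261; // FNV-1a, good enough for a sketch
    for (let i = 0; i < key.length; i++) {
      h = (h ^ key.charCodeAt(i)) >>> 0;
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h >>> 0;
  }

  // h_level, falling through to h_(level+1) for bins already split.
  binFor(key: string): number {
    const h = this.hash(key);
    let bin = h % (this.initialBins * 2 ** this.level);
    if (bin < this.splitPtr) {
      bin = h % (this.initialBins * 2 ** (this.level + 1));
    }
    return bin;
  }

  insert(key: string): void {
    this.bins[this.binFor(key)].push(key);
    // Split the bin at the split pointer when average load is exceeded.
    const totalKeys = this.bins.reduce((n, b) => n + b.length, 0);
    if (totalKeys / this.bins.length > this.maxLoad) this.split();
  }

  private split(): void {
    const old = this.bins[this.splitPtr];
    this.bins[this.splitPtr] = [];
    this.bins.push([]); // new bin at index splitPtr + initialBins * 2^level
    const divisor = this.initialBins * 2 ** (this.level + 1);
    for (const key of old) {
      this.bins[this.hash(key) % divisor].push(key);
    }
    this.splitPtr++;
    if (this.splitPtr === this.initialBins * 2 ** this.level) {
      this.splitPtr = 0;
      this.level++;
    }
  }
}
```

The appeal for the vDFSP case is that a bin split touches only the one bin being divided, so bins can grow and shrink in response to presented load without a stop-the-world rehash.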

The vDFSP concept can be implemented opaquely to the participants. In evaluating the PoC architecture, we might ask what additional design alterations we could make that relieve the prepare and position bottlenecks.

It should also be apparent that DFSPs that place a higher load burden on the hub may incur more operating fees than others. And that’s a complex, non-technical calculus to balance the good for the system-using community vs the cost burden of competing institutions with uneven load demands.

— Miller
