API Proxy Implementation Question (Sync vs. Async)

A bit of background: We are in the process of replacing the legacy ecommerce service with a highly modular service we’re calling Commerce Coordinator.

To start decoupling the legacy ecommerce service from other services and micro-frontends, we’re building thin shims in the Coordinator and routing requests there; for now, these shims will basically call the relevant legacy APIs.

One of the first of these will be decoupling the order history MFE, since it is relatively self-contained and lower-risk than other ecommerce functions. The Coordinator shim for this will essentially be a proxy for the orders API, which brings me to my question:

Is there a better pattern or technology for implementing this proxy than simply (synchronously) calling the legacy API and returning the results?
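For concreteness, the naive version I have in mind boils down to something like this (the URL and names here are made up, and `get_json` stands in for a real HTTP call such as `requests.get(url, params=params).json()`, injected so the sketch is self-contained):

```python
# Illustrative sketch of the "keep it simple" option: a shim that blocks
# on the legacy API and passes the payload straight through.
LEGACY_ORDERS_URL = "https://legacy.example.com/api/v2/orders/"  # hypothetical

def order_history_shim(username, get_json):
    """Synchronous pass-through: fetch from the legacy API, return as-is."""
    return get_json(LEGACY_ORDERS_URL, {"username": username})

# Stub standing in for the real HTTP client:
def _stub_get_json(url, params):
    return {"count": 0, "results": [], "requested_for": params["username"]}

payload = order_history_shim("alice", _stub_get_json)
```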

Might Django Channels allow the order data to be fetched asynchronously and returned to the MFE? I haven’t used Channels, and have only just started reading about it.

Any other technologies or patterns that folks have found helpful for decoupling a request and the returned data? Or should we stick to “Keep It Simple” here?

Forward-looking statements: The goal of these shims is to allow smooth transitions from the legacy service to an ecommerce backend vendor or other service; once that shift has been made, we will likely consider moving to vendor-hosted or vendor-specific frontends for things like order history, so this proxy will likely be a short- to medium-term solution.

Good summary @bjh. A reminder that synchronous calls have caused cascading failures across services in the past, so if we decide it is the right solution here, we should protect it behind a Circuit Breaker.

I am interested in @dave’s (or anyone else’s) current knowledge about Django Channels.

A couple of questions just to get myself oriented:

  • Is the legacy order info in the ecommerce service?
  • Is the goal to have the order history MFE work with the Commerce Coordinator interface instead of the existing legacy API directly?
  • Do I remember correctly that you are still trying to aim for no data storage at the commerce-coordinator?

If this were a read-only endpoint, then data replication might make life operationally easier. If we replicate the data into the new service (and translate it to whatever format makes sense there), then an operational failure in the legacy system will mean that users see stale results instead of failing entirely. That being said, my understanding is that Commerce Coordinator is business-model-free, making this unworkable.

I don’t think Django Channels is really applicable here, since it’s more about keeping long-running connections open, like for WebSockets. I guess it might be useful if we have long-running sessions with a user regarding this data, but that doesn’t seem like the likely request pattern.

If the Commerce Coordinator is more or less database model-free and is serving as a proxy to other services, then you might look into async views. We’d have to deploy it as an ASGI app to actually take advantage of the async views though, which probably means uvicorn, since that can be installed as a gunicorn worker type. We’d also probably want to use an async HTTP client like httpx.
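To make that concrete, here is roughly the shape an async proxy would take. This is only a sketch: in a real deployment it would be a Django async view served under ASGI (gunicorn with uvicorn workers), and `fetch` would wrap an `httpx.AsyncClient` call; the URL and all names are hypothetical.

```python
import asyncio

LEGACY_ORDERS_URL = "https://legacy.example.com/api/v2/orders/"  # hypothetical

async def proxy_order_history(username, fetch):
    """Await the legacy API without tying up the worker while it waits.

    `fetch` is any awaitable-returning callable; in practice it would be
    something like an httpx.AsyncClient GET plus .json().
    """
    return await fetch(f"{LEGACY_ORDERS_URL}?username={username}")

# Stub client standing in for httpx in this sketch:
async def _stub_fetch(url):
    return {"url": url, "orders": []}

result = asyncio.run(proxy_order_history("alice", _stub_fetch))
```

The payoff is that a slow legacy API only stalls the one coroutine awaiting it, not the whole worker process.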

I don’t think any of that is really a substitute for circuit breakers in the event of outright failure, but it will help mitigate slow-downs and improve server utilization. Though honestly, I’m guessing the throughput rate here isn’t all that high, meaning the cost savings are probably not significant.

Another consideration if Commerce Coordinator is doing a lot of proxying would be to segment the workers so that the proxies to different parts of the system are spun up using different sets of gunicorn processes. At the end of the day, if the legacy orders system is down, we don’t really care if the part of Commerce Coordinator that proxies to that legacy orders system is also down, so long as it doesn’t bring the rest of Commerce Coordinator down with it. Though I suppose that only really works if the requests are relatively isolated and you don’t have CC requests that involve calling three other services on a regular basis.

Those are a few ideas, but please don’t take them as strong recommendations. I’ve looked over some of those docs that you’ve very kindly shared and the ADRs in the repo, but I feel like I still don’t really grok the Commerce Coordinator’s model-free philosophy, and how to effectively work within it. It makes me nervous from an operational point of view, and I feel like I’m missing some fundamental concepts (e.g. what happens to the refund history when a payment processing service changes?).

  • Is the legacy order info in the ecommerce service?

Presently, yes.

  • The goal is to have the order history MFE, work with the commerce coordinator interface instead of the existing legacy API directly is that the goal?

Yes, and then that API will change which backend it gets the data from.

  • Do I remember correctly that you are still trying to aim for no data storage at the commerce-coordinator?

Yes, the Coordinator should connect services together; it should not implement new functionality nor be a source-of-truth.

if we decide it is the right solution here, we should protect it behind a Circuit Breaker

Do you have a reference to a good implementation of this pattern in Django? (If not, I can try to search for or write one.)

How might the implementation need to be adapted to work with async views? (It seems to me like it would require some sort of global-state “recent error counter” for each external API.)
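If I do end up writing one, I’m picturing something along these lines. This is a minimal sketch, not production-ready: the thresholds and names are made up, and the instance would need to be shared state, one per external API (which is exactly the part that needs care under async views).

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative, not production-ready).

    After `max_failures` consecutive failures the circuit opens and calls
    fail fast for `reset_timeout` seconds; after that, one probe call is
    let through (half-open) and a success closes the circuit again.
    """
    def __init__(self, max_failures=5, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # success closes the circuit
        return result
```

The async adaptation would be much the same shape, with `call` awaiting an async callable and the counter guarded against concurrent updates.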

To Dave’s point above, I think even if the coordinator is not the “source-of-truth”, you may still want to have some data replication locally. Operationally, having the double bookkeeping of external service + coordinator will be useful for debugging and understanding distributed system issues that might come up.

One example of this: how would we differentiate between the following two scenarios?

Scenario 1 (Data never got into our system)

  • Commerce Coordinator goes down.
  • The external service sends requests to CC but they fail.
  • External service marks it as failed.

Scenario 2 (External service never gets a confirmation)

  • The external service sends a request.
  • We process the request.
  • We send the response to the external service.
  • They never receive our response and mark the transaction as failed due to timeout.

In both cases, the request looks like it failed to the external service, but in one case we have successfully done all the work and in the other we have done none of the work. Our response to having to fix the failed transaction would be different in the two cases, but how would we tell them apart?

One answer (simply re-run the process and take the same actions on all downstream services) would work if we could guarantee that all downstream requests could be made idempotently. Without that guarantee, we’d have a difficult time telling the two scenarios apart without some local data at the coordinator to help us understand what happened.

In both cases, the request looks like it failed to the external service, … but how would we tell them apart?

Debugging and understanding such a loosely-coupled and distributed system is high in our minds, and we plan to implement logging and tracing to help with exactly this sort of thing (see ADR-3, for example).

One answer (simply re-run the process and take the same actions on all downstream services)…

Replay-ability is an interesting idea, though; I’d love to talk more about that sometime.

I’m not convinced that replication of business objects, like order history entries as such, is the right answer for either of these, even if we do end up storing some data in the form of events or even payloads.

I do not have a good reference implementation.

First, a quick summary of the ideas:

  • async views
  • segment the workers (separate cluster)

Both of these are interesting; we can discuss them further, including with our SRE team.

Regarding grok-ing the proposed model-free philosophy, I’ve also had trouble with this, but I’ve found it difficult to discuss in the abstract, so I find it useful that we are discussing a specific use case here. However, I understand this use case to still be unique, in that it sounds both transitional and temporary. If I understand correctly, the new version of the proxied solution is also not meant to be permanent, since we think the MFE will transition as well. If we thought it would be more permanent, I might think about it differently.

Another, somewhat more concrete use case: doesn’t this system ever have to create reports that join data across the sources? Like “revenues of the courses that launched this quarter”? Or “income of self-paced vs. instructor-paced courses”?

A snippet from the code today:

Note: I’m pretty sure the comment was made tongue-in-cheek, but my point is the date. :stuck_out_tongue:

Transitional periods are never as short as we’d like them to be. Maybe the scenario I’m concerned about just won’t happen. You folks obviously have a much better handle on the operations side and current issues with this area than I do. Intuitively, it just seems riskier to me, and I’m wary of adding significant code or operational complexity, even for what is meant to be a temporary bridging solution. We did that with the XModule → XBlock migration, and that middle layer bled us for years because there was always something more important than fixing it. Instead, we constantly added to the complexity in a patchwork of fixes that we’re only finally digging ourselves out of ten years later. The time we thought we’d have to build the replacement got consumed by putting out fires in the existing system.

Again, maybe my complexity concerns are overblown and this is the simpler, less risky approach. If so, that’s great. But if not, I would urge you folks to assume that any solution you put out there is going to lose all funding for improvement two weeks after it goes live and starts doing its job properly, and you will only ever get time to work on it when it’s experiencing dramatic failures in production. If you have to halt all new development at that point, is that an okay place to be? Is it still better, maintenance-wise, than the system that exists today?

If it’s not simpler or easier than what exists today, that is, if this represents a short-term increase in complexity, is there some smaller, more maintainable piece of it that could be built that would still yield similar benefits? For instance, if I had it all to do over again, I think we could have introduced XBlocks in a way that was smaller and more limited (e.g. no parent/child relationships), that would have still covered 90% of the use case at a fraction of the complexity. Is there any version of that in this situation?