Tips for navigating common external integration issues
Today, complexity can be found throughout our day-to-day operations. Using an external integration is no longer just about implementation; it is also about making it maintainable, operable, and a good experience for internal and external users alike. A lot more goes into an integration than merely setting up credentials. In a previous blog post, I talked about how complexity has shifted: instead of managing our infrastructure, we manage the APIs that make cloud computing possible. This complexity now lives largely in the integration layer of our software. Below I break down five areas where complexity shows up when using external integrations or APIs, each of which calls for different solutions depending on the use case.
It's not uncommon for some of the services you depend on to have intermittent outages. But how do you communicate this to your internal or external users? Although tempting, it's essential not to lock down the whole application. Don't disable everything simply because one thing isn't working. That is a lazy way of communicating that some part isn't functioning, and it can be done better.
For example, if your email service is having issues, you don't want it blocking other, more critical services. It's often fine to send emails later. A failing email service could even block CI/CD from running. Timeouts help avoid unexpected lag, and retries reveal whether an unexpected error keeps occurring, but sometimes you need more than that. If a service isn't critical, it should fail silently from the end-user's perspective, while critical services will need to be disabled intelligently.
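As a minimal sketch of that timeout-retry-then-degrade pattern: the helper below tries a non-critical send a few times with backoff, then defers the message to a retry queue instead of raising to the user. `send_email_with_fallback`, the injected `send` callable, and the in-memory `DEFERRED` queue are all illustrative names, not any real provider's API.

```python
import logging
import time

DEFERRED = []  # stand-in for a real durable retry queue


def send_email_with_fallback(send, message, retries=2, backoff=0.1):
    """Try a non-critical send a few times, then defer instead of raising.

    `send` is whatever callable wraps your email provider; any exception it
    raises is treated as a transient failure.
    """
    for attempt in range(retries + 1):
        try:
            send(message)
            return True
        except Exception as exc:
            logging.warning("send failed (attempt %d): %s", attempt + 1, exc)
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
    # Non-critical path: fail silently to the end user, retry later.
    DEFERRED.append(message)
    return False
```

A critical service would do the opposite of the last step: surface the failure and disable the dependent feature rather than quietly deferring.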
Alternatively, you may encounter a scenario where many people request the same resource while it is unresponsive or failing, which can overload systems. Implementing circuit breakers in front of external services can be helpful in these situations. As Martin Fowler puts it:
"You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you'll also want some kind of monitor alert if the circuit breaker trips."
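Fowler's description can be sketched in a few dozen lines. This is a deliberately minimal, single-threaded illustration: the class name, threshold, and reset window are assumptions, and a production breaker would also handle concurrency and emit the monitoring alert he mentions.

```python
import time


class CircuitBreaker:
    """Wrap a protected call, count failures, and trip open past a threshold."""

    def __init__(self, protected, threshold=3, reset_after=30.0):
        self.protected = protected
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: fail fast without making the protected call at all.
                raise RuntimeError("circuit open: call not attempted")
            # Half-open: allow one trial call; a failure re-trips immediately.
            self.opened_at = None
            self.failures = self.threshold - 1
        try:
            result = self.protected(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip; alert/monitor here
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Once tripped, callers get an immediate error instead of piling more load onto the struggling service.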
Lastly, to keep your users off a broken path, you can show that some functionality is unavailable and disabled, rather than letting them hit the error after taking action. Set expectations with your end-user. Beyond just letting a user know some functionality is unavailable, a link to a status page is also helpful here.
To an end-user, rate limiting is often invisible. Users do not have a clue how close your organization or account is to hitting rate limits. As a result, we have to be careful about letting users take repeated actions that could cause rate-limiting issues with the external services and integrations we use.
There is rarely a one-size-fits-all strategy here. It depends on what your rate limits are, which varies by service, and on the types of actions being taken. Sometimes hitting a limit may even signal an incident in the external service, so keep that in mind. When you throttle users for rate-limiting reasons, it is crucial to tell them what is going on and to set clear expectations for when the integration will be available to them again.
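One common way to stay under a vendor's limit is a client-side token bucket that throttles your own outbound calls, and reports how long until the next call is allowed so you can tell the user. The class below is an illustrative sketch; the capacity and refill numbers are placeholders you would tune to the actual documented limits of each service.

```python
import time


class TokenBucket:
    """Client-side throttle: spend a token per call, refill steadily."""

    def __init__(self, capacity=10, refill_per_sec=1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self):
        """Return (allowed, seconds_until_available)."""
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Denied: report the wait so the UI can set expectations.
        wait = (1 - self.tokens) / self.refill_per_sec
        return False, wait
```

The second return value is the piece that matters for user experience: instead of a silent failure, the UI can say "try again in N seconds."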
Today there are six or more popular authentication types for integrations, including basic and bearer authentication, API keys, OAuth (with multiple flows), JSON Web Tokens, OpenID, and SAML. That doesn't even account for whether the authentication method covers a whole team or organization or just a single user. That's a lot to manage!
You have to be strategic about managing this safely while not restricting people so much that they can't get their work done. User-based authentication is helpful for auditing exactly who is taking which actions, but not all platforms support authentication on a per-user basis; sometimes it has to be done at the team level. One trick is to issue a key or token for each user, so that actions can be attributed to an individual rather than to a credential shared by the whole team.
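The per-user token trick can be sketched as a small issuing store that maps each token back to its owner, so audit logs attribute actions to a person and revoking one person doesn't rotate the whole team's credential. This is an in-memory illustration with assumed names (`TokenVault`, `issue`, `attribute`); a real system would store only token hashes in a secrets manager and support rotation.

```python
import secrets


class TokenVault:
    """Issue one token per user so actions stay attributable."""

    def __init__(self):
        self._by_token = {}  # token -> user_id

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)  # cryptographically random
        self._by_token[token] = user_id
        return token

    def attribute(self, token):
        """Map a presented token back to the user who owns it, for auditing."""
        return self._by_token.get(token)

    def revoke_user(self, user_id):
        """Cut off one person without touching anyone else's credentials."""
        self._by_token = {
            t: u for t, u in self._by_token.items() if u != user_id
        }
```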
There's a multitude of things that can go wrong when you have users on slower connections. The worst thing to do, though, is to let the whole page hang. As Phil Sturgeon says in Surviving Other People's APIs, "when a page in your application appears empty for even a fraction of a second, it can make your app seem like it is unresponsive, or not working correctly, or just sluggish."
Instead, you can use progressive data loading to let the faster parts of the page load while communicating to users that other elements are still loading or have timed out. Alternatively, it may make sense to load from top to bottom. It will depend on the situation, but the outcome to prevent above all is a total failure that leaves the user unsure of what went wrong. That often prompts them to reload, which may trigger even more retry attempts.
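On the server side, one way to sketch progressive loading is to resolve each page section concurrently with its own timeout, so a slow or broken integration degrades to a placeholder instead of stalling everything. The function names (`load_section`, `render_page`) and the injected `fetch` coroutines are assumptions for illustration.

```python
import asyncio


async def load_section(name, fetch, timeout=2.0):
    """Resolve one page section independently of the others."""
    try:
        return name, await asyncio.wait_for(fetch(), timeout)
    except asyncio.TimeoutError:
        return name, "timed out - retry this section"
    except Exception:
        return name, "unavailable"


async def render_page(sections):
    """Fast sections aren't blocked by slow ones; each degrades alone."""
    results = await asyncio.gather(
        *(load_section(name, fetch) for name, fetch in sections.items())
    )
    return dict(results)
```

Because each section reports its own state, the user sees which part failed and which parts are fine, rather than one opaque blank page.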
Finally, let's discuss decoupling services from vendors. Your glue code should not be so tightly glued to the rest of your services. It's important not to litter an application with HTTP calls to external services and integrations. Doing so makes the code harder to test, makes changing vendors harder, and means a change can have far-reaching system impacts that decoupling would avoid. I recommend at least a thin wrapper around API calls to alleviate this problem.
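A thin wrapper might look like the sketch below: one class owns the vendor's endpoint, auth header, and payload shape, and the rest of the app only calls `send`. The `EmailGateway` name, the `/v1/messages` endpoint, and the payload fields are invented for illustration, not any real provider's API; the injectable `opener` is what makes it testable and swappable.

```python
import json
import urllib.request


class EmailGateway:
    """Isolate one vendor's HTTP API behind an interface the app owns."""

    def __init__(self, base_url, api_key, opener=urllib.request.urlopen):
        self.base_url = base_url
        self.api_key = api_key
        self._open = opener  # injectable for tests and vendor swaps

    def send(self, to, subject, body):
        # Vendor-specific details (URL, auth, payload) live only here.
        req = urllib.request.Request(
            f"{self.base_url}/v1/messages",
            data=json.dumps(
                {"to": to, "subject": subject, "body": body}
            ).encode(),
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )
        with self._open(req) as resp:
            return resp.status == 202
```

Switching vendors then means rewriting one class, not hunting HTTP calls across the codebase, and tests can pass a fake opener instead of hitting the network.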