Thinking About Error Codes

Last week I had a discussion with Spoons (LightStep Co-founder & CTO) about a PR I submitted for him to review. He is usually way too busy to look at code but I was modifying the error codes returned by our core authentication code and I knew that he couldn’t resist a PR that dealt with errors and authentication. The nerd snipe was too easy.

The goal of the PR was to allow a downstream client of the Authentication service to conditionally respond to errors based on what they signified. Specifically, we wanted the client to terminate if it received an error saying that it was unauthenticated while gracefully handling errors such as timeouts and upstream service unavailability.

Here is a simplified version of the code I was working with:

// Upstream
func AuthenticateKey(key string) error {
	// Is the key non empty?
	if len(key) == 0 {
		return nil, errors.InvalidArgument("zero-length key is not valid")
	}

	// Is the key signed by LightStep?
	parsedToken, keyClaims := parseTokenValidateSignature(key)
	if !parsedToken.Valid {
		return nil, errors.InvalidArgument("token is not a valid JWT")
	}
	
	// Are all the essential fields set?
	if _, ok := parsedToken.Header["kid"].(string); !ok {
		return nil, errors.InvalidArgument("invalid KID in token headers")
	}
	
	// Does it contain a valid id?
	if len(keyClaims.GetID()) == 0 {
		return nil, errors.InvalidArgument("token did not contain a valid ID")
	}

	// Is this key currently active, or has it been revoked?
	if isRevoked(keyClaims.GetID()) {
		return nil, errors.Unauthenticated()
	}

	return nil
}

errors.InvalidArgument() and errors.Unauthenticated() are wrappers around a set of Google defined RPC error codes. At LightStep, we have standardized on these error codes as they provide a shared vocabulary for all of our service to service communication.

In my first go at solving the problem, I decided to coerce all error values generated by the function into errors.Unauthenticated(). This would allow my downstream client to explicitly check for that error value when deciding how to respond.

// Upstream
func AuthenticateKey(key string) (err error) {
	// All internal errors in this function should be converted to Unauthenticated
	defer func() {
		if err != nil {
			err = errors.Unauthenticated()
		}
	}()

	// no other changes
	... 
}

// Downstream
func CheckAuth(key string) {
	err := AuthenticateKey(key string)
	if err != nil {
		if errors.IsUnauthenticated() {
			// Crash because this is irrecoverable
			...
		} else {
			// Gracefully handle this case
			...
		}
	}
}

Given my simplified example, you might wonder why I didn’t just check for errors.Unauthenticated() and errors.InvalidArgument in the client code instead of coercing all errors to errors.Unauthenticated(). My concern with this approach was with future maintainability. I was afraid that someone else might edit this code and use a different error code value without realizing the impact it would have on my downstream client. Oftentimes you can solve for this with processes like code ownership but this code didn’t have any owners in charge of reviewing changes. Another reality is that the real authentication code is much hairier than the simple example I’ve presented. In truth, it includes multiple function calls that descend down their own call stacks, often in different files. By coercing all of the errors at the API boundary I could limit the possibility that a new error code would be introduced that would break assumptions made by clients.

In his review, Spoons acknowledged these arguments and agreed that the approach was reasonable but he offered an alternate one that changed the way I thought about errors. He suggested conditionally handling errors based upon a higher level classification: client errors and server errors.

Client errors are those where the fault is on the client, and server errors indicate the the problem is with the server. This classification of errors is by no means a new idea – HTTP status codes are already broken up in this way with 4xx status codes representing client errors and 5xx status codes representing server errors. If we fit our Google defined RPC error codes into this classification scheme, we could build generic error handling logic that didn’t require checking the specific error code each time.

The following code does just that; it classifies RPC error codes into higher level classes such as ClientError and ServerError. You’ll notice that some of the RPC error codes don’t neatly fall into either of these two categories.

package errors

import (
	"google.golang.org/grpc/codes"
)

// A Class is a set of types of outcomes (including errors) that will often
// be handled in the same way.
type Class string

const (
	Unknown Class = "0xx"
	// Success represents outcomes that achieved the desired results.
	Success Class = "2xx"
	// ClientError represents errors that were the client's fault.
	ClientError Class = "4xx"
	// ServerError represents errors that were the server's fault.
	ServerError Class = "5xx"

	// CanceledClass represents non-success, non-error cases that were
	// marked as cancelled, either internally or externally.
	CanceledClass Class = "canceled"
)

// ErrorClass returns the class of the given error, which can then be used
// as a suffix for outcome metrics and as a tag. For example,
//    metrics.Count(..., "response_code:" + errors.ErrorClass(err), ...)
func ErrorClass(err error) Class {
	switch Code(err) {
	// Success or "success"
	case codes.OK:
		return Success

		// Client errors
	case codes.InvalidArgument, codes.NotFound, codes.AlreadyExists,
		codes.PermissionDenied, codes.Unauthenticated, codes.FailedPrecondition,
		codes.OutOfRange:
		return ClientError

		// Server errors
	case codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted,
		codes.Unimplemented, codes.Internal, codes.Unavailable, codes.DataLoss:
		return ServerError

	case codes.Canceled:
		return CanceledClass

		// Not sure
	case codes.Unknown:
		fallthrough
	default:
		return Unknown
	}
}

With this in place, I just needed to make sure that the errors returned from AuthenticateKey were all client errors. This way my downstream client could branch based upon the Class of the error.

//Downstream
func CheckAuth(key string) {
	err := AuthenticateKey(key string)
	if err != nil {
		if errors.ErrorClass(err) == errors.ClientError {
			// Crash because this is irrecoverable
			...
		} else {
			// Gracefully handle this case
			...
		}
	}
}

Nice! This solution is cleaner than checking the explicit error type. It also helps address the concern I had about people modifying the error codes returned from AuthenticateKey. As long as they don’t use a server error where they should have used a client error all should be fine. This type of thing is much easier to catch in the code review process as well.

Sometimes knowing whether to return a client or server error is easier than knowing the exact type of error to return. The RPC error codes package has good documentation but I’ve found that sometimes it still isn’t quite enough. When this happens, ask yourself: “If this error were to occur, what action should the client take in response?” Should they get a new token? Should they bail and ask an admin for authorization? Should they wait and then try again? The error code that you return should provide the user with the best chance of responding correctly. Consider the example from before:

// Upstream
func AuthenticateKey(key string) error {
	// Is the key non empty?
	if len(key) == 0 {
		return nil, errors.InvalidArgument("zero-length key is not valid")
	}
	...

If we were to return errors.Unauthenticated() here, the caller could reasonably respond by minting a new token. However, if key is empty, the caller shouldn’t respond by minting a new one, they should respond by making sure they are correctly passing one in! In this case, errors.InvalidArgument() is more helpful to the client because it indicates that there is something inherently problematic with the key (or lack thereof) being used.