
Contract Validation

Published Feb 18, 2023

Contract validation issues are among the most frequent code-change-driven incidents I have seen in my experience. They come in many shapes and forms, and across many different kinds of systems.

Let's begin with the shapes and forms of these issues, and then move on to the different systems where they appear.

Schema Based Issues

Most contract issues are simply schema violations caused by using a schema-free format like JSON.

  • Data type mismatches:  the type of the same field changes, e.g. from integer to float, or string to integer.
  • Missing essential fields:  the client expects a field to be set, but a new code change removed the field.
  • Loose Decoding Rules:  some languages' JSON unmarshalling will happily decode an order_id field like {"order_id": "1"} into an integer field, but if the server side suddenly decides to change it to a UUID, clients start breaking (see the sketch after this list).
  • Integer Size:  decoding a user id into int32 might work for a while, but if the client doesn't adopt the bigger integer size along with the server, or if a service doesn't notify its clients about an id size change, you are in for a surprise.
  • Field Renaming:  some service-side tests might still pass after a field is renamed, but the change is backwards-incompatible.
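
For the loose-decoding case, here is a minimal sketch of how it plays out in Go. Plain encoding/json only coerces a JSON string into an int field when you opt in with the ,string tag, but the failure mode is the same in languages that coerce by default: the quoted number decodes fine right up until the server switches to UUIDs.

package main

import (
  "encoding/json"
  "fmt"
)

// Order opts into decoding order_id from a JSON string into an integer,
// the kind of loose decoding rule that works until it doesn't.
type Order struct {
  OrderID int `json:"order_id,string"`
}

func main() {
  var a Order
  // Works today: the quoted number "1" is silently coerced into the int field.
  fmt.Println(json.Unmarshal([]byte(`{"order_id": "1"}`), &a), a.OrderID)

  var b Order
  // Breaks the moment the server starts sending UUIDs.
  fmt.Println(json.Unmarshal([]byte(`{"order_id": "0b92cb51"}`), &b))
}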

Data Contract Issues:

Beyond the schema, contract tests are usually contextual to the data, but even here there are common patterns.

Default Value Nuisance:

For example, a struct integer field's default value is 0 if it is not explicitly set, so one common pattern is to use another field to explicitly indicate whether the integer field's value is non-empty. (A pointer could also be used to represent a nil value, but in a few cases the separate-field pattern is used as well.)

type BatchUpdateResponse struct {
  Success bool `json:"success"`
  FailedItems int `json:"failed_items"`
  FailureReason string `json:"failure_reason"`
}

For the above example, the field "success" indicates whether the request was successful; if it is a failure, then FailedItems and FailureReason should be set. Without contract testing or runtime contract validations, nobody realises until a significant production outage that they can't precisely debug why customers are seeing failures, because the failure reason is never populated. The worst-case scenario is when the request actually succeeds but the "success" field is never set, because the code returns pre-emptively and "success" is only assigned at the end of the function rather than at the start; we then end up scratching our heads over why a request "failed" without setting a failure reason.
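
A minimal sketch of that early-return trap, with a hypothetical Item type and update helper:

// processBatch only assigns Success at the very end, so any pre-emptive
// return leaves it false with no FailureReason.
func processBatch(items []Item) BatchUpdateResponse {
  resp := BatchUpdateResponse{}
  if len(items) == 0 {
    // Pre-emptive return: the request "succeeded", but Success is still
    // false and FailureReason is empty, so callers see an unexplained failure.
    return resp
  }
  for _, item := range items {
    if err := update(item); err != nil { // update is a hypothetical helper
      resp.FailedItems++
      resp.FailureReason = err.Error()
    }
  }
  resp.Success = resp.FailedItems == 0 // only assigned at the very end
  return resp
}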

Invalid Data:

As the heading suggests, these issues are straightforward to understand, i.e. the data as a whole is not consistent with itself. For example, we receive a response containing an invoice where the total amount of money doesn’t match the sum of individual items.

While static contract tests could help, they can't cover all the cases needed to avoid production issues. The best possible approach is to set up runtime contract validations and report violations to the observability tool, so that a deployment can be aborted at the canary stage.
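
For the invoice example above, a runtime check might look like the sketch below; the Invoice type and the metrics client are hypothetical stand-ins for whatever your service and observability tool provide.

// validateInvoice is a runtime contract check: the invoice total must equal
// the sum of its line items. Violations are reported rather than silently
// ignored, so a canary deployment can be aborted before a full rollout.
func validateInvoice(inv Invoice) error {
  var sum int64
  for _, item := range inv.Items {
    sum += item.AmountCents
  }
  if sum != inv.TotalCents {
    metrics.Increment("contract_violation", "type", "invoice_total_mismatch") // hypothetical metrics client
    return fmt.Errorf("invoice %s: items sum %d != total %d", inv.ID, sum, inv.TotalCents)
  }
  return nil
}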

Contract Validation Issues in Systems

Setting up proper contracts is often recommended for APIs, but HTTP APIs are not the only systems whose outages can be prevented with proper contracts.

Event-Driven Systems

As you decouple systems/microservices for faster releases, you move towards event-driven microservices, where one service publishes events and other services listen and process them. Kafka is the most popular backend for driving event-driven workflows. Suppose the producer breaches the contract or the schema; on a decoding error, the consumer has two options. The first is to ignore the event and commit the offset as processed; this is the best-case scenario, where the event is merely lost. The second is to not commit the offset, in which case the consumer thread (say, a PHP worker) keeps crashing, or the application crashes altogether; you then repeatedly process the same invalid item and stop processing other items entirely. Until a human intervenes and explicitly removes it, or a new code change gets shipped, that poison pill will block processing for that Kafka topic. Setting up consumer-driven runtime contracts & schema testing on the producer side is the best solution to this problem (a consumer-side sketch follows below). Better yet, use something like protobuf with the necessary backward-compatibility linters rather than JSON.
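
A sketch of what the consumer side can do at runtime when the contract is breached: report the violation and park the message on a dead-letter topic so the partition keeps moving. The Message and OrderEvent types and the produce/process/metrics helpers are hypothetical, and the event is assumed to be protobuf-encoded.

func handleMessage(msg Message) error {
  var event OrderEvent
  if err := proto.Unmarshal(msg.Value, &event); err != nil {
    // Contract breach: surface it in the observability tool and park the
    // raw message on a dead-letter topic instead of crash-looping on a
    // poison pill or silently dropping it.
    metrics.Increment("contract_violation", "topic", msg.Topic)
    if dlqErr := produce("orders.dead-letter", msg.Value); dlqErr != nil {
      return dlqErr // could not park it: do not commit the offset
    }
    return nil // safe to commit: the bad event is preserved elsewhere
  }
  return process(event)
}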

Configuration Systems

Config-change outages are often far more frequent than code-change-triggered outages; this stems from the fact that config is not treated as code, but rather as a loose blob of stuff that anybody can change to whatever they want.

The same principles we apply to code need to be applied to configuration as well when it comes to schema & contract. At an infrastructure level, this means the configuration must be tested against the schema provided by the code before a config change is deployed. The same is valid for the data. Often, a backend is enabled by setting enabled: true, but the necessary configuration required to actually run that backend is not tested. Many configuration libraries, such as "viper", treat config as dynamically typed; a better approach is to use something like "envconfig", where the configuration is decoded into a strongly typed struct. On top of this, data validation can be added to ensure that the configuration contract holds along with the schema.
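
A minimal sketch of that approach using github.com/kelseyhightower/envconfig; the CACHE_* variables and the cross-field rule are made up for illustration.

package config

import (
  "errors"

  "github.com/kelseyhightower/envconfig"
)

// Config is decoded into a strongly typed struct instead of being read
// field by field out of a loose blob.
type Config struct {
  CacheEnabled bool   `envconfig:"CACHE_ENABLED" default:"false"`
  CacheAddr    string `envconfig:"CACHE_ADDR"`
}

// Load enforces both the schema (types) and the contract (cross-field rules)
// before the process starts serving traffic.
func Load() (Config, error) {
  var cfg Config
  if err := envconfig.Process("myapp", &cfg); err != nil {
    return cfg, err // schema failure: wrong type or missing value
  }
  if cfg.CacheEnabled && cfg.CacheAddr == "" {
    return cfg, errors.New("CACHE_ENABLED is true but CACHE_ADDR is not set")
  }
  return cfg, nil
}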

Caching Systems

Like queue/worker systems, during local or dev-environment testing the contract failure doesn't show up, because no cache has been prepared by the old code, or the old code is not reading the new cache. As soon as you deploy to production, the best-case scenario is that the new code starts failing because of the old cache; that is still safe, and you can recover your system by reverting. But god forbid the new code starts caching new values: then your old code starts failing too, and you can use neither the new code nor the old code.

On top of using a proper schema and setting up contract tests for the data inside the cache, it is very important to treat cache decoding errors as actual errors rather than as cache misses.

// pseudo code
func (c *Cache) Get(key string, value interface{}) (cacheMiss bool) {
  bts, err := redis.Get(key)
  if err != nil && errors.Is(err, redis.NilErr) {
    return true
  }
  err = json.Unmarshal(bts, value)
  if err != nil {
    // bug discussed below: a decoding (or redis network) error is
    // reported as a plain cache miss
    return true
  }
  return false
}

func getOrderData(orderID int) *Order {
  data := Order{}
  miss := cache.Get(fmt.Sprintf("order_data_%d", orderID), &data)
  if !miss {
    return &data
  }
  data = database.Get(orderID)
  cache.Set(fmt.Sprintf("order_data_%d", orderID), data)
  return &data
}

In the above code, the main issue is that a decoding error or a redis network error is treated as a cache miss, falling back to the database. This not only increases the load on the master, but also starts setting newly corrupted values inside the cache. It is better to treat contract failures in the cache as outright failures than as cache misses. You could still implement some error budgeting so that redis errors are treated as misses without overloading the database, but treating contract errors as cache misses can leave you with a corrupt cache in your redis.
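
A sketch of the safer shape, in the same pseudo-code spirit as above, where only a genuine redis nil counts as a miss and everything else surfaces as an error:

// pseudo code
func (c *Cache) Get(key string, value interface{}) (cacheMiss bool, err error) {
  bts, err := redis.Get(key)
  if err != nil {
    if errors.Is(err, redis.NilErr) {
      return true, nil // genuine miss
    }
    return false, err // redis/network failure: not a miss
  }
  if err := json.Unmarshal(bts, value); err != nil {
    // cached data breaches the contract: report it instead of
    // quietly repopulating the cache
    return false, fmt.Errorf("cache contract violation for %s: %w", key, err)
  }
  return false, nil
}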

Feature Store:

Most storage systems are built on top of some schema & contract; for this very reason, corrupt data is very painful to deal with. For any high-scale company, or one with a large number of microservices, providing a good feature store is essential for powering ML, business rule engines, personalisation, or features that are only available to a few cohorts. Because it is a generic feature store, indexing is driven by ingesting data generated by an analytics system, which crunches the data and sends it off to be indexed inside the feature store, while a separate set of systems reads this data. Because of this intermediate layer, the readers of the data can't dictate schema or contract validation before the data is ingested into the database, as the feature store is a centralised data store.

Like a corrupt cache, this is a painful outage to handle if bad data is ingested; your users suddenly lose their personalised view of your site and start seeing it like any normal user. Until all the data is reindexed with the correct schema, all personalisation is lost. The worst case is if this data starts triggering panics, or translates into a critical error at the reader level.

writes: [data analytics job] ⇒ [feature store]

reads:  [services] ⇒ [feature store]
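
Since the readers can't gate ingestion, the least they can do is validate at read time and degrade gracefully; a sketch below, where featureStore (returning raw bytes), Personalisation, defaultPersonalisation, and the metrics client are all hypothetical.

// getPersonalisation treats a decoding failure in the feature store as
// "no personalisation for this request" while still reporting the contract
// violation, so bad ingested data degrades the experience instead of
// panicking at the reader level.
func getPersonalisation(userID string) Personalisation {
  raw, err := featureStore.Get("personalisation_" + userID)
  if err != nil {
    return defaultPersonalisation()
  }
  var p Personalisation
  if err := json.Unmarshal(raw, &p); err != nil {
    metrics.Increment("feature_store_contract_violation")
    return defaultPersonalisation()
  }
  return p
}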

Monolith to Microservice Migration:

Most teams start with a monolith and transition towards microservices as the scale increases. Contract testing and shadow contract testing are crucial for understanding how the current data is structured and for properly replicating that in the new microservice. During the migration, these contract tests also ensure that you keep satisfying the contract throughout, instead of finding out in production and reverting. A minimal shadow-comparison sketch follows below.
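
One way to do the shadow part: serve from the monolith, replay the same request against the new service in the background, and report any contract drift. Everything named here (monolith, microservice, normalise, metrics, the Request/Response types) is hypothetical.

// shadowCompare answers from the monolith and, in the background, replays
// the request against the new microservice and reports contract drift.
func shadowCompare(req Request) Response {
  oldResp := monolith.Handle(req)
  go func() {
    newResp := microservice.Handle(req)
    // normalise strips fields that are allowed to differ (timestamps, ids)
    if !reflect.DeepEqual(normalise(oldResp), normalise(newResp)) {
      metrics.Increment("migration_contract_mismatch", "endpoint", req.Path)
    }
  }()
  return oldResp // callers keep getting the monolith's answer during migration
}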

Scheduling Systems:

Probably the best one of all. Imagine a feature like placing an order for tomorrow, or sending a new-year notification. These things require pre-scheduling events or triggers, and schema & contract issues in such systems will hit you out of the blue.

Solution:

A good starting point is to use a strongly typed schema like protobuf for all these systems, i.e. not just for gRPC but for caching, queueing, event-driven systems, and feature stores. The next step would be to set up some basic static analysis, like backward-compatibility checks, for all these systems using protobuf. And finally, setting up static contract validations will help detect some issues at the PR level itself; for more complex scenarios, runtime contract validation can help detect issues during pre-prod automated integration tests or at the canary stage.

Conclusion:

Designing systems without a schema or contract might be the easiest or quickest way to ship a feature, but as you scale, moving towards robust schema-aware systems and implementing schema validation (static & runtime) and contract-validation techniques is essential to avoiding painful outages.

Original: https://blog.eightnoteight.dev/p/contract-validation
