This topic is a little on the complicated side, and is written from the perspective of a software system designer, so bear with me.
When building a service or website, whether you’re Google or a small startup, everybody has to worry about the data they manage. In a way, you can consider most websites to just be an interface by which your data, which they store, is changed. Tangentially, this way of thinking about programs is called data-oriented programming, but we’ll write about that another day.
Most data on a particular platform has valid and invalid states.
Sometimes these validations are simple to describe.
For example, on most websites, 10000 BC is an invalid birth year, and &^%(*&& is an invalid name.
Sometimes, however, validation can get a lot more complicated than that. And unfortunately, input from a user must always be validated before it is used or served up to other users, so there’s no way around figuring out when and how to do data validation for your web service.
If web services didn’t validate API requests properly, you can only imagine the garbage data that would fill your favorite social media platforms. Crashes and bugs would abound as your browser attempted to parse invalid and even malicious information from the backend servers.
Let’s say my web service, Canvas (the popular Learning Management System), exposes a web API for user browsers to interact with and make requests.
There are two types of validation I must perform on user requests: syntactic validation and semantic validation.
You can think of syntactic validation as making sure you’re speaking English, and using correct spelling and grammar. Semantic validation is interpreting the meaning of the sentence and making sure it’s a request we want to support.
In practice, the former is done almost entirely automatically; when building a web service, you’ll describe the expected structure of your request using Swagger, OpenAPI, or just plain json-schema. In this way, the documentation you create for your API can also be applied to existing endpoints as a one-liner validation of the request structure.
If ClassMeeting.yaml exists as a valid json-schema declaration of the expected request body, validation within your endpoint code might look like this:
function patch(req) {
  // will throw an error if a syntactically invalid body is provided
  verifySchema(req.body, 'ClassMeeting');
}
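For intuition, here is roughly what a helper like verifySchema does under the hood. This is a hand-rolled sketch, not a real json-schema implementation: the `schemas` registry, its simplified shape, and the `location` field are assumptions made up for illustration. In a real service you would hand this job to an actual json-schema validator.

```javascript
// Hypothetical schema registry. A real service would load ClassMeeting.yaml
// and pass it to a json-schema validator instead of this simplified check.
const schemas = {
  ClassMeeting: {
    required: ["startTime", "endTime"],
    properties: { startTime: "string", endTime: "string", location: "string" },
  },
};

// Throws if the body does not match the structure of the named schema.
function verifySchema(body, schemaName) {
  const schema = schemas[schemaName];
  if (!schema) throw new Error(`unknown schema: ${schemaName}`);
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    throw new Error("body must be an object");
  }
  for (const field of schema.required) {
    if (!(field in body)) throw new Error(`missing required field: ${field}`);
  }
  for (const [field, value] of Object.entries(body)) {
    if (!Object.prototype.hasOwnProperty.call(schema.properties, field)) {
      throw new Error(`unexpected field: ${field}`);
    }
    const expected = schema.properties[field];
    if (typeof value !== expected) throw new Error(`${field} must be a ${expected}`);
  }
}
```

Notice that nothing here looks at the database or at who is asking. The check only needs the request body and the schema, which is why it can be generated from your API documentation.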
However, many endpoints on your API will need to implement additional semantic validation.
The most common example of this is just authorization: An instructor can make a syntactically valid request to create a lecture note in another class, but if the instructor is not actually a part of that class, that operation should be disallowed.
That said, this is not the only kind of semantic validation you’ll encounter. The underlying definition is that semantic validation requires knowledge of existing application state, which is a large part of why stateless technologies like json-schema cannot handle these cases.
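To make the authorization example concrete, here is a minimal sketch. The `instructorsByClass` map stands in for whatever database table tracks class membership, and every name in it is hypothetical. The point is only that the check cannot be answered from the request body alone.

```javascript
// Hypothetical in-memory stand-in for application state (normally a database).
const instructorsByClass = new Map([
  ["class-101", new Set(["alice"])],
  ["class-202", new Set(["bob"])],
]);

// Semantic check: the request is well-formed, but is this user allowed to make it?
function assertCanCreateNote(userId, classId) {
  const instructors = instructorsByClass.get(classId);
  if (!instructors || !instructors.has(userId)) {
    throw new Error("user is not an instructor of this class");
  }
}
```

With this data, assertCanCreateNote("alice", "class-101") passes quietly, while assertCanCreateNote("alice", "class-202") throws, even though both requests are syntactically identical.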
The solution to implementing semantic validation seems simple: just add additional so-called “business logic” at your create and update endpoints for the resources in question, to ensure that the user can never create a resource with an invalid state.
For example, if we need to ensure that a ClassMeeting’s startTime must always come before its endTime (a reasonable requirement), simply reject the request if that requirement is not met.
function patch(req) {
  // ...
  // we will also error if the requested startTime is after the endTime
  if (req.body.startTime > req.body.endTime) {
    throw new Error("startTime must occur before endTime");
  }
}
In practice, this works very effectively for small resources that are not dependent on other data, but you start running into issues when you need to validate large collections of interdependent data.
Side note: This can also break down if you allow partial updates of a single resource with interdependent fields, for the exact same reasons. Imagine if we could send an update request that only updates the startTime. Finish reading and then think about why this would be a problem.
Consider the example of a Meeting resource, but now let’s introduce a new restriction (that is hopefully still reasonable): no two meetings can overlap.
This validation is also a semantic one, but it’s applied not over a single resource, but over an entire collection of them. This makes our business logic more complicated due to the types of requests we accept.
It’s easy enough to compare a collection of Meetings and ensure there’s no overlap, but the question for the developer becomes: when is it appropriate to apply this validation? If we apply the validation when an individual Meeting resource is created or updated, we will have to pull all the other Meetings from the database to compare it against.
function patch(req) {
  // ...
  // fetch all class meetings from the db to compare to
  // (a real handler would also skip the meeting being updated)
  const meetings = ClassMeetings.findAll();
  // two meetings overlap when each one starts before the other ends;
  // this also catches a new meeting that fully contains an existing one
  if (meetings.some(meeting => meeting.startTime < req.body.endTime
      && req.body.startTime < meeting.endTime)) {
    throw new Error("no meetings can overlap");
  }
}
This is an underlying theme in many semantic validation problems: often, you will need to fetch additional context from the database to complete your validation. This is usually undesirable for performance and complexity reasons, but in some small cases, may simply be the cost of doing business.
Another aspect to consider is error reporting. We can convey to the client that the “no overlaps” error was encountered fairly easily, but if we want to convey more helpful information such as which item presented a conflict, error reporting quickly becomes much more complicated, and the client must also shoulder the burden of interpreting these complex errors in a user-friendly way.
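Here is a rough sketch of what “more helpful” error reporting could look like. The `MeetingConflictError` class, the `conflictIds` field, and the assumption that every meeting has an `id` are all made up for illustration. Times are assumed to be comparable values such as numeric timestamps.

```javascript
// Two meetings overlap when each one starts before the other ends.
function overlaps(a, b) {
  return a.startTime < b.endTime && b.startTime < a.endTime;
}

// Hypothetical error type that carries the ids of the conflicting meetings.
class MeetingConflictError extends Error {
  constructor(conflictIds) {
    super("no meetings can overlap");
    this.name = "MeetingConflictError";
    this.conflictIds = conflictIds;
  }
}

// Throws a MeetingConflictError listing every existing meeting that
// overlaps the candidate (ignoring the candidate's own stored copy).
function assertNoConflicts(candidate, existingMeetings) {
  const conflictIds = existingMeetings
    .filter(m => m.id !== candidate.id && overlaps(m, candidate))
    .map(m => m.id);
  if (conflictIds.length > 0) throw new MeetingConflictError(conflictIds);
}
```

Even in this small version, the client now has to know about a second error shape and decide how to show a list of conflicting ids to the user, which is exactly the extra burden described above.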
Side note: This kind of validation can also interfere with the client’s form submission flow. The client must usually remember to resubmit the rejected Meeting resource once the issue is corrected. This could happen when the rejected Meeting is modified, or when the conflicting meeting is modified, which might happen in a completely different view. It’s possible that the site design will not even allow for a good way to correct these kinds of errors without starting over.
Often, if the data in question is complex and has a lot of different fields, the user has no way to get their client to remember their invalid data unless it’s submitted successfully. If the user closes their browser, all their unsubmitted data is lost. For a small object like a meeting, this might be an acceptable amount of local state to risk losing, but if it’s ten meeting objects, or if it’s an entire assignment with many child questions that all need to be semantically validated, it becomes more and more important to allow the backend to accept intermediate, temporarily invalid state.
Rather than forcing the user to solve a Tower of Hanoi problem of trying to reach the desired final state without any invalid intermediate states, in certain cases the backend should allow semantically invalid data to be saved and committed.
I’ll propose two different paradigms for allowing our API endpoints to support this, though they are incompatible with each other.
The first is the simplest: don’t allow the client to submit any create or update requests for individual resources. This might sound counterintuitive, but this approach can be highly appropriate for small data sets.
Instead, we allow the client to submit a replaceAll request for the entire collection, like so:
PUT /classes/[classID]/meetings
In the body of this request, the client provides an array of every meeting belonging to this class. All others currently stored are deleted and replaced. Every time the client wants to update one meeting, they must send every meeting in the collection, new and old ones.
This makes it trivial for the backend to verify the semantic validity of the entire collection, without any additional requests to the database for more information.
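As a sketch of why this is trivial: with the whole collection in hand, the server can sort the meetings by start time and only compare neighbors. The `store` map and the function names below are hypothetical, and I’m assuming startTime and endTime are numeric timestamps.

```javascript
// Validates a whole collection at once: every meeting must start before it
// ends, and once sorted by start time, no meeting may begin before the
// previous one has ended.
function validateMeetings(meetings) {
  for (const m of meetings) {
    if (!(m.startTime < m.endTime)) {
      throw new Error("startTime must occur before endTime");
    }
  }
  const sorted = [...meetings].sort((a, b) => a.startTime - b.startTime);
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].startTime < sorted[i - 1].endTime) {
      throw new Error("no meetings can overlap");
    }
  }
}

// Hypothetical handler body for PUT /classes/[classID]/meetings.
// `store` is an assumed Map from classID to that class's meetings.
function replaceAllMeetings(store, classID, meetings) {
  validateMeetings(meetings); // reject before touching stored data
  store.set(classID, meetings); // everything previously stored is replaced
}
```

Because validation runs before anything is written, a rejected request leaves the previously stored collection untouched.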
In the draft / publish approach, we add a field to the parent resource that manages the collection, called isPublished, that will be set to false on newly created resources. When a resource is not published, or “is a draft”, it does not need to be semantically valid when it is created or updated. Instead, there is an extra step after updating, called publishing, that is required to make the changes visible to other clients.
POST /classes/[classID]/publish
This will toggle the publishing of a particular class, which will validate the meetings that belong to it in the process.
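A minimal sketch of that flow might look like the following. The in-memory class object and the function names are hypothetical, times are assumed to be numeric timestamps, and I’m assuming that any edit turns the class back into a draft, which is a design choice you may not want.

```javascript
// Hypothetical in-memory class record; isPublished starts out false.
function createClass() {
  return { isPublished: false, meetings: [] };
}

// Drafts may hold semantically invalid state, so saving never checks overlaps.
function saveMeeting(cls, meeting) {
  cls.meetings = cls.meetings.filter(m => m.id !== meeting.id).concat(meeting);
  cls.isPublished = false; // assumption: any edit returns the class to draft
}

// True if any pair of meetings overlaps.
function hasOverlap(meetings) {
  return meetings.some((a, i) =>
    meetings.some((b, j) =>
      i < j && a.startTime < b.endTime && b.startTime < a.endTime));
}

// Hypothetical handler body for POST /classes/[classID]/publish.
// Semantic validation of the whole collection happens here, and only here.
function publish(cls) {
  if (hasOverlap(cls.meetings)) throw new Error("no meetings can overlap");
  cls.isPublished = true;
}
```

The user can save two overlapping meetings, close the browser, come back later to fix one of them, and only then publish. The invalid intermediate state lives safely on the backend instead of in the client.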
Do you need one of these strategies to implement validation of your resource collection? Not necessarily; your situation should meet the requirements first. Here’s a simple flow chart:
This conversation is based on some challenges I faced while building data models for certain complicated web services. That said, I find that the principles behind these observations are relevant in many other types of systems, and share some similarities with discussions on concurrency and consistency.
These are the same problems that any large organization has probably spent a lot of time trying to solve, so the more fluent you are, the better you’ll be at software design.
Published December 12th, 2021
The ramblings of a software engineer, with an emphasis on security.
© Aaron James 2022