Max Heinritz > Posts

Search

Search is a powerful feature for any data-intensive application, and it’s especially important in complex business domains such as logistics.

A typical freight use case is pasting a reference number like “71234123” into a search box to see all related data. That could include bills of lading, purchase orders, commercial invoices, delivery orders, invoices, etc., along with the corresponding entities that reify such concepts within the application’s domain model, most commonly shipments and invoices. Results could also include unstructured data such as emails or in-app messages.

Implementing such a system involves a few parts.

Search-engine database

The foundation is a persistence mechanism with search-friendly indexing and retrieval patterns. While it’s possible to power search with a relational database or a document store, a dedicated option will offer improved flexibility and performance at scale.

Started in 1999, Lucene is the progenitor of most modern tools in this category. It gave rise to Elasticsearch in 2010, which Amazon forked into OpenSearch in 2021 after commoditizing its availability in the cloud.

OpenSearch is the easiest and cheapest option on AWS. The OpenSearch domain model comprises searchable “documents”, each with a well-typed set of “fields”. An “index” is a collection of documents, which can be searched by matching a “query term” against document fields.
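As a rough illustration of these terms, the sketch below writes an index mapping, a document, and a query as plain Python dictionaries in the shape of the JSON request bodies. The index name `shipments` and every field name are assumptions invented for this example, and nothing here talks to a real cluster.

```python
import json

# Hypothetical index name and fields, chosen only for illustration.
INDEX_NAME = "shipments"

# Mapping: declares the type of each field in the index.
shipment_mapping = {
    "mappings": {
        "properties": {
            "reference_numbers": {"type": "keyword"},  # exact-match identifiers
            "shipper_name": {"type": "text"},          # analyzed full text
            "consignee_name": {"type": "text"},
            "created_at": {"type": "date"},
        }
    }
}

# A document: one searchable record with values for those fields.
shipment_document = {
    "reference_numbers": ["71234123", "PO-5567"],
    "shipper_name": "Acme Widgets",
    "consignee_name": "Globex Retail",
    "created_at": "2024-01-15T09:30:00Z",
}


def build_query(term: str) -> dict:
    """Build a query body that matches the term against several fields."""
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["reference_numbers", "shipper_name", "consignee_name"],
            }
        }
    }


print(json.dumps(build_query("71234123"), indent=2))
```

Mapping identifiers as `keyword` and names as `text` is one reasonable split: reference numbers are usually pasted whole, while names benefit from tokenized matching.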

Algolia is also an option if you are willing to pay more for ease of use.

Real-time indexing pipeline

The next component is an indexing pipeline to load data into the search database.

For logistics, this often involves denormalizing substantial amounts of data. To index a shipment entity, we may want to load its consignee, shipper, carrier, and customer organizations for their names and addresses. Then we may load all its payable and receivable invoices, then all of their payee and payor organizations, and so on.
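To make the denormalization concrete, here is a minimal sketch that flattens a shipment and its related entities into one search document. The dataclasses, field names, and the `build_shipment_document` helper are all hypothetical; a real domain model would have many more fields and would load these entities from the primary database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Organization:
    id: str
    name: str


@dataclass
class Invoice:
    id: str
    number: str
    payee_id: str
    payor_id: str


@dataclass
class Shipment:
    id: str
    reference_numbers: List[str]
    shipper_id: str
    consignee_id: str
    carrier_id: str
    invoice_ids: List[str] = field(default_factory=list)


def build_shipment_document(
    shipment: Shipment,
    orgs: Dict[str, Organization],
    invoices: Dict[str, Invoice],
) -> dict:
    """Walk from the shipment to its invoices and organizations and flatten them."""
    shipment_invoices = [invoices[i] for i in shipment.invoice_ids]
    org_ids = {shipment.shipper_id, shipment.consignee_id, shipment.carrier_id}
    for invoice in shipment_invoices:
        org_ids.update((invoice.payee_id, invoice.payor_id))
    return {
        "id": shipment.id,
        "reference_numbers": shipment.reference_numbers,
        "shipper_name": orgs[shipment.shipper_id].name,
        "consignee_name": orgs[shipment.consignee_id].name,
        "carrier_name": orgs[shipment.carrier_id].name,
        "invoice_numbers": [invoice.number for invoice in shipment_invoices],
        "organization_names": sorted(orgs[o].name for o in org_ids),
    }


orgs = {
    o.id: o
    for o in [
        Organization("org1", "Acme Widgets"),
        Organization("org2", "Globex Retail"),
        Organization("org3", "Fast Freight"),
    ]
}
invoices = {"inv1": Invoice("inv1", "INV-1001", payee_id="org3", payor_id="org1")}
shipment = Shipment("shp1", ["71234123"], "org1", "org2", "org3", ["inv1"])
document = build_shipment_document(shipment, orgs, invoices)
```

Every entity touched inside `build_shipment_document` is also an entity whose changes can make the document stale, which is why the reindexing trigger matters.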

Each time any of these entities change, we need to reindex all possibly impacted search documents. When do we trigger such reindexing?

One option is to retrigger within the entity mutate services: when an organization name is changed, immediately reindex all shipments that involve that organization. But this forces mutate paths to depend on all possibly impacted searchable entities. That creates dependency cycles, which lead to spaghetti code: shipments naturally depend on organizations due to the nature of the domain, but now organizations need to depend on shipments too.

A better option is event consumers. Search consumers can subscribe to “revised” events on all related entities and recompute data to index as needed. This avoids cycles and eliminates the need to update lower-level mutate code paths every time a new higher-level entity is added.

Concretely, this would be an ORGANIZATION_REVISED event triggering something like “SearchShipmentReindexConsumer” to find related shipments and update their search documents. The organization mutation code simply needs to publish the event – it doesn’t care who consumes it.
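A minimal sketch of that flow follows. The `EventBus` class is a synchronous, in-process stand-in for whatever queue or event system is actually in use, and the lookup table from organization to shipments is an assumption made for the example; only the event and consumer names come from the description above.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]


class EventBus:
    """Tiny synchronous bus; a stand-in for a real queue (assumption)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        for handler in self._handlers[event_type]:
            handler(event)


class SearchShipmentReindexConsumer:
    """Finds shipments related to a revised organization and reindexes them."""

    def __init__(
        self,
        shipment_ids_by_org: Dict[str, List[str]],
        reindex_shipment: Callable[[str], None],
    ) -> None:
        self._shipment_ids_by_org = shipment_ids_by_org
        self._reindex_shipment = reindex_shipment

    def handle(self, event: Event) -> None:
        for shipment_id in self._shipment_ids_by_org.get(event["organization_id"], []):
            self._reindex_shipment(shipment_id)


def rename_organization(
    bus: EventBus, names: Dict[str, str], organization_id: str, new_name: str
) -> None:
    """Mutation path: knows nothing about shipments or search, only the event."""
    names[organization_id] = new_name
    bus.publish("ORGANIZATION_REVISED", {"organization_id": organization_id})


reindexed: List[str] = []
bus = EventBus()
consumer = SearchShipmentReindexConsumer(
    {"org1": ["shp1", "shp2"], "org2": ["shp3"]}, reindexed.append
)
bus.subscribe("ORGANIZATION_REVISED", consumer.handle)

names = {"org1": "Acme Widgets"}
rename_organization(bus, names, "org1", "Acme Widgets Inc.")
```

The dependency arrow points one way: the consumer imports organization events, while `rename_organization` never references shipments.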

Note that since the search database is separate from the primary database anyway, moving this logic to event consumers does not change behavior with respect to database transactions.

Batch indexing pipeline

Sometimes a new field needs to be added to all existing instances of a search document. For example, we might decide we want to start indexing a computed field called “billing status” for a shipment, which aggregates the payment statuses of all payable and receivable invoices. We need to go back and update all the existing documents to include this field. For this, a cursor-based backfill that lists entities page by page and checkpoints its cursor to the filesystem is a good option. A resumable Temporal workflow also works.
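The filesystem-checkpoint variant might look roughly like the sketch below. The function names, the JSON checkpoint format, and the in-memory list standing in for a paginated database query are all assumptions for illustration. The demo simulates a crash partway through and then resumes from the last saved cursor.

```python
import json
import os
import tempfile
from typing import Callable, List, Optional


def list_shipment_ids(all_ids: List[str], cursor: Optional[str], limit: int) -> List[str]:
    """Return up to `limit` ids strictly after `cursor`; stands in for a paged DB query."""
    remaining = [i for i in sorted(all_ids) if cursor is None or i > cursor]
    return remaining[:limit]


def load_checkpoint(path: str) -> Optional[str]:
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["cursor"]


def save_checkpoint(path: str, cursor: str) -> None:
    # Write to a temp file and rename so a crash never leaves a half-written file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"cursor": cursor}, f)
    os.replace(tmp_path, path)


def backfill(
    all_ids: List[str],
    reindex: Callable[[str], None],
    checkpoint_path: str,
    page_size: int = 2,
) -> None:
    cursor = load_checkpoint(checkpoint_path)
    while True:
        page = list_shipment_ids(all_ids, cursor, page_size)
        if not page:
            break
        for shipment_id in page:
            reindex(shipment_id)
        cursor = page[-1]
        save_checkpoint(checkpoint_path, cursor)  # checkpoint after each full page


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "backfill_checkpoint.json")
    ids = ["s1", "s2", "s3", "s4", "s5"]
    seen: List[str] = []
    fail_once = {"pending": True}

    def reindex(shipment_id: str) -> None:
        if shipment_id == "s4" and fail_once["pending"]:
            fail_once["pending"] = False
            raise RuntimeError("simulated crash")
        seen.append(shipment_id)

    try:
        backfill(ids, reindex, checkpoint_path)
    except RuntimeError:
        pass  # first run dies partway through the second page
    backfill(ids, reindex, checkpoint_path)  # second run resumes after "s2"
    final_cursor = load_checkpoint(checkpoint_path)
```

Because the checkpoint is saved per page, a resumed run repeats any items from the interrupted page, so the reindex step needs to be idempotent.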

Denormalized records

Reindexing becomes very slow and expensive if entity traversal and data loading are done as part of search indexing. An option here is to break the indexing pipeline into two steps: first, compute a “denormalized record” and store it within the primary database using a JSON field; second, read from the denormalized record table in batch and index directly into the search database.

This improves debuggability by making it easier to determine whether something is broken in the step to produce a denormalized record or whether something is broken in the step to index the denormalized record in the search database. This is especially useful for complex computed fields – having the intermediate result in the primary database is helpful for quick iteration.
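The two steps can be sketched as follows, using an in-memory SQLite database as a stand-in for the primary database and a plain dictionary as a stand-in for the search index. The table name `shipment_search_record` and both function names are assumptions made for this example.

```python
import json
import sqlite3
from typing import Callable, Dict


def store_denormalized_record(conn: sqlite3.Connection, shipment_id: str, record: dict) -> None:
    """Step 1: persist the computed record as JSON in the primary database."""
    conn.execute(
        "INSERT OR REPLACE INTO shipment_search_record (shipment_id, record) VALUES (?, ?)",
        (shipment_id, json.dumps(record)),
    )


def index_records(
    conn: sqlite3.Connection,
    index_document: Callable[[str, dict], None],
    batch_size: int = 100,
) -> int:
    """Step 2: read stored records in batches and hand them to the search index."""
    cursor = conn.execute(
        "SELECT shipment_id, record FROM shipment_search_record ORDER BY shipment_id"
    )
    count = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for shipment_id, raw in rows:
            index_document(shipment_id, json.loads(raw))
            count += 1
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipment_search_record (shipment_id TEXT PRIMARY KEY, record TEXT NOT NULL)"
)
store_denormalized_record(conn, "shp1", {"shipper_name": "Acme Widgets", "billing_status": "PAID"})
store_denormalized_record(conn, "shp2", {"shipper_name": "Globex Retail", "billing_status": "OPEN"})
# Recomputing a record overwrites the previous version for the same shipment.
store_denormalized_record(conn, "shp2", {"shipper_name": "Globex Retail", "billing_status": "PAID"})

fake_index: Dict[str, dict] = {}
indexed_count = index_records(conn, fake_index.__setitem__, batch_size=1)
```

When a search result looks wrong, the stored row can be inspected with an ordinary SQL query to see which of the two steps produced the bad value.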

Product interface - entrypoint

I find it’s best to allow the search box and search results to be accessible without leaving the current page. For example, the search box can be embedded as a fixed part of the top menu bar or made accessible through a shortcut.

Let’s say I’m viewing a PDF file and want to find data related to a specific reference number on the PDF. With a search box widget in the top menu bar, I can copy/paste from the PDF to search and see results in a hovering drop down. If there are two shipment results, I can cross-check the rich shipment search result row against what’s visible in the PDF file to determine which of the shipments I want – all without leaving the page with the PDF.

The alternative is to require navigating to a separate full-page view to perform a search. Such an approach creates a context switch that is unnecessary for most use cases. It adds clicks, clutters the user’s browser history, is visually jarring, etc.

Product interface - results

Given that search results may return many different entity types, the interface should provide “rich” search results for each entity type. For example, a shipment result might include a small route preview with the origin and destination city and the carrier. An email result might include the email title and sender. An organization might include its name, address, USDOT number, etc. The rich results provide two benefits:

1. Visually distinguish different entity types, reifying concepts in the domain model.
2. Provide additional data points to help users find what they are looking for. These data points vary based on entity.

Backend-to-frontend API

There are two approaches here. The backend can expose a single API that takes a search term and returns all results of any type, or it can expose APIs per entity type. The former approach allows the backend to control the sorting of results across entity types, but it adds complexity. The latter approach is more straightforward, especially with GraphQL, wherein a single network request can fetch data from several search indices at once. This approach also makes it easier to fetch exactly what’s needed for each entity type:

query SearchQuery($term: String!) {
  organizations(term: $term, first: 10) {
    edges {
      qid
      name
      primaryAddress {
        city
        state
      }
    }
  }
  artifacts(term: $term, first: 10) {
    edges {
      qid
      name
      format
      type
    }
  }
  shipments(term: $term, first: 10) …
  invoices(term: $term, first: 10) …
}
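For contrast, the single-API approach needs the backend to merge hits from several indices into one ranked list. The sketch below shows one naive way to do that by sorting on a relevance score. The `Hit` type and `merge_results` function are hypothetical, and the sketch assumes scores are comparable across indices, which is often not true in practice and is part of the added complexity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    entity_type: str
    qid: str
    score: float


def merge_results(per_type_hits: Dict[str, List[Hit]], limit: int) -> List[Hit]:
    """Flatten hits from every entity type and keep the highest-scoring ones."""
    all_hits = [hit for hits in per_type_hits.values() for hit in hits]
    return sorted(all_hits, key=lambda hit: hit.score, reverse=True)[:limit]


merged = merge_results(
    {
        "organizations": [Hit("organization", "org_1", 2.5)],
        "shipments": [Hit("shipment", "shp_1", 7.1), Hit("shipment", "shp_2", 1.2)],
        "invoices": [Hit("invoice", "inv_1", 4.0)],
    },
    limit=3,
)
```

With per-type APIs, none of this merging is needed: each list is ranked within its own index and the frontend decides how to lay the groups out.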

Backend implementation

I’ve seen good success with the Searchkit library and custom GraphQL APIs.