The Zen of Chatbot State ✈︎

When Facebook Messenger, Slack, and Skype announced the availability of chat APIs last year, we were excited to bring our Hello Hipmunk travel assistant - which was previously only available through email - to the users of these messaging products. However, as we started adding support for these new APIs to our chatbot engine, we quickly realized that there were challenges unique to building a great experience on modern chat platforms. Not only did we have to contend with users interacting via text message, as they do in email, but we now wanted to support interactive button inputs as well. We found ourselves tackling a fundamental but relatively unexplored problem: What is the best way to model a conversation?

The transition from email-based virtual assistant to chatbot raised new questions. How should our chatbot handle short messages containing half-formed ideas? How should we use buttons to add useful, meaningful interactivity to our search results? Finally, how could a chatbot leverage the messaging interface to make a conversation with a bot even more powerful than one with a human?

Web and mobile interfaces are built around moving users between pages, and they have the power (and responsibility) to render more or less precisely what the application wants. In the case of the Hipmunk web and mobile user interfaces, the user simply enters their search criteria into a form and selects “Search”. If the user wants to change their search criteria, they can update the form and select “Search” again. This is a highly structured interface that restricts the flexibility of the interaction: there is a clearly defined sequence of actions and a limited set of variables the user can enter. As such, the state variables and transitions of the interaction are clearly defined.

In contrast, chatbots sit at the opposite end of the interface spectrum: minimum structure, maximum flexibility. Chatbot users are presented with a conversation history (an interface so ubiquitous that it has essentially zero learning curve for most users) where the chat platform controls almost all of the user experience - and the user has a nearly unbounded range of inputs via natural language. To manage this complexity, many chatbots invite users to enter information in discrete bits and pieces. But a form-filling bot isn’t really an improvement over a web form, so a chatbot should ideally be able to handle information regardless of order or structure. Furthermore, a user should be able to change parts of their request without having to reenter the whole thing.

Adding additional information to a chat

Correcting incorrect information in a chat

Parsing this flow of input text and determining exactly what the user wants the bot to do is challenging. Language interfaces offer very few guarantees, but a chatbot needs to map this ambiguous input into well-defined actions to perform for the user. Thus, the problem of managing conversation state and transitioning from one state to another is a central problem in building chatbots.

While traditional web applications can pull from a toolbox of proven design patterns such as MVC (to separate state from presentation) or Flux (to define state and state transitions), there is no off-the-shelf strategy for the problem at the core of building conversational interfaces: managing the state of a continuous conversation. We decided to tackle this problem with an architecture of our own.

Background

Our bot is broadly structured into four sequential parts:

  1. A parsing engine responsible for extracting entities from language
  2. An intent engine that recognizes intents and constructs actions
  3. An execution engine that carries out the action and produces results
  4. A templating engine that displays the results to the user

Entities are mostly nouns or adjectives. For instance, if a user asked for “a non-stop flight on Virgin America to Dallas leaving before 6pm” we treat “non-stop”, “Virgin America”, “Dallas”, and “6pm” each as separate entities. Broadly, intent refers to verbs in the user’s request (“I want to fly”), and actions are fully formed ideas (the combination of an intent and entities).
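To make these terms concrete, here is a minimal sketch of how entities and actions might be represented as plain data structures. The class and field names are illustrative, not our production code:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A typed noun or adjective pulled from the user's text."""
    type: str    # e.g. "airline", "city", "time_constraint"
    value: str   # e.g. "Virgin America", "Dallas", "before 6pm"


@dataclass
class Action:
    """A fully formed idea: an intent plus the entities attached to it."""
    intent: str                                      # e.g. "flight"
    entities: List[Entity] = field(default_factory=list)


# "a non-stop flight on Virgin America to Dallas leaving before 6pm"
action = Action(
    intent="flight",
    entities=[
        Entity("stops", "non-stop"),
        Entity("airline", "Virgin America"),
        Entity("city", "Dallas"),
        Entity("time_constraint", "before 6pm"),
    ],
)
```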

To illustrate how these parts work together, let's imagine a user requests:

find me a flight from San Francisco to Paris next weekend!

via Facebook Messenger. We are alerted to this message by a JSON payload sent to our webhook endpoint. First, our parsing engine extracts typed entities (in this case, "flight", "San Francisco", "Paris", and "next weekend") from the text. Next, our intent engine works backwards from the entities to identify the intent ("flight"), assembling an action that links the intent with the associated entities. We then execute the action against Hipmunk's internal APIs; in this case, we collect flight results to Paris next weekend. Finally, we render the flight results with our templating engine and present them to the user via our implementation of the sending platform's API.
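Sketched as code, the four stages might be chained behind the webhook roughly like this. Everything below is a stand-in: the stage functions are stubs and the flight result is made-up placeholder data, purely for illustration:

```python
from typing import Dict, List

def parse(text: str) -> List[Dict[str, str]]:
    """Stage 1 (parsing engine): extract typed entities. Stubbed with the example's entities."""
    return [
        {"type": "intent_hint", "value": "flight"},
        {"type": "origin", "value": "San Francisco"},
        {"type": "destination", "value": "Paris"},
        {"type": "date", "value": "next weekend"},
    ]

def build_action(entities: List[Dict[str, str]]) -> Dict:
    """Stage 2 (intent engine): work backwards from the entities to an intent."""
    return {"intent": "flight", "entities": entities}

def execute(action: Dict) -> List[Dict]:
    """Stage 3 (execution engine): would hit Hipmunk's internal search APIs; stubbed here."""
    return [{"airline": "Example Air", "price_usd": 612}]

def render(results: List[Dict]) -> str:
    """Stage 4 (templating engine): format results for the chat platform."""
    return "\n".join(f"{r['airline']} - ${r['price_usd']}" for r in results)

def handle_webhook(payload: Dict) -> str:
    """One incoming message flows through the four stages in order."""
    text = payload["message"]["text"]
    return render(execute(build_action(parse(text))))

reply = handle_webhook(
    {"message": {"text": "find me a flight from San Francisco to Paris next weekend!"}}
)
```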

This linear process sounds straightforward enough, but the freeform nature of text means that we should support some variation in input. For example, if a user sends us three messages:

I want a flight to Paris

from San Francisco

this weekend

...our bot should interpret those three messages as one request, as if the user had typed the same message all at once.
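One way to picture this, as a simplified sketch rather than our actual engine: if the entities parsed from each message accumulate into a single collection, the intent engine sees the same combination of entities whether they arrived in one message or three.

```python
# Entities parsed from each individual message (illustrative types and values).
parsed_messages = [
    [("intent_hint", "flight"), ("destination", "Paris")],  # "I want a flight to Paris"
    [("origin", "San Francisco")],                          # "from San Francisco"
    [("date", "this weekend")],                             # "this weekend"
]

# Accumulate everything; the combined list is what the intent engine works
# backwards from, just as if the user had typed one long message.
conversation_entities = []
for entities in parsed_messages:
    conversation_entities.extend(entities)

# [('intent_hint', 'flight'), ('destination', 'Paris'),
#  ('origin', 'San Francisco'), ('date', 'this weekend')]
print(conversation_entities)
```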

Conversations as a Stream

Our first realization was that if we think of the conversation as a contiguous stream of information, the problem becomes simpler to solve. Instead of modeling the chatbot's knowledge as a series of entity "slots" specific to each intent, we inverted the model. We taught our bot to look at the "stream" of entities in the conversation and work backwards from different combinations of entities that represent a discrete intent. This ability to step through intents and pick out relevant entities is implicit in the way we built our intent engine. It meant that, with minimal effort, we could handle corrections like:

actually, I want to go to London.

Each time the user inputs more text, we append the new entities we parse to the conversation stream, prioritizing the most recent information. In this way, we can even handle more complex, multipart requests like:

"I want to go to Denver for Thanksgiving and LA for New Years."

By thinking of our conversation as an append-only stream of entities, we stumbled upon a simple but powerful way to manage conversation state. This method turns out to be both flexible and safe: we can add, remove, and modify actions that parse the entity stream without worrying about schema migrations or breaking preexisting conversations.
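Here is a minimal sketch of the append-only idea. The resolution rule shown, "the latest value of each entity type wins", is a simplification for illustration; a multipart request like the Denver/LA one would need entity grouping rather than a single last-wins pass:

```python
# The conversation is an append-only stream of (type, value) entities; nothing is rewritten.
stream = [
    ("intent_hint", "flight"),
    ("destination", "Paris"),
    ("origin", "San Francisco"),
    ("date", "next weekend"),
]

# "actually, I want to go to London."  ->  just more entities appended to the stream
stream.append(("destination", "London"))

def resolve(stream):
    """Walk the stream in order, keeping the most recent value for each entity type."""
    latest = {}
    for entity_type, value in stream:
        latest[entity_type] = value
    return latest

# {'intent_hint': 'flight', 'destination': 'London',
#  'origin': 'San Francisco', 'date': 'next weekend'}
print(resolve(stream))
```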

Conversations as Streams

Conversations as a Loop

Handling a conversation as an append-only stream of entities works well for text, but we also wanted to embrace the button-based navigation patterns that are now standard fare for chatbots. So how should buttons interact with our conversational model? We wanted buttons to mutate conversation state, but we didn't want a button press to cause the user to lose their place in the conversation. The answer became obvious to us: buttons are entities too!

By treating buttons as entities, we can easily support a button that reads "Show Cheapest" and a chat message from the user saying "actually, I want the cheapest option" with the same "cheapest" entity. We can then use the same logic in the intent, execution, and display steps - it's the same entity either way. But we can also tie more complex behaviors to similar buttons with "hidden" entities: buttons that share text but differ in intent (such as a "create fare alert" button) can carry unique entities that contain information specific to the source from which they originated.
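In practice, this might look something like the sketch below: the button's postback payload simply carries the same entities the text parser would have produced, plus a hidden entity tying it to its source. The payload format here is hypothetical:

```python
import json

# A "Show Cheapest" button attached to a set of flight results. Its payload carries the
# same "cheapest" entity the parser would emit for "actually, I want the cheapest option",
# plus a hidden entity recording the card it was rendered on.
show_cheapest_button = {
    "title": "Show Cheapest",
    "payload": json.dumps({"entities": [["sort", "cheapest"], ["source_card", "results-0"]]}),
}

def entities_from_button(postback_payload: str):
    """A button press reduces to entities, exactly like a parsed text message."""
    return [tuple(entity) for entity in json.loads(postback_payload)["entities"]]

# [('sort', 'cheapest'), ('source_card', 'results-0')]
print(entities_from_button(show_cheapest_button["payload"]))
```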

Now, our conversational model looks more like a loop, where the user inputs either a button press or text; we parse the text if necessary; and we present the user with a response that optionally has buttons. This process repeats as many times as necessary to handle the conversation with the user. The inputs and outputs of each cycle of the loop are essentially uniform, allowing us to use a linear sequence of entities to store the entire state of the conversation.
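The loop itself can be sketched in a few lines, assuming hypothetical helpers standing in for the real parsing and response stages:

```python
def reduce_to_entities(message):
    """Both kinds of input reduce to entities: buttons carry them in their payload,
    free text goes through the parsing engine (stubbed here)."""
    if "postback" in message:
        return message["postback"]["entities"]
    return [("text", message["text"])]            # stand-in for the real parser

def respond(stream):
    """Stand-in for the intent, execution, and templating stages."""
    return {"text": f"Entities so far: {len(stream)}", "buttons": ["Show Cheapest"]}

stream = []                                       # the entire conversation state
incoming = [
    {"text": "I want a flight to Paris"},
    {"postback": {"entities": [("sort", "cheapest")]}},   # a button press
]
for message in incoming:                          # one pass of the loop per user input
    stream.extend(reduce_to_entities(message))
    reply = respond(stream)
    print(reply)
```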

Conversations as Loops

Conversations as a Tree

By embracing our entity model for storing state, we stumbled upon a bonus feature. One strength of conversational interfaces is that the temporal dimension of a conversation is preserved and accessible by just scrolling up in the chat view. Want to see the hotel we showed you last week? Just scroll up in the conversation! The linear user interface of the chat platforms always keeps the history of the conversation there for the user to see.

How can we take advantage of this always-present history? As a basic requirement, we felt that users should be able not just to see but also to interact with their conversation history. To do this, we realized, our state management should jump back to the state of the conversation at the exact point in history the user is interacting with, giving us access to all of the context available at that moment. By restoring the conversation state to the point of interaction, we gained a somewhat hidden "superpower": when a user interacts with a card, they can cleanly jump back and forth through the state of the conversation at any point. A "normal" person-to-person conversation is linear and requires explicit cues about context; our bot, at the user's prompting, can recall and reuse the context from any point in the conversation history.

Thinking of our conversation model as a tree helps explain how we implemented this functionality. Normally, the conversation state is stored in an append-only fashion (which is, after all, how a "real" conversation works). However, when a user jumps back in the conversation history and interacts with one of our past responses, we branch the state and continue building state along that new branch. We can do this every time the user scrolls back in history.
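A sketch of the branching behavior, assuming each card we send remembers how long the entity stream was when it was produced. The snapshot index is our illustration of that bookkeeping, not necessarily how it is stored in production:

```python
# The conversation so far, as an append-only stream of (type, value) entities.
stream = [
    ("intent_hint", "hotel"), ("city", "Paris"), ("date", "next weekend"),
    ("intent_hint", "flight"), ("destination", "Denver"), ("date", "Thanksgiving"),
]

# Suppose the Paris hotel card we sent earlier recorded that the stream
# was 3 entities long at the moment it was rendered.
card_snapshot_index = 3

# The user scrolls up and taps a button on that old card. Instead of appending to the
# current tip of the stream, we branch from the historical snapshot and keep building
# state along the new branch.
branch = list(stream[:card_snapshot_index])
branch.append(("sort", "cheapest"))               # the button press, as an entity

# [('intent_hint', 'hotel'), ('city', 'Paris'), ('date', 'next weekend'), ('sort', 'cheapest')]
print(branch)
```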

Conversations as Trees

Parting Thoughts

The insight of modeling a conversation and its state as a continuous, appendable series of entities is quite powerful. With minor variations on this strategy, we can incorporate button-based interactions, and take advantage of the ever-present history on messaging platforms. This strategy, which allows us to think about user intent as a cluster of entities rather than a decision tree of questions, achieves what has been termed Random Access Navigation.

If you're interested in working with us to build conversational interfaces and take the agony out of travel, check out our jobs page.