In the Loop: Hidden Markov Models (pt 2)
Learn how hidden Markov models work, how to compute with them, and why they're powerful for systems where key states are hidden, like speech or behavioral modeling.
Transcript
Hi there. I'm Micaela Kaplan, the ML Evangelist at HumanSignal, and this is In the Loop, the series where we help you stay in the loop with all things data science and AI. This week we're going to explore hidden Markov models in the second part of our Markov model series.
If you watched part one, you might remember that we identified four types of Markov models based on whether the system was observable or not and whether the system was autonomous or not. This week we're going to dive into hidden Markov models, or HMMs, which are used for systems that aren’t fully observable but are fully autonomous.
Hidden Markov models are “hidden” because they model events we can’t directly observe. Take part-of-speech tags in a sentence: we can see the words, but the tags themselves have to be inferred. HMMs model both the observable “emissions” (like the words) and the hidden states that produce them (like the tags). Like other Markov models, HMMs rest on the Markov assumption: the next state depends only on the current one. HMMs add a second assumption, called output independence: the probability of an observation depends only on the state that produced it, not on the full history of states and observations.
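Written out in the standard notation (q for hidden states, o for observations, with indices running over time steps), the first line below is the Markov assumption and the second is output independence:

$$P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-1})$$

$$P(o_i \mid q_1, \ldots, q_i, o_1, \ldots, o_{i-1}) = P(o_i \mid q_i)$$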
HMMs consist of a set of states, a transition probability matrix, an initial state distribution, and an emission probability matrix (how likely each observation is to be produced from each state). Following Jack Ferguson’s framing, popularized by Rabiner (1989), three fundamental problems characterize HMMs: (1) Likelihood: given an HMM and an observation sequence, how likely is that sequence? This is solved with the forward algorithm. (2) Decoding: given an HMM and an observation sequence, what is the most likely sequence of hidden states? This is solved with the Viterbi algorithm. (3) Learning: given an observation sequence and the set of states, learn the HMM’s transition and emission probabilities. This is solved with the forward-backward, or Baum-Welch, algorithm.
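To keep those four ingredients straight, here is a minimal sketch of an HMM as a Python container; the class and its field names (states, vocab, pi, A, B) are just my labels for illustration, not a standard API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """The four ingredients of an HMM. The three classic problems are answered
    against an instance of this: likelihood (forward), decoding (Viterbi),
    and learning (forward-backward / Baum-Welch)."""
    states: list      # hidden states, e.g. ["HOT", "COLD"]
    vocab: list       # possible observations (emissions), e.g. daily ice cream counts
    pi: np.ndarray    # initial state distribution, shape (N,)
    A: np.ndarray     # transition matrix, A[i, j] = P(next state j | current state i)
    B: np.ndarray     # emission matrix, B[i, k] = P(observing vocab[k] | state i)
```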
Let’s explore this with a classic example from Jason Eisner. Suppose we want to figure out whether each day was hot or cold based only on how many ice creams Jason ate. We’re given observations of 3 ice creams, 1 ice cream, and 3 ice creams over three days. We construct a trellis: a grid with rows for hidden states (hot/cold) and columns for days. At the bottom of each column, we list the observations.
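To make the trellis concrete, here is that setup written out as numbers. The probabilities below are illustrative placeholders in the spirit of Eisner’s spreadsheet (and the walkthrough in Jurafsky and Martin), not values taken from this video, and they plug into the HMM container sketched above.

```python
import numpy as np

states = ["HOT", "COLD"]
vocab = [1, 2, 3]                 # possible ice cream counts in a day
observations = [3, 1, 3]          # what Jason actually ate on days 1-3

pi = np.array([0.8, 0.2])         # P(day 1 is HOT), P(day 1 is COLD)

A = np.array([[0.6, 0.4],         # from HOT:  P(HOT tomorrow), P(COLD tomorrow)
              [0.5, 0.5]])        # from COLD: P(HOT tomorrow), P(COLD tomorrow)

B = np.array([[0.2, 0.4, 0.4],    # HOT:  P(1 ice cream), P(2), P(3)
              [0.5, 0.4, 0.1]])   # COLD: P(1 ice cream), P(2), P(3)

ice_cream_hmm = HMM(states, vocab, pi, A, B)
```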
We initialize the trellis with the initial probabilities (π) and connect these to the first day’s observation using the emission probabilities. For each subsequent day, we consider every transition between states (e.g., hot to cold, cold to cold) and compute the joint probability of making that transition and observing the given emission. How we use the trellis depends on our goal. For the forward algorithm, we sum the probabilities flowing into each cell to get the total likelihood of the observation sequence. For the Viterbi algorithm, we track the most likely path using max operations and back pointers. For learning (Baum-Welch), we iteratively re-estimate the transition and emission matrices until they converge.
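To show how the same trellis supports both goals, here is a small sketch of the forward and Viterbi passes, continuing with the pi, A, B, and states arrays defined above. It works in plain probabilities (real implementations usually switch to log space to avoid underflow) and leaves out Baum-Welch, which also needs the backward pass.

```python
import numpy as np

def forward(obs_idx, pi, A, B):
    """Likelihood: total probability of the observation sequence, summed over all state paths."""
    N, T = A.shape[0], len(obs_idx)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs_idx[0]]                  # first column of the trellis
    for t in range(1, T):
        for j in range(N):
            alpha[j, t] = np.sum(alpha[:, t - 1] * A[:, j]) * B[j, obs_idx[t]]
    return alpha[:, -1].sum()

def viterbi(obs_idx, pi, A, B):
    """Decoding: most likely hidden state sequence, using max operations and back pointers."""
    N, T = A.shape[0], len(obs_idx)
    v = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)
    v[:, 0] = pi * B[:, obs_idx[0]]
    for t in range(1, T):
        for j in range(N):
            scores = v[:, t - 1] * A[:, j]               # best way to arrive in state j at time t
            backptr[j, t] = np.argmax(scores)
            v[j, t] = scores[backptr[j, t]] * B[j, obs_idx[t]]
    # Follow the back pointers from the best final state to recover the full path.
    path = [int(np.argmax(v[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return list(reversed(path))

# Observations 3, 1, 3 map to column indices 2, 0, 2 of B, since vocab is [1, 2, 3].
obs_idx = [2, 0, 2]
print(forward(obs_idx, pi, A, B))                        # total likelihood of seeing 3, 1, 3
print([states[i] for i in viterbi(obs_idx, pi, A, B)])   # most likely weather for each of the three days
```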
Hidden Markov models are powerful for modeling systems where states can’t be directly seen but still influence what we can observe. From part-of-speech tagging to time series inference, HMMs played a foundational role in early NLP and probabilistic modeling.
That’s all for this week. Thanks for staying in the loop. Want even more In the Loop? Check out our other videos. Don’t forget to like and subscribe to stay in the loop.