Sequence Analytics

by rexarski


Expert knowledge of sequence analysis for longitudinal data, including TraMineR (R) and TanaT (Python) packages. Use when working with state sequences, event sequences, trajectory analysis, distance metrics (optimal matching, Hamming, LCS), clustering, and visualization of sequential data.


This skill provides expert guidance for sequence analysis of longitudinal data, drawing from the established TraMineR ecosystem (R) and the emerging TanaT library (Python).

Overview

Sequence analysis is a family of methods for analyzing categorical time series such as trajectories, life-course events, and other ordered categorical observations. Common applications include:

  • Social sciences: Career trajectories, family formation, educational pathways
  • Healthcare: Patient care pathways, treatment sequences, disease progression
  • User behavior: Customer journeys, clickstreams, session analysis
  • Bioinformatics: DNA/protein sequences, gene expression patterns

Key Concepts

Sequence Representations

| Format | Description | Example | Use Case |
|--------|-------------|---------|----------|
| STS (State Sequence) | One state per time unit, aligned columns | A,A,B,B,C,A | Standard analysis, fixed time grid |
| SPS (State-Permanence) | Run-length encoded with durations | (A,2)-(B,2)-(C,1)-(A,1) | Variable-length spells, compact storage |
| DSS (Distinct Successive States) | Consecutive duplicates removed | A,B,C,A | Focus on transitions, not durations |
| SPELL | Begin/end/status triplets | (0,2,A),(2,4,B),(4,5,C) | Episode data, overlapping intervals |
| TSE (Time-Stamped Events) | Event occurrences with timestamps | (t1,E1),(t2,E2) | Point events, event sequence mining |
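These formats interconvert mechanically. A minimal Python sketch of the STS-to-SPS and STS-to-DSS conversions (function names are illustrative, not from TraMineR or TanaT):

```python
from itertools import groupby

def sts_to_sps(states):
    """Run-length encode an STS sequence into (state, duration) spells."""
    return [(s, sum(1 for _ in run)) for s, run in groupby(states)]

def sts_to_dss(states):
    """Drop consecutive duplicates, keeping only distinct successive states."""
    return [s for s, _ in groupby(states)]

sts = ["A", "A", "B", "B", "C", "A"]
print(sts_to_sps(sts))  # [('A', 2), ('B', 2), ('C', 1), ('A', 1)]
print(sts_to_dss(sts))  # ['A', 'B', 'C', 'A']
```

Note that SPS and DSS discard information differently: SPS is lossless, while DSS cannot be inverted back to STS without the durations.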

Alphabets and States

  • Alphabet: The set of all possible states (e.g., {employed, unemployed, student})
  • State: A single categorical value at a time point
  • Spell: A contiguous period in the same state
  • Transition: A change from one state to another

Missing Data Handling

  • Void (%): Position doesn't exist (sequence ends before this point)
  • Missing (*): State exists but is unknown
  • Left-censoring: The trajectory began before the observation window, so initial states are unknown
  • Right-censoring: Observation ended before the trajectory completed, so final states are unknown

TraMineR (R Package)

TraMineR is the reference implementation for sequence analysis, developed at the University of Geneva.

Core Functions

Sequence Creation

# Create state sequence object from wide-format data
library(TraMineR)
seq_data <- seqdef(
  data,                    # DataFrame with sequence data
  var = 2:10,              # Column indices for states
  alphabet = c("A", "B", "C"),  # Valid states
  states = c("a", "b", "c"),    # Short labels for display
  labels = c("State A", "State B", "State C"),  # Long labels
  cpal = c("#E41A1C", "#377EB8", "#4DAF4A"),    # Color palette
  weights = data$weight,   # Optional case weights
  missing = NA,            # Missing value indicator
  void = "%"               # Void state indicator
)

# Convert between formats
sps_data <- seqformat(seq_data, from = "STS", to = "SPS")

Descriptive Statistics

# State distribution at each time point
seqstatd(seq_data)         # Cross-sectional state frequencies

# Individual sequence characteristics
seqlength(seq_data)        # Sequence lengths
seqient(seq_data)          # Within-sequence entropy
seqST(seq_data)            # Turbulence index
seqici(seq_data)           # Complexity index
seqtransn(seq_data)        # Number of transitions

# Duration and spells
seqdur(seq_data)           # Spell durations
seqdss(seq_data)           # Distinct successive states
seqmeant(seq_data)         # Mean time in each state

# Transition analysis
seqtrate(seq_data)         # Transition rate matrix
seqsubsn(seq_data)         # Number of subsequences

Distance Metrics

# Pairwise sequence distances
dist_matrix <- seqdist(
  seq_data,
  method = "OM",           # Distance method
  indel = 1.0,             # Insertion/deletion cost
  sm = "TRATE",            # Substitution cost matrix
  norm = "auto",           # Normalization method
  with.missing = FALSE     # Handle missing values
)

# Generate substitution cost matrix
sub_costs <- seqcost(
  seq_data,
  method = "TRATE",        # Based on transition rates
  time.varying = FALSE     # Constant or time-varying
)

Distance Methods:

| Method | Full Name | Description |
|--------|-----------|-------------|
| OM | Optimal Matching | Edit distance with indel and substitution operations |
| HAM | Hamming | Positional mismatch count (no indels) |
| DHD | Dynamic Hamming | Hamming with time-varying substitution costs |
| LCS | Longest Common Subsequence | Based on shared subsequences |
| LCP | Longest Common Prefix | Prefix similarity |
| OMloc | Localized OM | Position-specific costs |
| OMslen | OM Spell-Length | Spell-length-sensitive costs |
| OMspell | OM Spell | Operates on spells, not positions |
| CHI2 | Chi-squared | Based on state distribution differences |
| EUCLID | Euclidean | Euclidean distance in state-time space |
| TWED | Time Warp Edit Distance | Edit distance allowing time warping |
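For equal-length sequences, the Hamming variant is simply a positional mismatch count. A minimal Python sketch, assuming a constant substitution cost (DHD would replace the constant with a position-dependent cost):

```python
def hamming_distance(seq1, seq2, sub_cost=1.0):
    """Positional mismatch count; defined only for equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(sub_cost for a, b in zip(seq1, seq2) if a != b)

print(hamming_distance("AABBC", "ABBBC"))  # 1.0 (single mismatch at position 1)
```

Because no insertions or deletions are considered, each comparison is O(L) rather than the O(L²) of optimal matching.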

Visualization

# Index plot - individual sequences as horizontal bars
seqIplot(seq_data, sortv = "from.start")

# State distribution plot - stacked area chart over time
seqdplot(seq_data)

# Frequency plot - most frequent sequences
seqfplot(seq_data, pbarw = TRUE)

# Mean time plot - bar chart of time in each state
seqmtplot(seq_data)

# Modal state plot - sequence of most common states
seqmsplot(seq_data)

# Entropy plot - transversal entropy over time
seqHtplot(seq_data)

# Representative sequences plot
seqrplot(seq_data, diss = dist_matrix, criterion = "density")

Clustering and Representatives

# Cluster sequences using distance matrix
library(cluster)
clust <- agnes(dist_matrix, diss = TRUE, method = "ward")
groups <- cutree(clust, k = 4)

# Find representative sequences
reps <- seqrep(seq_data, diss = dist_matrix, criterion = "density")

# Medoid (central sequence)
medoid <- disscenter(dist_matrix, medoids.index = "first")

# Pseudo-variance (dispersion)
dissvar(dist_matrix)

# Association with covariates
dissassoc(dist_matrix, group = groups)
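The medoid and pseudo-variance have direct definitions in terms of the distance matrix, so they translate readily to Python. A sketch of those definitions (not TraMineR's implementation, which additionally supports weights and group structure):

```python
import numpy as np

def medoid_index(d):
    """Medoid: index of the observation with minimal total distance to all others."""
    return int(np.argmin(d.sum(axis=1)))

def pseudo_variance(d):
    """Discrepancy of a distance matrix: sum of all pairwise distances / (2 n^2)."""
    n = d.shape[0]
    return d.sum() / (2 * n * n)

d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
print(medoid_index(d))     # 1 (row sums are 5, 4, 7)
print(pseudo_variance(d))  # 16 / 18 ~ 0.889
```

The pseudo-variance generalizes the usual variance to arbitrary dissimilarities, which is what makes the dissassoc-style association tests possible.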

TraMineRextras

Extended functionality:

library(TraMineRextras)

# Relative frequency plot
seqplot.rf(seq_data, diss = dist_matrix)

# Entropy with confidence intervals
seqplot.tentrop(seq_data)

# Event sequence distances
seqedist(event_seq)

# Time granularity changes
seqgranularity(seq_data, tspan = 2)

TanaT (Python Library)

TanaT is a Python library for temporal sequence analysis, particularly focused on patient care pathways.

Key Features

  • Multi-sequence trajectories (states, intervals, events)
  • Multidimensional temporal pattern discovery
  • Compatible with pandas DataFrames
  • Visualization with matplotlib

Core Concepts

Trajectory Types:

  • States: Categorical values at each time point
  • Intervals: Periods with start/end times
  • Events: Point occurrences with timestamps

from tanat import Trajectory, StateSequence

# Create a state sequence
seq = StateSequence(
    states=["A", "B", "B", "C", "A"],
    timestamps=[0, 1, 2, 3, 4]
)

# Multi-sequence trajectory
traj = Trajectory(
    states=state_seq,
    events=event_seq,
    intervals=interval_seq
)

Documentation Resources

TanaT provides LLM-optimized documentation:

  • llms.txt - Concise summary and index (token-efficient)
  • llms-full.txt - Complete documentation (comprehensive)

Implementation Patterns for yasqat

Based on TraMineR and TanaT, here are recommended patterns for the yasqat library:

Sequence Data Structure

from dataclasses import dataclass
from typing import Sequence, Mapping
import numpy as np

@dataclass
class Alphabet:
    """State alphabet with labels and colors."""
    states: tuple[str, ...]
    labels: Mapping[str, str] | None = None
    colors: Mapping[str, str] | None = None
    
    def __contains__(self, state: str) -> bool:
        return state in self.states

class StateSequence:
    """A single state sequence."""
    def __init__(
        self,
        states: Sequence[str],
        alphabet: Alphabet | None = None,
        weight: float = 1.0,
    ):
        self._states = tuple(states)
        # dict.fromkeys preserves first-appearance order (set() would not)
        self._alphabet = alphabet or Alphabet(tuple(dict.fromkeys(states)))
        self._weight = weight

    @property
    def alphabet(self) -> Alphabet:
        return self._alphabet

    def __len__(self) -> int:
        return len(self._states)

    def __getitem__(self, idx: int) -> str:
        return self._states[idx]

    def __iter__(self):
        return iter(self._states)

Distance Computation

def optimal_matching(
    seq1: StateSequence,
    seq2: StateSequence,
    indel: float = 1.0,
    sm: np.ndarray | str = "constant",
    normalize: bool = False,
) -> float:
    """Compute optimal matching distance between two sequences.
    
    Uses dynamic programming with O(n*m) time and space complexity.
    
    Args:
        seq1: First sequence
        seq2: Second sequence  
        indel: Insertion/deletion cost
        sm: Substitution cost matrix or method name
        normalize: Whether to normalize by sequence lengths
        
    Returns:
        Distance value (0 = identical)
    """
    n, m = len(seq1), len(seq2)
    
    # Build substitution cost matrix if needed
    if isinstance(sm, str):
        sm = build_substitution_matrix(seq1.alphabet, method=sm)

    # Map states to row/column indices of the substitution matrix
    idx = {s: k for k, s in enumerate(seq1.alphabet.states)}

    # Dynamic programming matrix
    dp = np.zeros((n + 1, m + 1))

    # Initialize borders: pure indel cost
    for i in range(n + 1):
        dp[i, 0] = i * indel
    for j in range(m + 1):
        dp[0, j] = j * indel

    # Fill matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = sm[idx[seq1[i-1]], idx[seq2[j-1]]]
            dp[i, j] = min(
                dp[i-1, j] + indel,      # Deletion
                dp[i, j-1] + indel,      # Insertion
                dp[i-1, j-1] + sub_cost  # Substitution
            )
    
    distance = dp[n, m]
    
    if normalize:
        distance /= max(n, m)
    
    return distance
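The build_substitution_matrix helper used above is left undefined. One plausible standalone sketch covering the constant and transition-rate strategies, where the TRATE cost follows the formula cost(a, b) = 2 - p(a->b) - p(b->a) used by TraMineR's seqcost (the signature and fallback handling here are assumptions, not a fixed API):

```python
import numpy as np

def build_substitution_matrix(alphabet, method="constant", sequences=None, cval=2.0):
    """Substitution cost matrix over an alphabet.

    "constant": every substitution costs cval (2.0 is the conventional default).
    "TRATE":   cost(a, b) = cval - p(a->b) - p(b->a), with transition
               probabilities estimated from `sequences`.
    """
    # Accept either an Alphabet-like object with a .states tuple or a plain tuple
    states = getattr(alphabet, "states", tuple(alphabet))
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    if method == "TRATE":
        counts = np.zeros((n, n))
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                counts[idx[a], idx[b]] += 1
        rows = counts.sum(axis=1, keepdims=True)
        rows[rows == 0] = 1.0  # avoid division by zero for unvisited states
        p = counts / rows
        sm = cval - p - p.T
    else:
        sm = np.full((n, n), cval)
    np.fill_diagonal(sm, 0.0)  # substituting a state for itself is free
    return sm

print(build_substitution_matrix(("A", "B")))  # [[0. 2.] [2. 0.]]
```

Frequent transitions between two states thus make substituting them cheap, encoding the idea that commonly exchanged states are "close".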

Visualization

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def index_plot(
    sequences: list[StateSequence],
    ax: plt.Axes | None = None,
    sort_by: str | None = None,
) -> plt.Axes:
    """Plot sequences as horizontal stacked bars.
    
    Each sequence is a row, each state is a colored segment.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(12, len(sequences) * 0.3))
    
    # Sort sequences if requested
    if sort_by == "from.start":
        sequences = sorted(sequences, key=lambda s: s[0])
    
    for i, seq in enumerate(sequences):
        colors = seq.alphabet.colors or {}  # colors may be None on the Alphabet
        for x, state in enumerate(seq):
            color = colors.get(state, "gray")
            rect = Rectangle((x, i), 1, 0.8, facecolor=color, edgecolor="none")
            ax.add_patch(rect)
    
    ax.set_xlim(0, max(len(s) for s in sequences))
    ax.set_ylim(0, len(sequences))
    ax.set_xlabel("Time")
    ax.set_ylabel("Sequence")
    
    return ax

Descriptive Statistics

import numpy as np
from collections import Counter

def longitudinal_entropy(seq: StateSequence, normalize: bool = True) -> float:
    """Calculate within-sequence entropy.
    
    Measures diversity of states visited by the sequence.
    
    Formula: H(s) = -Σ p_a * log(p_a)
    where p_a is proportion of time in state a.
    """
    counts = Counter(seq)
    n = len(seq)
    
    entropy = 0.0
    for count in counts.values():
        p = count / n
        if p > 0:
            entropy -= p * np.log(p)
    
    if normalize and len(counts) > 1:
        max_entropy = np.log(len(seq.alphabet.states))
        entropy /= max_entropy
    
    return entropy
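TraMineR's seqici combines transitions and state diversity into a single complexity score. A sketch in the spirit of Elzinga's index, taking C(s) = sqrt((q / q_max) * (h / h_max)) with q the number of transitions; treat the exact normalization as an assumption to verify against the reference:

```python
import numpy as np
from collections import Counter

def complexity_index(states, alphabet_size):
    """Composite complexity: geometric mean of the normalized transition
    count and the normalized longitudinal entropy."""
    n = len(states)
    if n < 2 or alphabet_size < 2:
        return 0.0
    q = sum(1 for a, b in zip(states, states[1:]) if a != b)
    if q == 0:
        return 0.0  # a constant sequence has zero complexity
    q_norm = q / (n - 1)
    probs = np.array([c / n for c in Counter(states).values()])
    h_norm = -(probs * np.log(probs)).sum() / np.log(alphabet_size)
    return float(np.sqrt(q_norm * h_norm))

print(complexity_index(["A"] * 5, 3))  # 0.0 (no transitions, no diversity)
```

A sequence alternating between two states at every step over a binary alphabet scores the maximum of 1.0.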

def transition_rate_matrix(sequences: list[StateSequence]) -> np.ndarray:
    """Compute state-to-state transition rate matrix.
    
    Entry [i,j] is probability of transitioning from state i to state j.
    """
    states = sequences[0].alphabet.states
    n_states = len(states)
    state_to_idx = {s: i for i, s in enumerate(states)}
    
    counts = np.zeros((n_states, n_states))
    
    for seq in sequences:
        for t in range(len(seq) - 1):
            i = state_to_idx[seq[t]]
            j = state_to_idx[seq[t + 1]]
            counts[i, j] += 1
    
    # Normalize rows
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # Avoid division by zero
    rates = counts / row_sums
    
    return rates
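The longitudinal entropy above has a cross-sectional counterpart (what seqstatd and seqHtplot report in TraMineR): the entropy of the state distribution at each time point, computed across sequences. A sketch assuming equal-length sequences:

```python
import numpy as np
from collections import Counter

def transversal_entropy(sequences, alphabet_size):
    """Entropy of the cross-sectional state distribution at each position,
    normalized by log(alphabet size). Assumes equal-length sequences."""
    n_seq = len(sequences)
    length = len(sequences[0])
    out = np.zeros(length)
    for t in range(length):
        counts = Counter(seq[t] for seq in sequences)
        if len(counts) == 1:
            continue  # all sequences agree at t: entropy is 0
        probs = np.array([c / n_seq for c in counts.values()])
        out[t] = -(probs * np.log(probs)).sum() / np.log(alphabet_size)
    return out

seqs = [["A", "B"], ["A", "A"]]
ent = transversal_entropy(seqs, 2)
print(ent)  # [0. 1.]: full agreement at t=0, even split at t=1
```

Plotting this series over time shows when the cohort's trajectories diverge and reconverge.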

Best Practices

Distance Method Selection

| Scenario | Recommended Method | Rationale |
|----------|--------------------|-----------|
| Similar-length sequences | Hamming (HAM) | Fast, simple interpretation |
| Variable-length sequences | Optimal Matching (OM) | Handles insertions/deletions |
| Timing matters | Dynamic Hamming (DHD) | Time-varying costs |
| Order matters, not timing | LCS | Ignores positional differences |
| Spells are meaningful | OMspell | Operates on spell structure |

Substitution Cost Strategies

  1. Constant (2.0): All substitutions equally costly
  2. Transition-based (TRATE): Costs inversely proportional to observed transition rates
  3. Feature-based: Costs based on state attribute differences
  4. Domain knowledge: Expert-defined costs

Performance Optimization

  • Computing the full distance matrix requires O(n²) pairwise comparisons for n sequences
  • Each OM comparison costs O(L²) for sequences of length L (Hamming is O(L))
  • Use parallel computation for large datasets
  • Consider approximate methods (e.g., k-medoids sampling)
  • Cache distance matrices when reusing
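A sketch of the pairwise computation, exploiting symmetry so each pair is computed once (the distance function here is a placeholder; swap in OM or Hamming, and the loop over pairs parallelizes cleanly, e.g. with multiprocessing):

```python
import numpy as np
from itertools import combinations

def pairwise_distance_matrix(sequences, dist_fn):
    """Symmetric pairwise distance matrix; each of the n*(n-1)/2 pairs
    is computed exactly once."""
    n = len(sequences)
    d = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        d[i, j] = d[j, i] = dist_fn(sequences[i], sequences[j])
    return d

# Toy example with a mismatch-count distance
mismatches = lambda a, b: sum(x != y for x, y in zip(a, b))
d = pairwise_distance_matrix(["AAB", "ABB", "AAB"], mismatches)
print(d)  # [[0. 1. 0.] [1. 0. 1.] [0. 1. 0.]]
```

Persisting the resulting matrix (e.g. with np.save) avoids recomputation when trying several clustering configurations.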

Interpretation Guidelines

  • High entropy: Diverse, unpredictable sequences
  • Low turbulence: Stable, few transitions
  • High complexity: Many transitions AND state diversity
  • Small distance: Similar trajectory patterns

References

  • Gabadinho, A., Ritschard, G., Müller, N.S., & Studer, M. (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software, 40(4).
  • Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A, 179(2).
  • Abbott, A., & Tsay, A. (2000). Sequence analysis and optimal matching methods in sociology. Sociological Methods & Research, 29(1).
  • Elzinga, C.H. (2010). Complexity of categorical time series. Sociological Methods & Research, 38(3).


Skill Information

Category: Data
Last Updated: 1/25/2026