Sequence Analytics

by rexarski


Expert knowledge of sequence analysis for longitudinal data, including TraMineR (R) and TanaT (Python) packages. Use when working with state sequences, event sequences, trajectory analysis, distance metrics (optimal matching, Hamming, LCS), clustering, and visualization of sequential data.


This skill provides expert guidance for sequence analysis of longitudinal data, drawing from the established TraMineR ecosystem (R) and the emerging TanaT library (Python).

Overview

Sequence analysis is a family of methods for analyzing categorical time series such as trajectories, life-course events, and other ordered categorical observations. Common applications include:

  • Social sciences: Career trajectories, family formation, educational pathways
  • Healthcare: Patient care pathways, treatment sequences, disease progression
  • User behavior: Customer journeys, clickstreams, session analysis
  • Bioinformatics: DNA/protein sequences, gene expression patterns

Key Concepts

Sequence Representations

| Format | Description | Example | Use Case |
|--------|-------------|---------|----------|
| STS (State Sequence) | One state per time unit, aligned columns | A,A,B,B,C,A | Standard analysis, fixed time grid |
| SPS (State-Permanence) | Run-length encoded with durations | (A,2)-(B,2)-(C,1)-(A,1) | Variable-length spells, compact storage |
| DSS (Distinct Successive States) | Consecutive duplicates removed | A,B,C,A | Focus on transitions, not durations |
| SPELL | Begin/end/status triplets | (0,2,A),(2,4,B),(4,5,C) | Episode data, overlapping intervals |
| TSE (Time-Stamped Events) | Event occurrences with timestamps | (t1,E1),(t2,E2) | Point events, event sequence mining |
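These formats interconvert mechanically. A minimal Python sketch of the STS-to-SPS and STS-to-DSS conversions (function names are illustrative, not from TraMineR or TanaT):

```python
from itertools import groupby

def sts_to_sps(states):
    """Run-length encode an STS sequence into (state, duration) spells."""
    return [(s, sum(1 for _ in run)) for s, run in groupby(states)]

def sts_to_dss(states):
    """Drop consecutive duplicates, keeping only distinct successive states."""
    return [s for s, _ in groupby(states)]

sts = ["A", "A", "B", "B", "C", "A"]
print(sts_to_sps(sts))  # [('A', 2), ('B', 2), ('C', 1), ('A', 1)]
print(sts_to_dss(sts))  # ['A', 'B', 'C', 'A']
```

Note that SPS and DSS discard information differently: SPS is lossless, while DSS cannot be inverted back to STS without the durations.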

Alphabets and States

  • Alphabet: The set of all possible states (e.g., {employed, unemployed, student})
  • State: A single categorical value at a time point
  • Spell: A contiguous period in the same state
  • Transition: A change from one state to another

Missing Data Handling

  • Void (%): Position doesn't exist (sequence ends before this point)
  • Missing (*): State exists but is unknown
  • Left-censoring: The trajectory began before the observation window, so initial states are unknown
  • Right-censoring: Observation ended before the trajectory completed, so final states are unknown

TraMineR (R Package)

TraMineR is the reference implementation for sequence analysis, developed at the University of Geneva.

Core Functions

Sequence Creation

# Create state sequence object from wide-format data
library(TraMineR)
seq_data <- seqdef(
  data,                    # DataFrame with sequence data
  var = 2:10,              # Column indices for states
  alphabet = c("A", "B", "C"),  # Valid states
  states = c("a", "b", "c"),    # Short labels for display
  labels = c("State A", "State B", "State C"),  # Long labels
  cpal = c("#E41A1C", "#377EB8", "#4DAF4A"),    # Color palette
  weights = data$weight,   # Optional case weights
  missing = NA,            # Missing value indicator
  void = "%"               # Void state indicator
)

# Convert between formats
sps_data <- seqformat(seq_data, from = "STS", to = "SPS")

Descriptive Statistics

# State distribution at each time point
seqstatd(seq_data)         # Cross-sectional state frequencies

# Individual sequence characteristics
seqlength(seq_data)        # Sequence lengths
seqient(seq_data)          # Within-sequence entropy
seqST(seq_data)            # Turbulence index
seqici(seq_data)           # Complexity index
seqtransn(seq_data)        # Number of transitions

# Duration and spells
seqdur(seq_data)           # Spell durations
seqdss(seq_data)           # Distinct successive states
seqmeant(seq_data)         # Mean time in each state

# Transition analysis
seqtrate(seq_data)         # Transition rate matrix
seqsubsn(seq_data)         # Number of subsequences

Distance Metrics

# Pairwise sequence distances
dist_matrix <- seqdist(
  seq_data,
  method = "OM",           # Distance method
  indel = 1.0,             # Insertion/deletion cost
  sm = "TRATE",            # Substitution cost matrix
  norm = "auto",           # Normalization method
  with.missing = FALSE     # Handle missing values
)

# Generate substitution cost matrix
sub_costs <- seqcost(
  seq_data,
  method = "TRATE",        # Based on transition rates
  time.varying = FALSE     # Constant or time-varying
)

Distance Methods:

| Method | Full Name | Description |
|--------|-----------|-------------|
| OM | Optimal Matching | Edit distance with indel and substitution operations |
| HAM | Hamming | Positional mismatch count (no indels) |
| DHD | Dynamic Hamming | Hamming with time-varying substitution costs |
| LCS | Longest Common Subsequence | Based on shared subsequences |
| LCP | Longest Common Prefix | Prefix similarity |
| OMloc | Localized OM | Position-specific costs |
| OMslen | OM Spell-Length | Spell-length-sensitive costs |
| OMspell | OM Spell | Operates on spells, not positions |
| CHI2 | Chi-squared | Based on state distribution differences |
| EUCLID | Euclidean | Euclidean distance in state-time space |
| TWED | Time Warp Edit Distance | Edit distance allowing time warping |
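For equal-length sequences, the Hamming variant is simply a positional mismatch count. A minimal Python sketch, assuming a constant substitution cost (DHD would replace the constant with a position-dependent cost):

```python
def hamming_distance(seq1, seq2, sub_cost=1.0):
    """Positional mismatch count; defined only for equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(sub_cost for a, b in zip(seq1, seq2) if a != b)

print(hamming_distance("AABBC", "ABBBC"))  # 1.0 (single mismatch at position 1)
```

Because no insertions or deletions are considered, each comparison is O(L) rather than the O(L²) of optimal matching.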

Visualization

# Index plot - individual sequences as horizontal bars
seqIplot(seq_data, sortv = "from.start")

# State distribution plot - stacked area chart over time
seqdplot(seq_data)

# Frequency plot - most frequent sequences
seqfplot(seq_data, pbarw = TRUE)

# Mean time plot - bar chart of time in each state
seqmtplot(seq_data)

# Modal state plot - sequence of most common states
seqmsplot(seq_data)

# Entropy plot - transversal entropy over time
seqHtplot(seq_data)

# Representative sequences plot
seqrplot(seq_data, diss = dist_matrix, criterion = "density")

Clustering and Representatives

# Cluster sequences using distance matrix
library(cluster)
clust <- agnes(dist_matrix, diss = TRUE, method = "ward")
groups <- cutree(clust, k = 4)

# Find representative sequences
reps <- seqrep(seq_data, diss = dist_matrix, criterion = "density")

# Medoid (central sequence)
medoid <- disscenter(dist_matrix, medoids.index = "first")

# Pseudo-variance (dispersion)
dissvar(dist_matrix)

# Association with covariates
dissassoc(dist_matrix, group = groups)
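The medoid and pseudo-variance have direct definitions in terms of the distance matrix, so they translate readily to Python. A sketch of those definitions (not TraMineR's implementation, which additionally supports weights and group structure):

```python
import numpy as np

def medoid_index(d):
    """Medoid: index of the observation with minimal total distance to all others."""
    return int(np.argmin(d.sum(axis=1)))

def pseudo_variance(d):
    """Discrepancy of a distance matrix: sum of all pairwise distances / (2 n^2)."""
    n = d.shape[0]
    return d.sum() / (2 * n * n)

d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
print(medoid_index(d))     # 1 (row sums are 5, 4, 7)
print(pseudo_variance(d))  # 16 / 18 ~ 0.889
```

The pseudo-variance generalizes the usual variance to arbitrary dissimilarities, which is what makes the dissassoc-style association tests possible.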

TraMineRextras

Extended functionality:

library(TraMineRextras)

# Relative frequency plot
seqplot.rf(seq_data, diss = dist_matrix)

# Entropy with confidence intervals
seqplot.tentrop(seq_data)

# Event sequence distances
seqedist(event_seq)

# Time granularity changes
seqgranularity(seq_data, tspan = 2)

TanaT (Python Library)

TanaT is a Python library for temporal sequence analysis, particularly focused on patient care pathways.

Key Features

  • Multi-sequence trajectories (states, intervals, events)
  • Multidimensional temporal pattern discovery
  • Compatible with pandas DataFrames
  • Visualization with matplotlib

Core Concepts

Trajectory Types:

  • States: Categorical values at each time point
  • Intervals: Periods with start/end times
  • Events: Point occurrences with timestamps

from tanat import Trajectory, StateSequence

# Create a state sequence
seq = StateSequence(
    states=["A", "B", "B", "C", "A"],
    timestamps=[0, 1, 2, 3, 4]
)

# Multi-sequence trajectory
traj = Trajectory(
    states=state_seq,
    events=event_seq,
    intervals=interval_seq
)

Documentation Resources

TanaT provides LLM-optimized documentation:

  • llms.txt - Concise summary and index (token-efficient)
  • llms-full.txt - Complete documentation (comprehensive)

Implementation Patterns for yasqat

Based on TraMineR and TanaT, here are recommended patterns for the yasqat library:

Sequence Data Structure

from dataclasses import dataclass
from typing import Sequence, Mapping
import numpy as np

@dataclass
class Alphabet:
    """State alphabet with labels and colors."""
    states: tuple[str, ...]
    labels: Mapping[str, str] | None = None
    colors: Mapping[str, str] | None = None
    
    def __contains__(self, state: str) -> bool:
        return state in self.states

class StateSequence:
    """A single state sequence."""
    def __init__(
        self,
        states: Sequence[str],
        alphabet: Alphabet | None = None,
        weight: float = 1.0,
    ):
        self._states = tuple(states)
        # dict.fromkeys preserves first-appearance order (set() would not)
        self._alphabet = alphabet or Alphabet(tuple(dict.fromkeys(states)))
        self._weight = weight

    @property
    def alphabet(self) -> Alphabet:
        return self._alphabet

    def __len__(self) -> int:
        return len(self._states)

    def __getitem__(self, idx: int) -> str:
        return self._states[idx]

    def __iter__(self):
        return iter(self._states)

Distance Computation

def optimal_matching(
    seq1: StateSequence,
    seq2: StateSequence,
    indel: float = 1.0,
    sm: np.ndarray | str = "constant",
    normalize: bool = False,
) -> float:
    """Compute optimal matching distance between two sequences.
    
    Uses dynamic programming with O(n*m) time and space complexity.
    
    Args:
        seq1: First sequence
        seq2: Second sequence  
        indel: Insertion/deletion cost
        sm: Substitution cost matrix or method name
        normalize: Whether to normalize by sequence lengths
        
    Returns:
        Distance value (0 = identical)
    """
    n, m = len(seq1), len(seq2)
    
    # Build substitution cost matrix if needed
    if isinstance(sm, str):
        sm = build_substitution_matrix(seq1.alphabet, method=sm)

    # Map states to row/column indices of the substitution matrix
    idx = {s: k for k, s in enumerate(seq1.alphabet.states)}

    # Dynamic programming matrix
    dp = np.zeros((n + 1, m + 1))

    # Initialize borders: pure indel cost
    for i in range(n + 1):
        dp[i, 0] = i * indel
    for j in range(m + 1):
        dp[0, j] = j * indel

    # Fill matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = sm[idx[seq1[i-1]], idx[seq2[j-1]]]
            dp[i, j] = min(
                dp[i-1, j] + indel,      # Deletion
                dp[i, j-1] + indel,      # Insertion
                dp[i-1, j-1] + sub_cost  # Substitution
            )
    
    distance = dp[n, m]
    
    if normalize:
        distance /= max(n, m)
    
    return distance
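The build_substitution_matrix helper used above is left undefined. One plausible standalone sketch covering the constant and transition-rate strategies, where the TRATE cost follows the formula cost(a, b) = 2 - p(a->b) - p(b->a) used by TraMineR's seqcost (the signature and fallback handling here are assumptions, not a fixed API):

```python
import numpy as np

def build_substitution_matrix(alphabet, method="constant", sequences=None, cval=2.0):
    """Substitution cost matrix over an alphabet.

    "constant": every substitution costs cval (2.0 is the conventional default).
    "TRATE":   cost(a, b) = cval - p(a->b) - p(b->a), with transition
               probabilities estimated from `sequences`.
    """
    # Accept either an Alphabet-like object with a .states tuple or a plain tuple
    states = getattr(alphabet, "states", tuple(alphabet))
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    if method == "TRATE":
        counts = np.zeros((n, n))
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                counts[idx[a], idx[b]] += 1
        rows = counts.sum(axis=1, keepdims=True)
        rows[rows == 0] = 1.0  # avoid division by zero for unvisited states
        p = counts / rows
        sm = cval - p - p.T
    else:
        sm = np.full((n, n), cval)
    np.fill_diagonal(sm, 0.0)  # substituting a state for itself is free
    return sm

print(build_substitution_matrix(("A", "B")))  # [[0. 2.] [2. 0.]]
```

Frequent transitions between two states thus make substituting them cheap, encoding the idea that commonly exchanged states are "close".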

Visualization

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def index_plot(
    sequences: list[StateSequence],
    ax: plt.Axes | None = None,
    sort_by: str | None = None,
) -> plt.Axes:
    """Plot sequences as horizontal stacked bars.
    
    Each sequence is a row, each state is a colored segment.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(12, len(sequences) * 0.3))
    
    # Sort sequences if requested
    if sort_by == "from.start":
        sequences = sorted(sequences, key=lambda s: s[0])
    
    for i, seq in enumerate(sequences):
        colors = seq.alphabet.colors or {}  # colors may be None on the Alphabet
        for x, state in enumerate(seq):
            color = colors.get(state, "gray")
            rect = Rectangle((x, i), 1, 0.8, facecolor=color, edgecolor="none")
            ax.add_patch(rect)
    
    ax.set_xlim(0, max(len(s) for s in sequences))
    ax.set_ylim(0, len(sequences))
    ax.set_xlabel("Time")
    ax.set_ylabel("Sequence")
    
    return ax

Descriptive Statistics

import numpy as np
from collections import Counter

def longitudinal_entropy(seq: StateSequence, normalize: bool = True) -> float:
    """Calculate within-sequence entropy.
    
    Measures diversity of states visited by the sequence.
    
    Formula: H(s) = -Σ p_a * log(p_a)
    where p_a is proportion of time in state a.
    """
    counts = Counter(seq)
    n = len(seq)
    
    entropy = 0.0
    for count in counts.values():
        p = count / n
        if p > 0:
            entropy -= p * np.log(p)
    
    if normalize and len(counts) > 1:
        max_entropy = np.log(len(seq.alphabet.states))
        entropy /= max_entropy
    
    return entropy
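TraMineR's seqici combines transitions and state diversity into a single complexity score. A sketch in the spirit of Elzinga's index, taking C(s) = sqrt((q / q_max) * (h / h_max)) with q the number of transitions; treat the exact normalization as an assumption to verify against the reference:

```python
import numpy as np
from collections import Counter

def complexity_index(states, alphabet_size):
    """Composite complexity: geometric mean of the normalized transition
    count and the normalized longitudinal entropy."""
    n = len(states)
    if n < 2 or alphabet_size < 2:
        return 0.0
    q = sum(1 for a, b in zip(states, states[1:]) if a != b)
    if q == 0:
        return 0.0  # a constant sequence has zero complexity
    q_norm = q / (n - 1)
    probs = np.array([c / n for c in Counter(states).values()])
    h_norm = -(probs * np.log(probs)).sum() / np.log(alphabet_size)
    return float(np.sqrt(q_norm * h_norm))

print(complexity_index(["A"] * 5, 3))  # 0.0 (no transitions, no diversity)
```

A sequence alternating between two states at every step over a binary alphabet scores the maximum of 1.0.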

def transition_rate_matrix(sequences: list[StateSequence]) -> np.ndarray:
    """Compute state-to-state transition rate matrix.
    
    Entry [i,j] is probability of transitioning from state i to state j.
    """
    states = sequences[0].alphabet.states
    n_states = len(states)
    state_to_idx = {s: i for i, s in enumerate(states)}
    
    counts = np.zeros((n_states, n_states))
    
    for seq in sequences:
        for t in range(len(seq) - 1):
            i = state_to_idx[seq[t]]
            j = state_to_idx[seq[t + 1]]
            counts[i, j] += 1
    
    # Normalize rows
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # Avoid division by zero
    rates = counts / row_sums
    
    return rates
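The longitudinal entropy above has a cross-sectional counterpart (what seqstatd and seqHtplot report in TraMineR): the entropy of the state distribution at each time point, computed across sequences. A sketch assuming equal-length sequences:

```python
import numpy as np
from collections import Counter

def transversal_entropy(sequences, alphabet_size):
    """Entropy of the cross-sectional state distribution at each position,
    normalized by log(alphabet size). Assumes equal-length sequences."""
    n_seq = len(sequences)
    length = len(sequences[0])
    out = np.zeros(length)
    for t in range(length):
        counts = Counter(seq[t] for seq in sequences)
        if len(counts) == 1:
            continue  # all sequences agree at t: entropy is 0
        probs = np.array([c / n_seq for c in counts.values()])
        out[t] = -(probs * np.log(probs)).sum() / np.log(alphabet_size)
    return out

seqs = [["A", "B"], ["A", "A"]]
ent = transversal_entropy(seqs, 2)
print(ent)  # [0. 1.]: full agreement at t=0, even split at t=1
```

Plotting this series over time shows when the cohort's trajectories diverge and reconverge.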

Best Practices

Distance Method Selection

| Scenario | Recommended Method | Rationale |
|----------|--------------------|-----------|
| Similar-length sequences | Hamming (HAM) | Fast, simple interpretation |
| Variable-length sequences | Optimal Matching (OM) | Handles insertions/deletions |
| Timing matters | Dynamic Hamming (DHD) | Time-varying costs |
| Order matters, not timing | LCS | Ignores positional differences |
| Spells are meaningful | OMspell | Operates on spell structure |

Substitution Cost Strategies

  1. Constant (2.0): All substitutions equally costly
  2. Transition-based (TRATE): Costs inversely proportional to observed transition rates
  3. Feature-based: Costs based on state attribute differences
  4. Domain knowledge: Expert-defined costs

Performance Optimization

  • Computing the full distance matrix requires O(n²) pairwise comparisons for n sequences
  • Each OM comparison costs O(L²) for sequences of length L (Hamming is O(L))
  • Use parallel computation for large datasets
  • Consider approximate methods (e.g., k-medoids sampling)
  • Cache distance matrices when reusing
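A sketch of the pairwise computation, exploiting symmetry so each pair is computed once (the distance function here is a placeholder; swap in OM or Hamming, and the loop over pairs parallelizes cleanly, e.g. with multiprocessing):

```python
import numpy as np
from itertools import combinations

def pairwise_distance_matrix(sequences, dist_fn):
    """Symmetric pairwise distance matrix; each of the n*(n-1)/2 pairs
    is computed exactly once."""
    n = len(sequences)
    d = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        d[i, j] = d[j, i] = dist_fn(sequences[i], sequences[j])
    return d

# Toy example with a mismatch-count distance
mismatches = lambda a, b: sum(x != y for x, y in zip(a, b))
d = pairwise_distance_matrix(["AAB", "ABB", "AAB"], mismatches)
print(d)  # [[0. 1. 0.] [1. 0. 1.] [0. 1. 0.]]
```

Persisting the resulting matrix (e.g. with np.save) avoids recomputation when trying several clustering configurations.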

Interpretation Guidelines

  • High entropy: Diverse, unpredictable sequences
  • Low turbulence: Stable, few transitions
  • High complexity: Many transitions AND state diversity
  • Small distance: Similar trajectory patterns

References

  • Gabadinho, A., Ritschard, G., Müller, N.S., & Studer, M. (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software, 40(4).
  • Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A, 179(2).
  • Abbott, A., & Tsay, A. (2000). Sequence analysis and optimal matching methods in sociology. Sociological Methods & Research, 29(1).
  • Elzinga, C.H. (2010). Complexity of categorical time series. Sociological Methods & Research, 38(3).


Skill Information

Category: Data
Last Updated: 1/25/2026