name: clio-clustering
description: |
  Build a complete data clustering and visualization pipeline from any data source.
  Use when the user wants to analyze patterns in text data (GitHub issues, Slack messages,
  support tickets, code reviews, forum posts, customer feedback, etc.), cluster similar items,
  or build an interactive visualization to explore the patterns.
  Triggers on: "cluster", "analyze patterns", "group similar", "clio-style", "pattern analysis",
  "visualize clusters", "find themes", "topic modeling", "semantic clustering".
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Grep, Glob, WebFetch, WebSearch

Clio-Style Clustering Pipeline

Build an end-to-end semantic clustering analysis from any text data source, with interactive visualization.

What This Skill Does

This skill guides you through building a complete clustering pipeline:

Data Sourcing - Identify APIs/methods to fetch data, build tests to verify access
Scraping - Collect data with proper pagination and rate limiting
Embedding - Generate embeddings using OpenAI's text-embedding-3-large
Clustering - Hierarchical HDBSCAN clustering with UMAP projection
Labeling - LLM-powered cluster naming and description
Visualization - Interactive React/D3 explorer with drill-down

Quick Start

When the user describes a data source (e.g., "GitHub issues from facebook/react"), follow these steps:

Phase 1: Data Source Discovery

First, identify how to access the data:

Research the API - Use web search to find official API documentation
Identify authentication - What tokens/keys are needed?
Find pagination patterns - How does the API handle large datasets?
Determine rate limits - What are the constraints?

See data-sourcing.md for common patterns (GitHub, Slack, etc.)
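
The pagination and rate-limit questions above can be sketched generically before any real API is chosen. The following is a minimal sketch, not the template from data-sourcing.md: `fetch_page`, `min_interval`, and `max_pages` are hypothetical names, and it assumes page-number pagination where an empty page marks the end. Cursor-based or Link-header APIs would need a different loop.

```python
import time
from typing import Callable, Iterator, List


def paginate(
    fetch_page: Callable[[int], List[dict]],
    min_interval: float = 1.0,
    max_pages: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield items page by page, pausing between requests.

    fetch_page is a hypothetical callable that takes a 1-based page number
    and returns a list of items; an empty list signals the last page.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        yield from items
        # Crude rate limiting: a fixed delay after each non-empty page.
        sleep(min_interval)
```

Passing the fetcher in as a callable keeps the loop testable without network access; a real fetcher would wrap the authenticated request and could raise on non-200 responses.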

Phase 2: Build & Test Data Fetcher

IMPORTANT: Write tests BEFORE building the full scraper.

# test_fetcher.py - Verify API access works
import os
import requests

def test_api_access():
    """Verify we can access the API."""
    # Adapt this for your specific data source
    token = os.environ.get('API_TOKEN')
    assert token, "API_TOKEN not set"

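    response = requests.get(
        'https://api.example.com/endpoint',
        headers={'Authorization': f'Bearer {token}'},
        timeout=30,  # fail fast instead of hanging on an unresponsive API
    )
    assert response.status_code == 200, f"Unexpected status: {response.status_code}"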

    data = response.json()
    assert len(data) > 0, "No data returned"
    print(f"Successfully fetched {len(data)} items")

if __name__ == '__main__':
    test_api_access()

Run the test: python test_fetcher.py

Only proceed to the full scraper once tests pass.

Phase 3: Build the Scraper

Create a scraper that:

Handles pagination efficiently
Respects rate limits
Stores fetched items in SQLite
Saves progress (for example, the last page or cursor fetched) so an interrupted scrape can resume

See data-sourcing.md for the database schema and scraper template.
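
As a rough illustration of the resumability requirement, the sketch below stores each page of items together with the next page to fetch, in one transaction. The two-table schema and the function names (`open_db`, `save_page`, `load_next_page`) are assumptions made for this example and are not the schema from data-sourcing.md.

```python
import json
import sqlite3
from typing import List


def open_db(path: str = "data/items.db") -> sqlite3.Connection:
    """Open (or create) the database; the parent directory must already exist."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS items (
            id TEXT PRIMARY KEY,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS progress (
            key TEXT PRIMARY KEY,
            value TEXT NOT NULL
        );
        """
    )
    return conn


def save_page(conn: sqlite3.Connection, items: List[dict], next_page: int) -> None:
    """Store one page of items and the resume point in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO items (id, payload) VALUES (?, ?)",
            [(str(item["id"]), json.dumps(item)) for item in items],
        )
        conn.execute(
            "INSERT OR REPLACE INTO progress (key, value) VALUES ('next_page', ?)",
            (str(next_page),),
        )


def load_next_page(conn: sqlite3.Connection) -> int:
    """Return the page to resume from (1 if nothing has been scraped yet)."""
    row = conn.execute(
        "SELECT value FROM progress WHERE key = 'next_page'"
    ).fetchone()
    return int(row[0]) if row else 1
```

Writing the items and the resume point in the same transaction means a crash cannot leave the progress marker ahead of the stored data, and `INSERT OR REPLACE` makes re-fetching a page harmless.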

Phase 4: Generate Embeddings & Cluster

Use the clustering pipeline to:

Generate embeddings with OpenAI
Run hierarchical HDBSCAN clustering
Project to 2D with UMAP
Label clusters with LLM

See clustering-reference.md for the complete implementation.
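
For orientation, the first three steps might look roughly like the sketch below. It is a simplified, single-level illustration and not the hierarchical implementation in clustering-reference.md. The helper names, the batch size of 100, and `min_cluster_size=15` are assumptions. The third-party imports are deferred into the functions that need them, so the batching helpers can be exercised without the libraries or an API key.

```python
from typing import Callable, Iterator, List, Sequence

EMBEDDING_MODEL = "text-embedding-3-large"


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def openai_embed_batch(texts: Sequence[str]) -> List[List[float]]:
    """Embed one batch with the OpenAI client (reads OPENAI_API_KEY)."""
    from openai import OpenAI  # lazy import: only needed for real API calls

    client = OpenAI()
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=list(texts))
    return [row.embedding for row in response.data]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]] = openai_embed_batch,
    batch_size: int = 100,  # assumed value; tune to the API's request limits
) -> List[List[float]]:
    """Embed all texts in order, one batch at a time."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def cluster_and_project(vectors: List[List[float]], min_cluster_size: int = 15):
    """Return single-level HDBSCAN labels (-1 means noise) and 2D UMAP coordinates."""
    import hdbscan
    import numpy as np
    import umap

    data = np.asarray(vectors, dtype=np.float64)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(data)
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(data)
    return labels.tolist(), coords.tolist()
```

Here UMAP is used only for the 2D plot coordinates, while clustering runs on the full embeddings; reducing dimensionality before clustering is a common variation that this sketch does not attempt.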

Phase 5: Build Visualization

Set up the interactive visualization:

Export data to JSON
Create Next.js app with D3 visualization
Add hierarchical drill-down view

See visualization-setup.md for setup instructions.

The components/ directory contains ready-to-copy React components.
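
The export step could be as small as the sketch below, which writes the two files the visualizer reads from `public/data/`. The record shapes (`id`, `text`, `cluster`, `x`, `y` for items; `id`, `label`, `size` for clusters) are assumptions for illustration; the ready-made components may expect different fields, so check visualization-setup.md before relying on them.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def export_for_visualizer(
    items: List[dict],
    labels: Dict[int, str],
    out_dir: str = "visualizer/public/data",
) -> None:
    """Write items.json and clusters.json into out_dir.

    Each item is assumed to look like
    {"id": ..., "text": ..., "cluster": int, "x": float, "y": float};
    `labels` maps cluster id -> human-readable name.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Derive cluster sizes from the items so the two files cannot disagree.
    sizes = Counter(item["cluster"] for item in items)
    clusters = [
        {
            "id": cid,
            # -1 is the HDBSCAN noise label; unnamed clusters get a fallback name.
            "label": labels.get(cid, "Noise" if cid == -1 else f"Cluster {cid}"),
            "size": size,
        }
        for cid, size in sorted(sizes.items())
    ]

    (out / "items.json").write_text(json.dumps(items, indent=2), encoding="utf-8")
    (out / "clusters.json").write_text(json.dumps(clusters, indent=2), encoding="utf-8")
```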

Project Structure

When complete, the project should look like:

project/
├── data/
│   └── items.db              # SQLite database
├── pipeline/
│   ├── __init__.py
│   ├── db.py                 # Database operations
│   ├── scraper.py            # Data fetcher
│   ├── embed.py              # Embedding generation
│   ├── cluster.py            # HDBSCAN clustering
│   ├── describe.py           # LLM labeling
│   └── export.py             # JSON export
├── visualizer/
│   ├── app/
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── components/
│   │   ├── HierarchicalView.tsx
│   │   ├── ScatterPlot.tsx
│   │   └── ...
│   ├── lib/
│   │   ├── types.ts
│   │   ├── data.ts
│   │   └── utils.ts
│   └── public/data/
│       ├── items.json
│       └── clusters.json
├── test_fetcher.py           # API access tests
├── requirements.txt
└── README.md

Dependencies

Python (for pipeline)

openai>=1.0
instructor>=1.0
hdbscan>=0.8.33
umap-learn>=0.5
scikit-learn>=1.3
numpy>=1.24
rich>=13.0

Node.js (for visualization)

{
  "dependencies": {
    "next": "14.2.0",
    "react": "^18.2.0",
    "d3": "^7.8.5",
    "framer-motion": "^11.0.0",
    "tailwindcss": "^3.4.1"
  }
}

Environment Variables

OPENAI_API_KEY=sk-...           # Required for embeddings and labeling
# Plus whatever auth your data source needs:
GITHUB_TOKEN=ghp_...            # For GitHub
SLACK_TOKEN=xoxb-...            # For Slack
# etc.
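
A small startup check can surface missing variables before any API call is made. This is a minimal sketch; `missing_env` is a hypothetical helper, not part of the pipeline described here.

```python
import os
from typing import Iterable, List, Mapping


def missing_env(
    required: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not environ.get(name)]
```

A pipeline entry point could call `missing_env(["OPENAI_API_KEY", "GITHUB_TOKEN"])` and exit with a clear message if the returned list is non-empty.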

Running the Pipeline

# 1. Test API access
python test_fetcher.py

# 2. Scrape data
python -m pipeline.scraper

# 3. Generate embeddings
python -m pipeline.embed

# 4. Cluster
python -m pipeline.cluster

# 5. Label clusters with LLM
python -m pipeline.describe

# 6. Export for visualization
python -m pipeline.export

# 7. Run visualizer
cd visualizer && npm run dev
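
Steps 2 through 6 can also be chained from one script that stops at the first failing stage. The sketch below is one possible wrapper, assuming each module is runnable with `python -m` as shown above; `run_pipeline` and `run_stage` are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

STAGES = [
    "pipeline.scraper",
    "pipeline.embed",
    "pipeline.cluster",
    "pipeline.describe",
    "pipeline.export",
]


def run_stage(module: str) -> int:
    """Run one stage as `python -m <module>` and return its exit code."""
    return subprocess.run([sys.executable, "-m", module]).returncode


def run_pipeline(
    stages: Sequence[str] = STAGES,
    run: Callable[[str], int] = run_stage,
) -> List[str]:
    """Run stages in order, stopping at the first non-zero exit code.

    Returns the list of stages that completed successfully.
    """
    completed: List[str] = []
    for module in stages:
        if run(module) != 0:
            break
        completed.append(module)
    return completed
```

Because each stage reads its input from SQLite, a failed run can be restarted from the stage that failed instead of from the beginning.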

Key Design Decisions

SQLite for storage - Simple, portable, supports resumability
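HDBSCAN over K-means - Needs no preset cluster count, finds clusters of varying density, and labels outliers as noise
3-level hierarchy - Coarse (L1) -> Medium (L2) -> Fine (L3)
UMAP for projection - Preserves local neighborhoods while typically retaining more global structure than t-SNE
text-embedding-3-large - OpenAI's larger third-generation embedding model (3072 dimensions by default), favoring semantic-similarity quality over cost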
Next.js + D3 - Fast, interactive visualization with SSR support
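
To make the 3-level hierarchy concrete, the sketch below nests item ids under their L1 -> L2 -> L3 labels, which is roughly the shape a drill-down view needs. The input format and the choice to send any item labelled -1 at any level to a single noise bucket are assumptions of this example, not documented behavior of the pipeline.

```python
from typing import Dict, Tuple


def build_hierarchy(assignments: Dict[str, Tuple[int, int, int]]) -> dict:
    """Nest item ids under their L1 -> L2 -> L3 cluster labels.

    `assignments` maps item id -> (l1, l2, l3). Items labelled -1
    (HDBSCAN noise) at any level are collected under "noise" instead.
    """
    tree: dict = {"children": {}, "noise": []}
    for item_id, (l1, l2, l3) in assignments.items():
        if -1 in (l1, l2, l3):
            tree["noise"].append(item_id)
            continue
        level1 = tree["children"].setdefault(l1, {"children": {}})
        level2 = level1["children"].setdefault(l2, {"children": {}})
        leaf = level2["children"].setdefault(l3, {"items": []})
        leaf["items"].append(item_id)
    return tree
```

For example, `build_hierarchy({"a": (0, 0, 0), "b": (0, 0, 1)})` places both items under the same L1 and L2 nodes but in separate L3 leaves.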

Detailed Documentation

Data Sourcing Patterns (data-sourcing.md) - API patterns, auth, pagination
Clustering Implementation (clustering-reference.md) - Embedding, HDBSCAN, UMAP code
Visualization Setup (visualization-setup.md) - Next.js app and components
