---
name: aind-infrastructure
description: Knowledge about AIND data infrastructure including MongoDB/DocumentDB access patterns, S3 asset storage, collection schemas, and query patterns. Use when working with AIND analysis results, querying metadata, or accessing stored assets.
---
# AIND Analysis Infrastructure
This skill captures knowledge about the Allen Institute for Neural Dynamics (AIND) data infrastructure for analysis results.
## Overview

AIND stores analysis results in a two-tier system:

- **MongoDB (DocumentDB)**: Metadata and pointers to S3 assets
- **S3**: Actual result files (figures, tables, videos)
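Putting the two tiers together, a typical access pattern looks like this minimal sketch (the collection name and fields match the examples later in this document; the bucket is public, so anonymous S3 access suffices):

```python
import s3fs
from aind_data_access_api.document_db import MetadataDbClient

# Tier 1: query metadata from DocumentDB
client = MetadataDbClient(
    host="api.allenneuraldynamics.org",
    database="analysis",
    collection="dynamic-foraging-model-fitting",
)
records = client.retrieve_docdb_records(
    filter_query={"status": "success"},
    projection={"_id": 1, "s3_location": 1},
    paginate=False,
)

# Tier 2: follow the S3 pointer to the actual result files
fs = s3fs.S3FileSystem(anon=True)
if records:
    print(fs.ls(records[0]["s3_location"]))
```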
## MongoDB Access

### Connection Pattern

Use `aind-data-access-api` for DocumentDB access:
```python
from aind_data_access_api.document_db import MetadataDbClient

client = MetadataDbClient(
    host="api.allenneuraldynamics.org",  # Public API endpoint
    database="analysis",                 # Analysis database
    collection="<collection-name>",      # Project-specific collection
)
```
### One Collection Per Project

Each analysis project has its own collection. Examples:

- `dynamic-foraging-model-fitting` - Foraging behavior MLE fitting
- Other projects follow the same pattern
### Query Patterns

```python
# Basic query - old format only
records = client.retrieve_docdb_records(
    filter_query={"status": "success"},
    projection={"_id": 1, "subject_id": 1, "s3_location": 1},
    paginate=False,  # Set True for large queries
)

# Query for new format
records = client.retrieve_docdb_records(
    filter_query={"processing.data_processes.output_parameters.additional_info": "success"},
    projection={"_id": 1, "location": 1},
    paginate=False,
)
```
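`filter_query` accepts standard MongoDB query operators. As a sketch, an old-format query restricted to a few subjects (the subject IDs are illustrative):

```python
# Old-format query using a MongoDB operator ($in) in filter_query
records = client.retrieve_docdb_records(
    filter_query={
        "status": "success",
        "subject_id": {"$in": ["744040", "781575"]},  # illustrative subject IDs
    },
    projection={"_id": 1, "subject_id": 1, "s3_location": 1},
    paginate=True,  # paginate when the result set may be large
)
```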
Note: Due to the two different formats, it's recommended to use `aind-analysis-arch-result-access`, which handles both formats automatically.
## Two Pipeline Formats

AIND has two analysis pipeline formats with different document structures:

### 1. Prototype Pipeline (older format)

Flat structure with fields at root level:
```json
{
  "_id": "fe96ff6c9e7b...",  // SHA256 hash
  "subject_id": "744040",
  "session_date": "2024-12-09",
  "status": "success",
  "s3_location": "s3://aind-dynamic-foraging-analysis-prod-o5171v/fe96ff6c...",
  "analysis_datetime": "2025-01-18T05:14:46.722265",
  "nwb_name": "744040_2024-12-09_13-30-23.nwb",
  "analysis_spec": {
    "analysis_name": "MLE fitting",
    "analysis_args": { ... }
  },
  "analysis_results": {
    "fit_settings": { "agent_alias": "QLearning_L1F1_CK1_softmax", ... },
    "params": { ... },
    "log_likelihood": -144.83,
    "AIC": 299.67,
    "BIC": 319.50,
    "n_trials": 390,
    "prediction_accuracy": 0.807,
    "cross_validation": { ... }
  }
}
```
Key paths:

- `subject_id` → root level
- `session_date` → root level
- `status` → root level
- S3 location → `s3_location`
- Fitting results → `analysis_results.*`
- Agent alias → `analysis_results.fit_settings.agent_alias`
### 2. AIND Analysis Framework (new format)

Nested structure following `aind-data-schema`:
```json
{
  "_id": "d2a652c73ee8420a...",  // Shorter UUID
  "object_type": "Metadata",
  "name": "7d9b907880012b65...",
  "location": "s3://aind-analysis-prod-o5171v/dynamic-foraging-model-fitting/7d9b...",
  "processing": {
    "data_processes": [
      {
        "name": "han_df_mle_aind-analysis-wrapper",
        "start_date_time": "2026-01-10T04:07:11",
        "output_parameters": {
          "additional_info": "success",
          "subject_id": "781575",
          "session_date": "2025-07-14",
          "nwb_name": "behavior_781575_2025-07-14_21-41-11.nwb",
          "fitting_results": {
            "fit_settings": { "agent_alias": "ForagingCompareThreshold", ... },
            "params": { ... },
            "log_likelihood": -272.10,
            "AIC": 552.20,
            "BIC": 569.34,
            "n_trials": 536,
            "prediction_accuracy": 0.783,
            "cross_validation": { ... }
          }
        }
      }
    ]
  }
}
```
Key paths:

- `subject_id` → `processing.data_processes[0].output_parameters.subject_id`
- `session_date` → `processing.data_processes[0].output_parameters.session_date`
- `status` → `processing.data_processes[0].output_parameters.additional_info`
- S3 location → `location`
- Fitting results → `processing.data_processes[0].output_parameters.fitting_results.*`
- Agent alias → `processing.data_processes[0].output_parameters.fitting_results.fit_settings.agent_alias`
## Field Mapping Summary

| Field | Old Format | New Format |
|---|---|---|
| subject_id | `subject_id` | `processing.data_processes[0].output_parameters.subject_id` |
| session_date | `session_date` | `processing.data_processes[0].output_parameters.session_date` |
| status | `status` | `processing.data_processes[0].output_parameters.additional_info` |
| S3 location | `s3_location` | `location` |
| agent_alias | `analysis_results.fit_settings.agent_alias` | `processing.data_processes[0].output_parameters.fitting_results.fit_settings.agent_alias` |
| n_trials | `analysis_results.n_trials` | `processing.data_processes[0].output_parameters.fitting_results.n_trials` |
| AIC/BIC | `analysis_results.AIC`/`BIC` | `processing.data_processes[0].output_parameters.fitting_results.AIC`/`BIC` |
## Querying Both Formats

Use MongoDB projection aliasing to normalize fields:

```python
# Query that works for both formats
projection = {
    "_id": 1,
    # Old format fields
    "subject_id": 1,
    "session_date": 1,
    "status": 1,
    "s3_location": 1,
    # New format fields (aliased)
    "subject_id_new": "$processing.data_processes.output_parameters.subject_id",
    "session_date_new": "$processing.data_processes.output_parameters.session_date",
    "location": 1,
}
```
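Alternatively, records can be flattened after retrieval. A minimal sketch of such a helper (hypothetical, not part of any AIND package), following the key paths documented above:

```python
def normalize_record(rec: dict) -> dict:
    """Flatten an old- or new-format record into common fields."""
    if "processing" in rec:  # new format (AIND Analysis Framework)
        out = rec["processing"]["data_processes"][0]["output_parameters"]
        return {
            "_id": rec["_id"],
            "subject_id": out["subject_id"],
            "session_date": out["session_date"],
            "status": out["additional_info"],
            "s3_location": rec["location"],
            "results": out.get("fitting_results", {}),
        }
    # old format (prototype pipeline): fields at root level
    return {
        "_id": rec["_id"],
        "subject_id": rec["subject_id"],
        "session_date": rec["session_date"],
        "status": rec["status"],
        "s3_location": rec["s3_location"],
        "results": rec.get("analysis_results", {}),
    }
```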
The `aind-analysis-arch-result-access` package handles this automatically by querying both formats and merging results into a unified DataFrame.
## S3 Access

### Public Buckets

Analysis results are in public S3 buckets (no auth needed):

```python
import s3fs

# Anonymous access for public buckets
fs = s3fs.S3FileSystem(anon=True)
```
### Common Bucket Paths

```python
# Old pipeline bucket
S3_PATH_ANALYSIS_OLD = "s3://aind-dynamic-foraging-analysis-prod-o5171v"

# New pipeline bucket (AIND Analysis Framework)
S3_PATH_ANALYSIS_NEW = "s3://aind-analysis-prod-o5171v/dynamic-foraging-model-fitting"

# Bonsai processed data
S3_PATH_BONSAI_ROOT = "s3://aind-behavior-data/foraging_nwb_bonsai_processed"
```
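With the anonymous filesystem from above, bucket contents can be browsed directly. A sketch listing the assets under one record's prefix (the truncated hash is illustrative):

```python
# List assets under a single record's prefix
record_prefix = f"{S3_PATH_ANALYSIS_OLD}/fe96ff6c..."  # illustrative record hash
for path in fs.ls(record_prefix):
    print(path)
```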
### Asset URL Construction

Convert S3 path to HTTPS for web display:

```python
def s3_to_https(s3_path: str) -> str:
    """Convert s3://bucket/key to https://bucket.s3.amazonaws.com/key"""
    if s3_path.startswith("s3://"):
        s3_path = s3_path[5:]
    bucket = s3_path.split("/")[0]
    key = "/".join(s3_path.split("/")[1:])
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```
### Reading Files from S3

```python
import json
import pickle

# JSON files
with fs.open("s3://bucket/path/file.json") as f:
    data = json.load(f)

# Pickle files
with fs.open("s3://bucket/path/file.pkl", "rb") as f:
    df = pickle.load(f)

# Check existence before reading - assets may be missing for some records
if fs.exists("s3://bucket/path/file.png"):
    ...  # file exists; safe to read
```
### Batch Operations

For multiple S3 reads, use `ThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def fetch_file(s3_path):
    # Read raw bytes; swap in json.load / pickle.load as needed
    with fs.open(s3_path, "rb") as f:
        return f.read()

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(
        executor.map(fetch_file, paths),
        total=len(paths),
        desc="Fetching files",
    ))
```
## Using aind-analysis-arch-result-access

This package provides ready-made functions for specific analyses:

```python
# Install: pip install aind-analysis-arch-result-access
from aind_analysis_arch_result_access import get_mle_model_fitting

# Get foraging model fitting results
# IMPORTANT: At least one filter parameter is required!
df = get_mle_model_fitting(
    subject_id="778869",             # Filter by subject (recommended for fast loading)
    # session_date="2024-10-24",     # Or filter by date
    # agent_alias="QLearning...",    # Or filter by model type
    # from_custom_query={...},       # Or use custom MongoDB query
    if_include_metrics=True,
    if_include_latent_variables=False,  # Set False for faster loading
    if_download_figures=False,
)

# For querying by date range:
from datetime import datetime, timedelta

three_months_ago = (datetime.now() - timedelta(days=90)).strftime("%Y-%m-%d")
df_recent = get_mle_model_fitting(
    from_custom_query={"session_date": {"$gte": three_months_ago}},
    if_include_metrics=True,
    if_include_latent_variables=False,
)
```
Important notes:

- Function requires at least one of: `subject_id`, `session_date`, `agent_alias`, or `from_custom_query`
- The package queries both pipeline formats separately and merges results into a unified DataFrame
- `only_recent_version=True` (default) deduplicates by keeping the most recent analysis
- Loading all records can be slow; filter by subject_id or date range for prototyping
Key columns in returned DataFrame:

- `_id`: Record identifier
- `subject_id`, `session_date`: Session info
- `agent_alias`: Model type used
- `n_trials`: Number of trials
- `S3_location`: Path to result files (use for constructing asset URLs)
- `status`: "success" or "failed"
- `pipeline_source`: "aind analysis framework" or "han's analysis pipeline"
- Metrics: `log_likelihood`, `AIC`, `BIC`, `prediction_accuracy`, etc.
## Common Asset Types

Assets stored in S3 per record:

- `fitted_session.png` - Main result figure
- `docDB_record.json` - Full analysis results
- `original_results_*.json` - Raw output files
- Latent variables (q-values, RPE, etc.)
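As a sketch tying this together, the main figure URL for a record can be built from its S3 location and checked before display (assuming `fs` and `s3_to_https` from the sections above, and a `record` dict with an `s3_location` field):

```python
# Build a displayable figure URL for one record, skipping missing assets
figure_path = f"{record['s3_location']}/fitted_session.png"
figure_url = s3_to_https(figure_path) if fs.exists(figure_path) else None
```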
## Best Practices

- **Filter early**: Use MongoDB queries to reduce data before pandas operations
- **Batch S3 operations**: Use threading for multiple file reads
- **Cache results**: Consider caching DataFrames for repeated queries (a sketch follows this list)
- **Handle both formats**: Account for old and new pipeline structures
- **Check S3 existence**: Assets may not exist for all records
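For the caching point, one simple approach is a local pickle of the merged DataFrame (a sketch; the cache path is arbitrary):

```python
from pathlib import Path

import pandas as pd

CACHE = Path("mle_fitting_cache.pkl")  # arbitrary local cache path

if CACHE.exists():
    df = pd.read_pickle(CACHE)
else:
    df = get_mle_model_fitting(subject_id="778869", if_include_metrics=True)
    df.to_pickle(CACHE)
```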
