Count Dataset Tokens
by letta-ai
This skill provides guidance for counting tokens in datasets using specific tokenizers. It should be used when tasks involve tokenizing dataset content, filtering data by domain or category, and aggregating token counts. Common triggers include requests to count tokens in HuggingFace datasets, filter datasets by specific fields, or use particular tokenizers (e.g., Qwen, DeepSeek, GPT).
Overview
This skill guides the process of counting tokens in datasets, typically from HuggingFace Hub or similar sources. These tasks involve loading datasets, filtering by specific criteria (domains, categories, splits), tokenizing text fields, and computing aggregate statistics.
Workflow
Phase 1: Understand the Dataset Structure
Before writing any code, thoroughly examine the dataset documentation and structure:
- Read the README/dataset card completely - Look for:
- Available splits (train, test, validation, etc.)
- Column/field definitions and their exact names
- Domain or category definitions (exact values used)
- Data types for each field
- Any metadata subsets available
- Explore the actual data - Write exploratory code to:
- List all available columns
- Check unique values for categorical fields (especially filter fields like "domain")
- Verify field names match documentation
- Examine sample records to understand data format
- Document findings before proceeding - Note:
- Exact field names to use
- Exact categorical values for filtering
- Any discrepancies between documentation and actual data
Phase 2: Clarify Task Requirements
When task wording is ambiguous, especially regarding filter criteria:
- Use exact terminology from the dataset - If the task asks for "science" tokens but the dataset has "biology", "chemistry", "physics" domains:
- First check if a "science" domain actually exists
- Do NOT assume "science" means combining related domains unless explicitly documented
- Report when exact matches are not found
- Handle missing categories correctly:
- If a requested filter value returns zero results, report this finding
- Do not reinterpret or expand the filter criteria without explicit justification
- Check if the requested value might be a typo or variant spelling
- Distinguish between:
- Exact domain/category matches (e.g., domain == "science")
- Aggregated categories (e.g., domain in ["biology", "chemistry", "physics"])
- The task wording should guide which approach to use
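A plain-Python sketch of this decision logic; the records and the requested "science" domain are hypothetical:

```python
# Hypothetical records and filter request, illustrating exact-match-first logic.
records = [
    {"domain": "biology", "text": "..."},
    {"domain": "chemistry", "text": "..."},
]
requested = "science"

# Try the exact match first
exact = [r for r in records if r["domain"] == requested]
available = sorted({r["domain"] for r in records})

if not exact:
    # Report the discrepancy instead of silently reinterpreting the request
    print(f"No records with domain == {requested!r}; available domains: {available}")
    # Aggregate related domains ONLY if the task explicitly asks for it, e.g.:
    # science = [r for r in records if r["domain"] in {"biology", "chemistry", "physics"}]
```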
Phase 3: Implement Token Counting
- Load the correct tokenizer:
- Use the exact tokenizer specified in the task
- Verify the tokenizer loads correctly before processing data
- Handle any authentication requirements for gated models
- Write robust counting code:
# Example structure
from transformers import AutoTokenizer
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tokenizer-name")

# Load dataset (use streaming for large datasets)
dataset = load_dataset("dataset-name", split="train")

# Filter by exact criteria
filtered = dataset.filter(lambda x: x["domain"] == "exact_value")

# Count tokens with null/empty handling
total_tokens = 0
for item in filtered:
    text = item.get("text_field")
    if text:  # Handle None and empty strings
        tokens = tokenizer.encode(text)
        total_tokens += len(tokens)
- Handle edge cases:
- None/null values in text fields
- Empty strings
- Special characters or encoding issues
- Very long texts that may need truncation handling
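These edge cases can be folded into one small helper. The whitespace split below is a stand-in for a real tokenizer's `encode()`, so the handling logic runs without downloading a model:

```python
def count_tokens(texts, encode=lambda s: s.split()):
    """Sum token counts, skipping None, non-string, and empty/whitespace values."""
    total = skipped = 0
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            skipped += 1  # None, non-string, empty, or whitespace-only
            continue
        total += len(encode(text))
    return total, skipped

# With a real tokenizer, pass encode=tokenizer.encode instead.
total, skipped = count_tokens(["two tokens", "", None, "   ", "one"])
```

Reporting the skipped count alongside the total makes silent data loss visible.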
Phase 4: Verify Results
- Validate filter results:
- Check the count of filtered records
- Verify sample records match expected criteria
- Confirm no unexpected filtering occurred
- Sanity check token counts:
- Compare against expected magnitudes
- Verify counts are non-zero when data exists
- Check for reasonable tokens-per-record ratios
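A minimal sanity-check sketch; the counts and the ratio bounds are illustrative assumptions that should be tuned to the dataset at hand:

```python
# Illustrative numbers; substitute the actual filter and count results.
record_count = 1_200
total_tokens = 540_000

assert record_count > 0, "Filter matched no records"
assert total_tokens > 0, "Data exists but token count is zero"

tokens_per_record = total_tokens / record_count
# Bounds are dataset-dependent assumptions, not universal thresholds.
if not (10 <= tokens_per_record <= 100_000):
    print(f"WARNING: unusual tokens/record ratio: {tokens_per_record:.1f}")
```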
- Document assumptions:
- Note any interpretation decisions made
- Flag uncertainty in the results if criteria were ambiguous
Common Pitfalls
1. Misinterpreting Filter Criteria
Problem: Assuming "science" means biology + chemistry + physics without documentation support.
Solution: Always use exact filter values from the dataset. When exact matches fail, report the discrepancy rather than reinterpreting.
2. Insufficient Dataset Exploration
Problem: Writing filtering code before understanding available field values.
Solution: Always run exploratory analysis first:
# Check unique values before filtering
print(dataset.unique("domain")) # or equivalent
3. Incomplete Documentation Review
Problem: Missing domain definitions or field specifications in the README.
Solution: Read the entire dataset card, including any linked documentation. Look for schema definitions, data dictionaries, or field value enumerations.
4. Silent Filter Failures
Problem: Filter returns zero results but code continues without warning.
Solution: Always check and report filter result counts:
filtered = dataset.filter(condition)
print(f"Filtered count: {len(filtered)}")
if len(filtered) == 0:
    print("WARNING: No records match filter criteria")
5. Ignoring Null/Empty Values
Problem: Tokenizing None or empty strings causes errors or incorrect counts.
Solution: Always validate text content before tokenizing:
if text and isinstance(text, str) and len(text.strip()) > 0:
    tokens = tokenizer.encode(text)
6. Overconfident Assumptions
Problem: Proceeding with reinterpreted criteria without expressing uncertainty.
Solution: When making interpretation decisions, document them clearly and flag the uncertainty in results.
Verification Checklist
Before finalizing results, verify:
- Dataset README/card was fully reviewed
- Field names match exactly what's in the data
- Filter values are exact matches from the dataset
- Exploratory analysis confirmed available categories
- Filter produced expected number of records
- Null/empty handling is implemented
- Token counts pass sanity checks
- Any assumptions or interpretations are documented
- Results file is correctly formatted and written
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Team Composition Analysis
This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.
Startup Financial Modeling
This skill should be used when the user asks to "create financial projections", "build a financial model", "forecast revenue", "calculate burn rate", "estimate runway", "model cash flow", or requests 3-5 year financial planning for a startup.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Startup Metrics Framework
This skill should be used when the user asks about "key startup metrics", "SaaS metrics", "CAC and LTV", "unit economics", "burn multiple", "rule of 40", "marketplace metrics", or requests guidance on tracking and optimizing business performance metrics.
