name: data-pipeline
description: |
  This skill should be used when the user asks to "configure data source", "data from database", "fetch data from API", "scrape web data", "generate training data with LLM", "regenerate data", "data pipeline", "where does data come from", or needs to set up reproducible data collection. Provides data source configuration, reproducibility tracking, and data regeneration capabilities.
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch

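Data Pipeline - Configuring the Data Pipeline

Configure reproducible data sources, track how the data was generated, and support later model iterations.

Core Idea

The key feature of v2 is data reproducibility:

Record the data source configuration (DB, API, scraping, LLM generation)
Regenerate the same data at any time
Trace data back to its sources across model iterations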

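Supported Data Sources

Source type	Description	Use case
`database`	SQL database query	Existing annotated data
`api`	REST/GraphQL API	External data services
`web_scrape`	Web scraping	Collecting public data
`llm_generated`	LLM generation	Data augmentation, synthetic data
`file_import`	File import	Existing CSV/JSON/JSONL

data_source.yaml Specification

Full Example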

# data_source.yaml
version: "1.0"
created: 2026-01-06T10:00:00
updated: 2026-01-06T14:30:00

# List of data sources (processed in order)
sources:
  # Source 1: database
  - name: production_annotations
    type: database
    enabled: true
    config:
      driver: postgresql
      host: db.example.com
      port: 5432
      database: nlp_annotations
      # Use environment variables for sensitive values
      username: ${DB_USER}
      password: ${DB_PASSWORD}
    query: |
      SELECT
        text,
        entity,
        sentiment as label,
        annotator,
        created_at
      FROM annotations
      WHERE status = 'approved'
        AND task_type = 'entity_sentiment'
      ORDER BY created_at
    output:
      format: jsonl
      path: data/raw/db_annotations.jsonl
      fields:
        - text
        - entity
        - label

  # Source 2: API
  - name: sentiment_api
    type: api
    enabled: true
    config:
      base_url: https://api.dataservice.com/v1
      auth:
        type: bearer
        token: ${API_TOKEN}
    requests:
      - endpoint: /annotations
        method: GET
        params:
          task: entity_sentiment
          status: approved
          limit: 1000
        pagination:
          type: cursor
          cursor_field: next_cursor
    output:
      format: jsonl
      path: data/raw/api_data.jsonl
      transform: |
        {
          "text": item.content,
          "entity": item.target_entity,
          "label": item.sentiment_label
        }

  # Source 3: web scraping
  - name: finance_news
    type: web_scrape
    enabled: true
    config:
      method: playwright  # or requests, scrapy
      urls:
        - https://finance.example.com/news
      selectors:
        title: h1.article-title
        content: div.article-content
        date: span.publish-date
      keywords:
        - 台積電  # TSMC
        - 聯電  # UMC
        - 金融  # finance
      rate_limit: 1  # requests per second
    output:
      format: jsonl
      path: data/raw/scraped_news.jsonl

  # Source 4: LLM generation
  - name: synthetic_neutral
    type: llm_generated
    enabled: true
    config:
      model: gpt-4o
      temperature: 0.7
      api_key: ${OPENAI_API_KEY}
    generation:
      prompt_template: |
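        Generate {count} training examples for sentiment analysis of financial news.

        Requirements:
        - Sentiment label: {label}
        - Domain: finance, stocks, investing
        - Include a specific company or stock name as the entity
        - Text length: 50-200 characters

        Output JSON format:
        {"text": "...", "entity": "...", "label": "{label}"}

        One JSON object per line, with no other commentary.
      variations:
        - label: neutral
          count: 200
        - label: positive
          count: 100
        - label: negative
          count: 100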
    output:
      format: jsonl
      path: data/raw/generated_data.jsonl
    validation:
      # Manual review after generation
      require_review: true
      review_sample: 0.1  # review a 10% sample

  # Source 5: file import
  - name: existing_dataset
    type: file_import
    enabled: true
    config:
      source_path: /path/to/existing/data.csv
      format: csv
      encoding: utf-8
      delimiter: ","
    mapping:
      text: content_column
      entity: target_column
      label: sentiment_column
    output:
      format: jsonl
      path: data/raw/imported_data.jsonl

# Data merge settings
merge:
  enabled: true
  output_path: data/raw/merged.jsonl
  deduplication:
    enabled: true
    key: text  # deduplicate on the text field
  shuffle: true
  random_seed: 42

# Data split settings
split:
  enabled: true
  ratios:
    train: 0.7
    valid: 0.15
    test: 0.15
  stratify_by: label  # stratified sampling by label
  random_seed: 42
  output:
    train: data/train.jsonl
    valid: data/valid.jsonl
    test: data/test.jsonl

# Regeneration settings
regeneration:
  script: scripts/01_regenerate_data.py
  last_run: 2026-01-06T10:30:00
  triggers:
    - source_config_changed
    - manual
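
The `merge` and `split` settings above could be implemented roughly as follows. This is a minimal sketch, assuming the records from all sources are already loaded as lists of dicts; `merge_records` and `stratified_split` are hypothetical names and are not taken from the skill's `data_pipeline` module.

```python
import random
from collections import defaultdict


def merge_records(batches, key="text", seed=42):
    """Concatenate batches, drop duplicates by `key`, then shuffle with a fixed seed."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            value = record[key]
            if value in seen:
                continue  # keep only the first record for each key value
            seen.add(value)
            merged.append(record)
    random.Random(seed).shuffle(merged)
    return merged


def stratified_split(records, ratios, stratify_by="label", seed=42):
    """Split records into named subsets while keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record[stratify_by]].append(record)

    names = list(ratios)
    splits = {name: [] for name in names}
    for label in sorted(by_label):  # sorted so the result does not depend on input order of labels
        group = by_label[label]
        rng.shuffle(group)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                end = len(group)  # the last subset takes whatever remains
            else:
                end = start + round(len(group) * ratios[name])
            splits[name].extend(group[start:end])
            start = end
    return splits
```

Using a private `random.Random(seed)` instance rather than the global generator keeps the result independent of any other code that touches `random`, which matters when the goal is to reproduce the same split later.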

Source Types in Detail

Database

- name: db_source
  type: database
  config:
    driver: postgresql  # postgresql, mysql, sqlite
    host: localhost
    port: 5432
    database: mydb
    username: ${DB_USER}
    password: ${DB_PASSWORD}
  query: |
    SELECT * FROM table WHERE condition
  output:
    format: jsonl
    path: data/raw/db_data.jsonl

Supported databases:

PostgreSQL
MySQL
SQLite
SQL Server (requires an additional driver)
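
As an illustration of what a `database` source does, the sketch below runs a query and writes the selected fields as JSONL. It uses SQLite from the standard library so it runs without a server; the function name `export_query_to_jsonl` is hypothetical, and a PostgreSQL or MySQL source would need its own driver and connection setup.

```python
import json
import sqlite3


def export_query_to_jsonl(conn, query, fields, path):
    """Run `query` and write the listed `fields` of each row as one JSON object per line."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for row in conn.execute(query):
            record = {field: row[field] for field in fields}
            # ensure_ascii=False keeps non-ASCII text readable in the output file
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Adding an `ORDER BY` clause to the query, as the full example does, keeps the output order stable between runs.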

API Integration

- name: api_source
  type: api
  config:
    base_url: https://api.example.com
    auth:
      type: bearer  # bearer, basic, api_key
      token: ${API_TOKEN}
  requests:
    - endpoint: /data
      method: GET
      params:
        limit: 100
      pagination:
        type: offset  # offset, cursor, page
        offset_param: offset
        limit_param: limit

Authentication methods:

Bearer Token
Basic Auth
API Key (header or query param)
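
Cursor pagination, as used in the full example, can be sketched as a loop that follows the cursor until the API stops returning one. This is an assumption-laden sketch: the response key `items` and the request parameter `cursor` are hypothetical, and `fetch_page` stands in for whatever HTTP client performs the request.

```python
def fetch_all_pages(fetch_page, params, cursor_field="next_cursor", max_pages=1000):
    """Collect items from a cursor-paginated endpoint.

    `fetch_page` is a caller-supplied function that takes a params dict and
    returns the decoded JSON response as a dict.
    """
    items = []
    cursor = None
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        page_params = dict(params)
        if cursor is not None:
            page_params["cursor"] = cursor  # assumed name of the request parameter
        page = fetch_page(page_params)
        items.extend(page.get("items", []))  # assumed name of the result list
        cursor = page.get(cursor_field)
        if not cursor:
            break  # no cursor means this was the last page
    return items
```

Passing the fetch function in as an argument keeps the pagination logic testable without network access.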

Web Scrape

- name: scrape_source
  type: web_scrape
  config:
    method: playwright  # playwright, requests
    urls:
      - https://example.com/page1
      - https://example.com/page2
    selectors:
      title: h1
      content: div.content
    rate_limit: 1

Notes:

Respect robots.txt
Set a reasonable rate limit
Use playwright for dynamically loaded content
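
The first two notes can be sketched with the standard library alone. The helper below checks a URL against the text of a robots.txt file, and `RateLimiter` spaces requests according to `rate_limit` (requests per second). Both names are hypothetical and fetching the robots.txt file itself is left to the caller.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Block so that consecutive calls to wait() are at least 1/rate seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `limiter.wait()` immediately before each request and skip any URL for which `allowed_by_robots` returns False.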

LLM Generated

- name: llm_source
  type: llm_generated
  config:
    model: gpt-4o  # gpt-4o, claude-3, etc.
    temperature: 0.7
  generation:
    prompt_template: |
      Generate training data...
    variations:
      - label: positive
        count: 100
  validation:
    require_review: true

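Best practices:

Manually review a sample after generation
Specify explicit format requirements in the prompt
Generate in batches to avoid duplicates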
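
Two details of LLM generation are easy to get wrong: filling the template and trusting the output. Because the prompt template contains a literal JSON example, Python's `str.format` would raise on its braces, so the sketch below substitutes only the named placeholders. It then keeps only output lines that parse as JSON and carry the requested label. Both function names are hypothetical, and the call to the model is deliberately left out.

```python
import json


def render_prompt(template, **values):
    """Fill {name} placeholders without touching other braces such as JSON examples."""
    prompt = template
    for name, value in values.items():
        prompt = prompt.replace("{" + name + "}", str(value))
    return prompt


def parse_generated_lines(raw_output, expected_label, required=("text", "entity", "label")):
    """Keep only lines that are JSON objects with the required fields and the expected label."""
    records = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the model added commentary or produced broken JSON
        if not isinstance(record, dict):
            continue
        if any(field not in record for field in required):
            continue
        if record["label"] != expected_label:
            continue  # the model drifted from the requested label
        records.append(record)
    return records
```

Filtering at parse time does not replace the sampled manual review; it only removes output that is structurally unusable.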

File Import

- name: file_source
  type: file_import
  config:
    source_path: /path/to/data.csv
    format: csv  # csv, json, jsonl
    encoding: utf-8
  mapping:
    text: source_column
    label: target_column
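
The `mapping` block reads as "output field: source column". A minimal sketch of a CSV import under that reading, using only the standard library (the function name `import_csv` is hypothetical):

```python
import csv
import json


def import_csv(source_path, output_path, mapping, encoding="utf-8", delimiter=","):
    """Copy mapped CSV columns into a JSONL file and return the number of rows written."""
    count = 0
    with open(source_path, newline="", encoding=encoding) as src, \
            open(output_path, "w", encoding="utf-8") as out:
        for row in csv.DictReader(src, delimiter=delimiter):
            # mapping keys are output field names, values are CSV column names
            record = {target: row[source] for target, source in mapping.items()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A missing source column raises `KeyError` on the first row, which surfaces a wrong mapping immediately instead of writing empty fields.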

Data Generation Script

01_regenerate_data.py

The auto-generated regeneration script:

#!/usr/bin/env python
"""
Regenerate all data according to data_source.yaml.
Auto-generated; do not edit by hand.
"""

import yaml
from pathlib import Path
from data_pipeline import (
    DatabaseSource,
    APISource,
    WebScrapeSource,
    LLMGeneratedSource,
    FileImportSource,
    DataMerger,
    DataSplitter
)

def main():
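    # Load the configuration (context manager closes the file; explicit UTF-8 for non-ASCII values)
    with open('data_source.yaml', encoding='utf-8') as f:
        config = yaml.safe_load(f)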

    # Process each data source
    for source in config['sources']:
        if not source.get('enabled', True):
            continue

        if source['type'] == 'database':
            DatabaseSource(source).fetch()
        elif source['type'] == 'api':
            APISource(source).fetch()
        elif source['type'] == 'web_scrape':
            WebScrapeSource(source).fetch()
        elif source['type'] == 'llm_generated':
            LLMGeneratedSource(source).generate()
        elif source['type'] == 'file_import':
            FileImportSource(source).import_data()

    # Merge the data
    if config.get('merge', {}).get('enabled'):
        DataMerger(config['merge']).merge()

    # Split the data
    if config.get('split', {}).get('enabled'):
        DataSplitter(config['split']).split()

    print("Data regeneration complete!")

if __name__ == '__main__':
    main()

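Data Validation

Generated data is validated automatically:

# Validation checks
validations:
  - format_check:      # JSON is well-formed
  - required_fields:   # required fields are present
  - label_values:      # label values are within the allowed set
  - text_length:       # text length is reasonable
  - duplicate_check:   # duplicate detection
  - distribution:      # class distribution statistics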
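
The checks listed above might be combined into a single pass over the JSONL lines, as in the sketch below. The function name, the issue names, the default label set, and the 10-character minimum are assumptions for illustration; the sketch also assumes `text` is a string.

```python
import json
from collections import Counter


def validate_records(lines, required=("text", "entity", "label"),
                     allowed_labels=("positive", "negative", "neutral"),
                     min_length=10):
    """Check JSONL lines and return counts of valid records, issues, and labels."""
    issues = Counter()
    distribution = Counter()
    seen_texts = set()
    total = 0
    for line in lines:
        total += 1
        try:
            record = json.loads(line)  # format_check
        except json.JSONDecodeError:
            issues["format_error"] += 1
            continue
        if not isinstance(record, dict) or any(f not in record for f in required):
            issues["missing_field"] += 1  # required_fields
        elif record["label"] not in allowed_labels:
            issues["invalid_label"] += 1  # label_values
        elif len(record["text"]) < min_length:
            issues["text_too_short"] += 1  # text_length
        elif record["text"] in seen_texts:
            issues["duplicate"] += 1  # duplicate_check
        else:
            seen_texts.add(record["text"])
            distribution[record["label"]] += 1  # distribution
    valid = sum(distribution.values())
    return {"total": total, "valid": valid,
            "issues": dict(issues), "distribution": dict(distribution)}
```

Each record is counted under the first check it fails, so the issue counts and the valid count always add up to the total.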

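Validation Report

Data Validation Report
======================

Total records: 1000
Valid records: 987 (98.7%)
Problem records: 13 (1.3%)

Issue breakdown:
- Format errors: 3
- Missing required fields: 5
- Text too short (<10 characters): 5

Class distribution:
- Positive: 380 (38.5%)
- Negative: 335 (33.9%)
- Neutral: 272 (27.6%)

Recommendations:
- The neutral class is underrepresented; consider adding synthetic data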

Best Practices

Handling Sensitive Information

# Use environment variables
config:
  password: ${DB_PASSWORD}
  api_key: ${API_KEY}

# .env file (do not commit)
DB_PASSWORD=secret
API_KEY=sk-xxx
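
YAML itself does not expand `${NAME}`; a loader such as `yaml.safe_load` returns the placeholder as a plain string. One way to resolve the placeholders after loading is sketched below. `expand_env` is a hypothetical helper, and loading the `.env` file into the environment is assumed to happen elsewhere.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} placeholders in a loaded config with environment values."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                # fail loudly rather than connect with an empty password
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(substitute, value)
    return value  # numbers, booleans, None pass through unchanged
```

Expanding at load time keeps secrets out of `data_source.yaml`, so the file can be committed and hashed safely.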

Version Tracking

Record the following each time data is generated:

# data/data_version.yaml
version: "2026-01-06-001"
source_config_hash: abc123
total_records: 1000
class_distribution:
  positive: 380
  negative: 335
  neutral: 285
generated_at: 2026-01-06T10:30:00
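
A value like `source_config_hash` could be derived by hashing a canonical serialization of the loaded config. The sketch below is one possible approach, not necessarily how the skill computes it; `config_hash` is a hypothetical name.

```python
import hashlib
import json


def config_hash(config, length=12):
    """Return a short, stable SHA-256 digest of a config dict."""
    # sort_keys makes the digest independent of key order;
    # default=str covers values such as datetimes produced by the YAML loader
    canonical = json.dumps(config, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]
```

Fields that change on every run, such as `updated` or `regeneration.last_run`, would alter the digest, so it may be preferable to hash only the `sources`, `merge`, and `split` sections.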

Incremental Updates

Updating only specific sources is supported:

# Regenerate only the LLM data
python scripts/01_regenerate_data.py --source synthetic_neutral

# Regenerate everything
python scripts/01_regenerate_data.py --all
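
The regeneration script shown earlier does not parse command-line arguments, so the `--source` and `--all` flags need handling along the following lines. This is a sketch under the assumption that `--source` names an entry in `sources`; `parse_args` and `select_sources` are hypothetical helpers.

```python
import argparse


def parse_args(argv=None):
    """Parse --source NAME (repeatable) or --all; with neither, callers may treat it as --all."""
    parser = argparse.ArgumentParser(description="Regenerate data from data_source.yaml")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--source", action="append", default=None,
                       help="name of a source to regenerate (may be repeated)")
    group.add_argument("--all", action="store_true",
                       help="regenerate every enabled source")
    return parser.parse_args(argv)


def select_sources(sources, names=None):
    """Return enabled sources, restricted to `names` when given."""
    enabled = [s for s in sources if s.get("enabled", True)]
    if not names:
        return enabled
    known = {s["name"] for s in sources}
    unknown = set(names) - known
    if unknown:
        raise ValueError(f"unknown source(s): {sorted(unknown)}")
    return [s for s in enabled if s["name"] in names]
```

In `main()`, the loop would then iterate over `select_sources(config['sources'], args.source)` instead of all sources. Merge and split should still run afterward so the final files reflect the refreshed source.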
