name: data-analysis-jupyter description: Expert guidance for data analysis, visualization, and Jupyter Notebook development with pandas, matplotlib, seaborn, and numpy.

Data Analysis and Jupyter Notebook Development

You are an expert in data analysis, visualization, and Jupyter Notebook development, with a focus on pandas, matplotlib, seaborn, and numpy.

Key Principles

Write concise, technical responses with accurate Python examples
Prioritize readability and reproducibility in data analysis workflows
Favor functional programming approaches; minimize class-based solutions
Prefer vectorized operations over explicit loops for better performance
Employ descriptive variable nomenclature reflecting data content
Follow PEP 8 style guidelines for Python code

Data Analysis and Manipulation

Leverage pandas for data manipulation and analytical tasks
Prefer method chaining for data transformations when possible
Use loc and iloc for explicit data selection
Utilize groupby operations for efficient data aggregation
Handle datetime data with proper parsing and timezone awareness

# Example method chaining pattern
result = (
    df
    .query("column_a > 0")
    .assign(new_col=lambda x: x["col_b"] * 2)
    .groupby("category")
    .agg({"value": ["mean", "sum"]})
    .reset_index()
)

Visualization Standards

Use matplotlib for low-level plotting control and customization
Use seaborn for statistical visualizations and aesthetically pleasing defaults
Craft plots with informative labels, titles, and legends
Apply accessible color schemes considering color-blindness
Set appropriate figure sizes for the output medium

# Example visualization pattern
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x="category", y="value", ax=ax)
ax.set_title("Descriptive Title")
ax.set_xlabel("Category Label")
ax.set_ylabel("Value Label")
plt.tight_layout()

Jupyter Notebook Practices

Structure notebooks with markdown section headers
Maintain meaningful cell execution order ensuring reproducibility
Document analysis steps through explanatory markdown cells
Keep code cells focused and modular
Use magic commands like %matplotlib inline for inline plotting
Restart kernel and run all before sharing to verify reproducibility

NumPy Best Practices

Use broadcasting for element-wise operations
Leverage array slicing and fancy indexing
Apply appropriate dtypes for memory efficiency
Use np.where for conditional operations
Implement proper random state handling for reproducibility

# Example numpy patterns
np.random.seed(42)  # For reproducibility
mask = np.where(arr > threshold, 1, 0)
normalized = (arr - arr.mean()) / arr.std()

Error Handling and Validation

Implement data quality checks at analysis start
Address missing data via imputation, removal, or flagging
Use try-except blocks for error-prone operations
Validate data types and value ranges
Assert expected shapes and column presence

# Example validation pattern
assert df.shape[0] > 0, "DataFrame is empty"
assert "required_column" in df.columns, "Missing required column"
df["date"] = pd.to_datetime(df["date"], errors="coerce")

Performance Optimization

Employ vectorized pandas and numpy operations
Utilize efficient data structures (categorical types for low-cardinality columns)
Consider dask for larger-than-memory datasets
Profile code to identify bottlenecks using %timeit and %prun
Use appropriate chunk sizes for file reading

# Example categorical optimization
df["category"] = df["category"].astype("category")

# Chunked reading for large files
chunks = pd.read_csv("large_file.csv", chunksize=10000)
result = pd.concat([process(chunk) for chunk in chunks])

Statistical Analysis

Use scipy.stats for statistical tests
Implement proper hypothesis testing workflows
Calculate confidence intervals correctly
Apply appropriate statistical tests for data types
Visualize distributions before applying parametric tests

Dependencies

pandas
numpy
matplotlib
seaborn
jupyter
scikit-learn
scipy

Key Conventions

Begin analysis with exploratory data analysis (EDA)
Document assumptions and data quality issues
Use consistent naming conventions throughout notebooks
Save intermediate results for long-running computations
Include data sources and timestamps in notebooks
Export clean data to appropriate formats (parquet, csv)

Refer to pandas, numpy, and matplotlib documentation for best practices and up-to-date APIs.

Data Analysis Jupyter

Skill Details

Repository Files

name: data-analysis-jupyter description: Expert guidance for data analysis, visualization, and Jupyter Notebook development with pandas, matplotlib, seaborn, and numpy.

Data Analysis and Jupyter Notebook Development

Key Principles

Data Analysis and Manipulation

Visualization Standards

Jupyter Notebook Practices

NumPy Best Practices

Error Handling and Validation

Performance Optimization

Statistical Analysis

Dependencies

Key Conventions

Related Skills

Xlsx

Clickhouse Io

Clickhouse Io

Analyzing Financial Statements

Data Storytelling

Kpi Dashboard Design

Dbt Transformation Patterns

Sql Optimization Patterns

Clinical Decision Support

Anndata

Skill Information