ChatGPT for Data Analysis — Step-by-Step

Sanjeev Sharma
8 min read


Introduction

ChatGPT has become an invaluable tool for data analysts, scientists, and business intelligence professionals. It can help clean data, write analysis code, generate visualizations, and draw insights from datasets. This guide walks through practical data analysis workflows using ChatGPT, from exploration to presentation.

Setting Up for Data Analysis

ChatGPT excels at helping with Python data analysis libraries (pandas, numpy, matplotlib, seaborn). Provide clear context about your data:

Data Analysis Setup Prompt:

Dataset: Customer purchase history
Format: CSV with 50K rows
Columns: customer_id, purchase_date, amount, category, region

Goal: Identify high-value customer segments

Tools: Python, pandas, matplotlib, seaborn

I'll share the data structure and ask for analysis code.

Data Exploration with ChatGPT

Start analysis by asking ChatGPT to explore your data:

Please write Python code that:
1. Loads the CSV file
2. Displays basic statistics (shape, dtypes, missing values)
3. Shows sample rows
4. Identifies outliers in the amount column
5. Checks date range

Dataset: customer_purchases.csv
Assume it's in the current directory.

ChatGPT generates:

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('customer_purchases.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst rows:\n{df.head()}")

# Statistics
print(f"\nBasic statistics:\n{df.describe()}")

# Outlier detection (using IQR)
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['amount'] < Q1 - 1.5*IQR) | (df['amount'] > Q3 + 1.5*IQR)]
print(f"\nOutliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")

# Date range
print(f"\nDate range: {df['purchase_date'].min()} to {df['purchase_date'].max()}")

Data Cleaning Prompts

For specific cleaning tasks, ask ChatGPT directly:

Data Cleaning Request:

Data issues:
- Some dates are in MM/DD/YYYY, others in YYYY-MM-DD
- Amount column has '$' and commas
- customer_id has leading zeros sometimes
- Some regions have extra spaces/capitalization
- Missing values in category column (about 2%)

Requested: Write code to clean all these issues

ChatGPT provides:

# Convert dates to a consistent datetime format
# (format='mixed' requires pandas >= 2.0; older versions used infer_datetime_format)
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='mixed')

# Clean amount (remove $ and commas, convert to float)
# regex=False so '$' is treated literally, not as a regex anchor
df['amount'] = df['amount'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df['amount'] = pd.to_numeric(df['amount'])

# Standardize customer_id (remove leading zeros)
df['customer_id'] = df['customer_id'].astype(int)

# Clean region (trim spaces, standardize case)
df['region'] = df['region'].str.strip().str.title()

# Handle missing categories (forward fill, then default for any remaining)
df['category'] = df['category'].ffill()
df['category'] = df['category'].fillna('Unknown')

# Remove duplicates if any
df = df.drop_duplicates()

Exploratory Data Analysis (EDA)

Ask ChatGPT for comprehensive EDA:

Exploratory Data Analysis:

Dataset: Customer purchases (cleaned)
Columns: customer_id, purchase_date, amount, category, region

Create a comprehensive EDA that includes:
1. Distribution of purchases by category
2. Total revenue by region
3. Customer purchase frequency
4. Average order value by region
5. Trends over time
6. Correlation analysis if numeric columns exist

Provide code that generates summary statistics and visualizations.
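A typical response sketches grouped summaries along these lines. This is a minimal illustration using a small hypothetical DataFrame in place of the real `df` (the column names come from the prompt above):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned purchase data
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'amount': [120.0, 80.0, 200.0, 50.0, 75.0, 60.0],
    'category': ['Books', 'Toys', 'Books', 'Toys', 'Books', 'Toys'],
    'region': ['East', 'East', 'West', 'West', 'West', 'East'],
})

# 1. Distribution of purchases by category
print(df['category'].value_counts())

# 2. Total revenue by region
print(df.groupby('region')['amount'].sum())

# 3. Purchase frequency per customer
print(df.groupby('customer_id').size())

# 4. Average order value by region
print(df.groupby('region')['amount'].mean())
```

On the real 50K-row dataset, the same `groupby` calls apply unchanged; only the loading step differs.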

Generating Visualizations

For specific insights, ask ChatGPT to create visualizations:

Data Visualization Request:

Create 4 visualizations for customer purchase analysis:
1. Bar chart: Revenue by region
2. Line chart: Monthly revenue trend
3. Histogram: Distribution of purchase amounts
4. Scatter plot: Purchase frequency vs. average order value by customer

Use matplotlib and seaborn. Make them publication-ready
(good labels, colors, titles, font sizes).

ChatGPT generates:

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)

# 1. Revenue by region
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

revenue_by_region = df.groupby('region')['amount'].sum().sort_values()
revenue_by_region.plot(kind='barh', ax=ax1, color='steelblue')
ax1.set_title('Revenue by Region', fontsize=12, fontweight='bold')
ax1.set_xlabel('Total Revenue ($)')

# 2. Monthly revenue trend
df['month'] = df['purchase_date'].dt.to_period('M')
monthly_revenue = df.groupby('month')['amount'].sum()
ax2.plot(monthly_revenue.index.astype(str), monthly_revenue.values, marker='o')
ax2.set_title('Monthly Revenue Trend', fontsize=12, fontweight='bold')
ax2.tick_params(axis='x', rotation=45)

# 3. Distribution of purchase amounts
ax3.hist(df['amount'], bins=50, color='coral', edgecolor='black')
ax3.set_title('Distribution of Purchase Amounts', fontsize=12, fontweight='bold')
ax3.set_xlabel('Amount ($)')
ax3.set_ylabel('Frequency')

# 4. Scatter: Frequency vs Average Order Value
# Named aggregation avoids re-using the groupby key column,
# which raises an error in recent pandas versions
customer_stats = df.groupby('customer_id').agg(
    purchase_count=('amount', 'count'),
    avg_order_value=('amount', 'mean'),
)

ax4.scatter(customer_stats['purchase_count'], customer_stats['avg_order_value'], alpha=0.5)
ax4.set_title('Purchase Frequency vs Average Order Value', fontsize=12, fontweight='bold')
ax4.set_xlabel('Number of Purchases')
ax4.set_ylabel('Average Order Value ($)')

plt.tight_layout()
plt.savefig('analysis_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()

Statistical Analysis

Ask ChatGPT for statistical tests:

Statistical Analysis:

Questions:
1. Is there a significant difference in average purchase amount between regions?
2. What's the correlation between purchase frequency and average order value?
3. Are certain categories more popular in specific regions?

Dataset: Customer purchases

Provide code to:
- Perform appropriate statistical tests (ANOVA, correlation, chi-square)
- Show p-values and effect sizes
- Interpret results
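A response typically reaches for scipy.stats. Here is a hedged sketch of the three tests on synthetic data (the region/category numbers below are made up purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic purchase amounts for three regions (stand-ins for real data)
east = rng.normal(100, 20, 200)
west = rng.normal(110, 20, 200)
north = rng.normal(100, 20, 200)

# 1. One-way ANOVA: does mean purchase amount differ across regions?
f_stat, p_anova = stats.f_oneway(east, west, north)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# 2. Pearson correlation: purchase frequency vs. average order value
freq = rng.integers(1, 20, 200)
aov = 50 + 2 * freq + rng.normal(0, 5, 200)
r, p_corr = stats.pearsonr(freq, aov)
print(f"Correlation: r={r:.2f}, p={p_corr:.4f}")

# 3. Chi-square: is category popularity independent of region?
contingency = np.array([[30, 10], [20, 40]])  # rows: regions, cols: categories
chi2, p_chi, dof, _ = stats.chi2_contingency(contingency)
print(f"Chi-square: chi2={chi2:.2f}, dof={dof}, p={p_chi:.4f}")
```

Note that ANOVA assumes roughly normal groups with similar variances; a careful follow-up prompt should ask ChatGPT to check those assumptions before trusting the p-values.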

Segmentation and Clustering

For customer segmentation:

Customer Segmentation:

Using RFM analysis (Recency, Frequency, Monetary):
1. Calculate R: Days since last purchase
2. Calculate F: Number of purchases
3. Calculate M: Total amount spent

Create customer segments:
- High value: low recency (purchased recently), high frequency and monetary
- At-risk: high recency (long inactive), low frequency and monetary
- Developing: low recency, low frequency, medium monetary
- VIP: any recency, high frequency and monetary

Provide code to segment customers and show segment size/statistics.

ChatGPT generates RFM analysis:

# Reference date (e.g. the day after the last purchase in the dataset)
ref_date = pd.to_datetime('2024-12-31')

# Calculate RFM per customer
# (named aggregation avoids re-using the groupby key column)
rfm = df.groupby('customer_id').agg(
    Recency=('purchase_date', lambda x: (ref_date - x.max()).days),
    Frequency=('purchase_date', 'count'),
    Monetary=('amount', 'sum'),
)

# Quartile scores (1-4, 4 is best); ranking first breaks ties so qcut
# always produces exactly four bins to match the four labels
rfm['R_Score'] = pd.qcut(rfm['Recency'].rank(method='first'), 4, labels=[4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'].rank(method='first'), 4, labels=[1, 2, 3, 4])

# Calculate RFM score
rfm['RFM_Score'] = rfm['R_Score'].astype(int) + rfm['F_Score'].astype(int) + rfm['M_Score'].astype(int)

# Segment
def segment(score):
    if score >= 10:
        return 'VIP'
    elif score >= 8:
        return 'High Value'
    elif score >= 5:
        return 'Medium Value'
    else:
        return 'Low Value'

rfm['Segment'] = rfm['RFM_Score'].apply(segment)

print(rfm['Segment'].value_counts())
print(f"\nSegment statistics:\n{rfm.groupby('Segment')[['Recency', 'Frequency', 'Monetary']].mean()}")

Trend Analysis and Forecasting

For simple trend analysis, try:

Sales Forecasting:

Using the monthly revenue data:
1. Plot monthly sales trend
2. Fit a linear regression to identify trend
3. Calculate seasonal patterns (monthly, quarterly)
4. Make a simple forecast for next 6 months

Provide code and interpretation.
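The linear-trend step of that prompt can be sketched with numpy alone. The revenue figures below are hypothetical; in practice they would come from the `monthly_revenue` series computed earlier:

```python
import numpy as np

# Hypothetical monthly revenue series (12 months, in $K)
revenue = np.array([100, 105, 98, 110, 115, 112, 120, 125, 118, 130, 135, 132], dtype=float)
months = np.arange(len(revenue))

# Fit a linear trend: revenue ~ slope * month + intercept
slope, intercept = np.polyfit(months, revenue, 1)
print(f"Trend: {slope:.2f} per month, baseline {intercept:.2f}")

# Naive forecast for the next 6 months (trend extrapolation only,
# ignoring seasonality -- fine for a first look, not for planning)
future = np.arange(len(revenue), len(revenue) + 6)
forecast = slope * future + intercept
print("Forecast:", np.round(forecast, 1))
```

For anything beyond a first look, ask ChatGPT for a proper time-series model (e.g. seasonal decomposition) rather than a straight line.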

Advanced Analysis: Prompt Template

For complex analysis, provide a structured prompt:

Advanced Analysis Request:

Dataset: [describe data]
Business Question: [what you want to know]
Constraints: [any data limits]
Deliverables:
- Code for analysis
- Visualizations (specific types)
- Summary statistics
- Insights and recommendations

Assume I have pandas, numpy, matplotlib, seaborn, scipy installed.

Collaborating with ChatGPT on Analysis

The iterative process:

Round 1: Initial exploration
"Analyze this customer dataset"
ChatGPT provides: Basic exploration code and findings

Round 2: Deep dive
"The category X is growing fast. Analyze that trend deeper"
ChatGPT provides: Focused analysis on category X

Round 3: Visualization
"Show category X trends by region with seasonal patterns"
ChatGPT provides: Enhanced visualization code

Round 4: Business insights
"What actions should we take based on this analysis?"
ChatGPT provides: Recommendations with supporting data

Data Analysis Best Practices with ChatGPT

  1. Clarify your goal first: "I want to increase customer retention" is better than "analyze the data"

  2. Share data structure: Column names, types, and sample values help ChatGPT understand context

  3. Specify tools: Tell ChatGPT which libraries and Python version you're using

  4. Ask for reproducibility: Request code that reads from files and doesn't hardcode values

  5. Verify assumptions: ChatGPT sometimes makes incorrect assumptions about data

  6. Iterate: Complex analysis requires multiple rounds of refinement

Common Data Analysis Tasks

Cohort Analysis: How do user cohorts (by sign-up date) behave over time?

Time Series Analysis: What are the trends and seasonal patterns?

A/B Testing: How do test and control groups differ statistically?

Correlation Analysis: Which variables predict outcomes?

Anomaly Detection: What unusual patterns exist in the data?

Limitations

ChatGPT can't:

  • Access your data directly (you must provide it)
  • Execute code to verify it works (you must test it)
  • Understand your business context without explanation
  • Know about recent data updates (it works with code, not live data)

Conclusion

ChatGPT accelerates every stage of data analysis—exploration, cleaning, analysis, and visualization. It's most effective when you know what question you're trying to answer and can provide clear context about your data. Use it to prototype analysis quickly, then refine and verify results.

FAQ

Q: Is ChatGPT code always correct? A: No. Always test code before using results. ChatGPT sometimes has syntax errors or logical mistakes, especially with complex operations.

Q: Can ChatGPT handle large datasets? A: Your code can, but ChatGPT can't process the data directly. It helps you write efficient code to handle large datasets.

Q: Should I use ChatGPT for statistical analysis? A: Yes, but verify it's using the right statistical test for your data. ChatGPT sometimes suggests tests without checking assumptions (normality, equal variance, etc.).
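A quick assumption check like the following (on synthetic data here) can flag those violations before you trust a test result:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(100, 15, 80)
group_b = rng.normal(105, 15, 80)

# Normality check (Shapiro-Wilk): p < 0.05 suggests non-normal data
_, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk p = {p_norm:.3f}")

# Equal-variance check (Levene): p < 0.05 suggests unequal variances
_, p_var = stats.levene(group_a, group_b)
print(f"Levene p = {p_var:.3f}")

# If variances differ, Welch's t-test is the safer comparison
_, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t-test p = {p_welch:.3f}")
```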


Written by Sanjeev Sharma
Full Stack Engineer · E-mopro