Pandas for Beginners - Data Analysis with Python
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Pandas is the de facto standard Python library for data analysis. Whether you're cleaning messy CSVs, analyzing business data, or preparing data for machine learning, pandas is the go-to tool.
This guide takes you from zero to confident pandas user.
- Installation
- Creating DataFrames
- Basic Exploration
- Selecting Data
- Data Manipulation
- Groupby — Aggregation
- Handling Missing Data
- Merging and Joining
- Real-World Example: Sales Analysis
- Conclusion
Installation
pip install pandas numpy openpyxl
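After installing, a quick sanity check that the packages import cleanly:

```python
import pandas as pd
import numpy as np

# If either import fails, the installation did not succeed
print(pd.__version__, np.__version__)
```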
Creating DataFrames
main.py
import pandas as pd
# From a dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [28, 34, 25, 31],
'salary': [75000, 90000, 65000, 85000],
'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})
print(df)
# name age salary department
# 0 Alice 28 75000 Engineering
# 1 Bob 34 90000 Marketing
# 2 Charlie 25 65000 Engineering
# 3 Diana 31 85000 HR
# From a CSV file
df = pd.read_csv('employees.csv')
# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Basic Exploration
main.py
df.head(5) # First 5 rows
df.tail(5) # Last 5 rows
df.shape # (rows, columns) — (4, 4)
df.dtypes # Data types of each column
df.info() # Summary: dtypes, null counts, memory
df.describe() # Stats: count, mean, std, min, max
df.columns.tolist() # ['name', 'age', 'salary', 'department']
df.isnull().sum() # Count missing values per column
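Putting the calls above together on the sample DataFrame from earlier, as a minimal runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# shape is a (rows, columns) tuple
assert df.shape == (4, 4)

# describe() covers numeric columns only by default
stats = df.describe()
assert stats.loc['mean', 'salary'] == 78750.0

# no missing values in this sample
assert df.isnull().sum().sum() == 0
```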
Selecting Data
main.py
# Select a column (returns Series)
df['name']
df.name  # attribute access also works, but breaks if the column name clashes with a DataFrame method
# Select multiple columns (returns DataFrame)
df[['name', 'salary']]
# Select rows by index
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[[0, 2]] # Rows 0 and 2
# Select rows by label
df.loc[0, 'name'] # Single value
df.loc[0:2, 'name':'salary'] # Slice
# Boolean filtering
df[df['age'] > 30] # Rows where age > 30
df[df['department'] == 'Engineering']
# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 70000)]
df[(df['department'] == 'Engineering') | (df['department'] == 'HR')]
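A runnable recap of the selection rules above, using the same sample data. Note the difference between `iloc` (position-based, end-exclusive) and `loc` (label-based, end-inclusive):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# Boolean filtering
over_30 = df[df['age'] > 30]
assert over_30['name'].tolist() == ['Bob', 'Diana']

# iloc slices are end-exclusive; loc slices are end-inclusive
assert len(df.iloc[0:3]) == 3
assert len(df.loc[0:2]) == 3  # with the default RangeIndex, both yield 3 rows here

# Combined conditions need parentheses around each comparison
mask = (df['age'] > 25) & (df['salary'] > 70000)
assert df[mask]['name'].tolist() == ['Alice', 'Bob', 'Diana']
```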
Data Manipulation
main.py
# Add a new column
df['salary_k'] = df['salary'] / 1000
df['senior'] = df['age'] > 30
# Rename columns (returns a new DataFrame; here we keep it separate so
# later examples can still use the original 'name' and 'salary' columns)
renamed = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})
# Drop columns
df = df.drop(columns=['salary_k'])
# Sort values
df.sort_values('salary', ascending=False)
df.sort_values(['department', 'salary'])
# Apply a function to a column
df['name_upper'] = df['name'].str.upper()
df['salary_after_tax'] = df['salary'].apply(lambda x: x * 0.75)  # for simple arithmetic, df['salary'] * 0.75 is faster
# Map values
df['level'] = df['salary'].map(
lambda s: 'senior' if s > 80000 else 'mid' if s > 70000 else 'junior'
)
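The `map` example above can be verified end to end on the sample data; note that `sort_values` returns a new DataFrame and leaves the original unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [75000, 90000, 65000, 85000],
})

# Bucket salaries into levels with a chained conditional
df['level'] = df['salary'].map(
    lambda s: 'senior' if s > 80000 else 'mid' if s > 70000 else 'junior'
)
assert df['level'].tolist() == ['mid', 'senior', 'junior', 'senior']

# sort_values does not modify df in place
top = df.sort_values('salary', ascending=False)
assert top.iloc[0]['name'] == 'Bob'
assert df.iloc[0]['name'] == 'Alice'  # original order intact
```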
Groupby — Aggregation
main.py
# Average salary by department
df.groupby('department')['salary'].mean()
# department
# Engineering 70000.0
# HR 85000.0
# Marketing 90000.0
# Multiple aggregations
df.groupby('department').agg({
'salary': ['mean', 'min', 'max', 'count'],
'age': 'mean',
})
# Value counts
df['department'].value_counts()
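Checking the groupby output shown above against the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# Average salary per department
means = df.groupby('department')['salary'].mean()
assert means['Engineering'] == 70000.0  # (75000 + 65000) / 2
assert means['HR'] == 85000.0
assert means['Marketing'] == 90000.0

# value_counts gives per-category row counts
counts = df['department'].value_counts()
assert counts['Engineering'] == 2
```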
Handling Missing Data
main.py
# Create data with NaN
import numpy as np
df.loc[1, 'salary'] = np.nan
# Check for missing values
df.isnull()
df.isnull().sum() # Per column
# Drop rows with any NaN
df.dropna()
# Drop rows where specific column is NaN
df.dropna(subset=['salary'])
# Fill NaN with a value (returns a new Series; assign it back to keep the result)
df['salary'] = df['salary'].fillna(df['salary'].mean())
# Forward fill (propagate the previous valid value)
df.ffill()  # fillna(method='ffill') is deprecated in pandas 2.x
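A self-contained run of the missing-data workflow. Note that filling with the column mean uses only the non-NaN values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [75000, 90000, 65000, 85000],
})

# Introduce a missing value
df.loc[1, 'salary'] = np.nan
assert df['salary'].isnull().sum() == 1

# The mean ignores NaN: (75000 + 65000 + 85000) / 3 = 75000
filled = df['salary'].fillna(df['salary'].mean())
assert filled.loc[1] == 75000.0

# dropna removes the row with the NaN
assert len(df.dropna()) == 3
```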
Merging and Joining
main.py
employees = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [1, 2, 1, 3],
})
departments = pd.DataFrame({
'id': [1, 2, 3],
'dept_name': ['Engineering', 'Marketing', 'HR'],
})
# Inner join (default)
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id')
# Left join
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')
# Concatenate vertically (df1, df2: any two DataFrames with matching columns)
combined = pd.concat([df1, df2], ignore_index=True)
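Running the merge above end to end. One subtlety: both frames have an `id` column, so pandas suffixes the overlapping names with `_x` and `_y` by default:

```python
import pandas as pd

employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 2, 1, 3],
})
departments = pd.DataFrame({
    'id': [1, 2, 3],
    'dept_name': ['Engineering', 'Marketing', 'HR'],
})

# Inner join: every dept_id has a match, so all 4 rows survive
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id')
assert len(merged) == 4

# Overlapping 'id' columns become id_x / id_y
assert 'id_x' in merged.columns and 'id_y' in merged.columns

# Alice's dept_id of 1 resolves to Engineering
alice_dept = merged.loc[merged['name'] == 'Alice', 'dept_name'].iloc[0]
assert alice_dept == 'Engineering'
```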
Real-World Example: Sales Analysis
analysis.py
import pandas as pd
df = pd.read_csv('sales.csv') # date, product, quantity, price
# Data cleaning
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.dropna()
# Monthly revenue
monthly = df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()
# Top 5 products
top_products = (
df.groupby('product')['revenue']
.sum()
.sort_values(ascending=False)
.head(5)
)
# Export results
top_products.to_csv('top_products.csv')
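Since `sales.csv` isn't included with the article, here is the same pipeline run on a small hypothetical dataset (the products and numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for sales.csv: date, product, quantity, price
sales = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-15'],
    'product': ['Widget', 'Gadget', 'Widget', 'Widget'],
    'quantity': [10, 5, 8, 2],
    'price': [3.0, 10.0, 3.0, 3.0],
})

# Same cleaning steps as the article
sales['date'] = pd.to_datetime(sales['date'])
sales['revenue'] = sales['quantity'] * sales['price']

# Monthly revenue: Jan = 10*3 + 5*10 = 80, Feb = 8*3 + 2*3 = 30
monthly = sales.groupby(sales['date'].dt.to_period('M'))['revenue'].sum()
assert monthly[pd.Period('2024-01')] == 80.0
assert monthly[pd.Period('2024-02')] == 30.0

# Top product by total revenue: Widget (60.0) beats Gadget (50.0)
top = sales.groupby('product')['revenue'].sum().sort_values(ascending=False)
assert top.index[0] == 'Widget'
```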
Conclusion
Pandas is indispensable for anyone working with data in Python. Master DataFrames, selection, groupby, and merging — and you can analyze virtually any dataset. Whether you're doing business analytics, data science, or building data pipelines, pandas is always in the toolkit.