Pandas for Beginners - Data Analysis with Python

Sanjeev Sharma
4 min read


Introduction

Pandas is the workhorse Python library for data analysis. Whether you're cleaning messy CSVs, analyzing business data, or preparing data for machine learning, pandas is the go-to tool.

This guide takes you from zero to confident pandas user.

Installation

pip install pandas numpy openpyxl
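To verify the installation, import the packages and print their versions (output varies by environment):

```python
import pandas as pd
import numpy as np

# Confirm both packages import cleanly and report their versions
print(pd.__version__)
print(np.__version__)
```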

Creating DataFrames

main.py
import pandas as pd

# From a dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

print(df)
#       name  age  salary   department
# 0    Alice   28   75000  Engineering
# 1      Bob   34   90000    Marketing
# 2  Charlie   25   65000  Engineering
# 3    Diana   31   85000           HR

# From a CSV file
df = pd.read_csv('employees.csv')

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
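DataFrames can also be built from a list of records (one dict per row), which is handy when data arrives as JSON; the field names below are just for illustration:

```python
import pandas as pd

# Each dict becomes one row; keys become column names
records = [
    {'name': 'Alice', 'age': 28},
    {'name': 'Bob', 'age': 34},
]
df = pd.DataFrame(records)
print(df.shape)  # (2, 2)
```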

Basic Exploration

main.py
df.head(5)          # First 5 rows
df.tail(5)          # Last 5 rows
df.shape            # (rows, columns) — (4, 4)
df.dtypes           # Data types of each column
df.info()           # Summary: dtypes, null counts, memory
df.describe()       # Stats: count, mean, std, min, quartiles, max
df.columns.tolist() # ['name', 'age', 'salary', 'department']
df.isnull().sum()   # Count missing values per column
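Run against the four-row DataFrame from the previous section, a few of these calls produce:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

print(df.shape)                 # (4, 4)
print(df['age'].mean())         # 29.5
print(df.isnull().sum().sum())  # 0 missing values in total
```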

Selecting Data

main.py
# Select a column (returns Series)
df['name']
df.name  # attribute access also works, but not for names with spaces or that clash with DataFrame attributes

# Select multiple columns (returns DataFrame)
df[['name', 'salary']]

# Select rows by index
df.iloc[0]      # First row
df.iloc[0:3]    # First 3 rows
df.iloc[[0, 2]] # Rows 0 and 2

# Select rows by label (loc slices include both endpoints)
df.loc[0, 'name']            # Single value
df.loc[0:2, 'name':'salary'] # Rows 0 through 2, columns name through salary

# Boolean filtering
df[df['age'] > 30]  # Rows where age > 30
df[df['department'] == 'Engineering']

# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 70000)]
df[(df['department'] == 'Engineering') | (df['department'] == 'HR')]
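The same filters can be written with df.query(), which takes the condition as a string and often reads more cleanly than chained boolean masks:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# Same as df[(df['age'] > 25) & (df['salary'] > 70000)]
subset = df.query('age > 25 and salary > 70000')
print(subset['name'].tolist())  # ['Alice', 'Bob', 'Diana']
```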

Data Manipulation

main.py
# Add a new column
df['salary_k'] = df['salary'] / 1000
df['senior'] = df['age'] > 30

# Rename columns
df = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})

# Drop columns
df = df.drop(columns=['salary_k'])

# Sort values
df.sort_values('salary', ascending=False)
df.sort_values(['department', 'salary'])

# Apply a function to a column
df['name_upper'] = df['name'].str.upper()
df['salary_after_tax'] = df['salary'].apply(lambda x: x * 0.75)  # vectorized df['salary'] * 0.75 is faster

# Map values
df['level'] = df['salary'].map(
    lambda s: 'senior' if s > 80000 else 'mid' if s > 70000 else 'junior'
)
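The nested-lambda mapping above can also be written with pd.cut, which bins a numeric column into labeled intervals; the bin edges here mirror the thresholds in the lambda:

```python
import pandas as pd

salaries = pd.Series([75000, 90000, 65000, 85000])

# Intervals are right-inclusive by default: (0, 70000] junior, (70000, 80000] mid, (80000, inf) senior
levels = pd.cut(
    salaries,
    bins=[0, 70000, 80000, float('inf')],
    labels=['junior', 'mid', 'senior'],
)
print(levels.tolist())  # ['mid', 'senior', 'junior', 'senior']
```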

Groupby — Aggregation

main.py
# Average salary by department
df.groupby('department')['salary'].mean()
# department
# Engineering    70000.0
# HR             85000.0
# Marketing      90000.0

# Multiple aggregations
df.groupby('department').agg({
    'salary': ['mean', 'min', 'max', 'count'],
    'age': 'mean',
})

# Value counts
df['department'].value_counts()
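Named aggregation, the agg(new_name=(column, func)) form, produces flat column names instead of the MultiIndex the dict form returns:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
    'salary': [75000, 90000, 65000, 85000],
    'age': [28, 34, 25, 31],
})

# One output column per keyword; names stay flat and readable
summary = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count'),
)
print(summary.loc['Engineering', 'avg_salary'])  # 70000.0
print(summary.columns.tolist())                  # ['avg_salary', 'headcount']
```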

Handling Missing Data

main.py
# Create data with NaN
import numpy as np
df.loc[1, 'salary'] = np.nan

# Check for missing values
df.isnull()
df.isnull().sum()  # Per column

# Drop rows with any NaN
df.dropna()

# Drop rows where specific column is NaN
df.dropna(subset=['salary'])

# Fill NaN with a value (returns a new Series; assign back to keep the change)
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Forward fill (use the previous valid value)
df.ffill()  # fillna(method='ffill') is deprecated in pandas 2.x
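Putting the section together on a small made-up salary column; note that fillna returns a new Series, so you assign it back:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'salary': [75000, np.nan, 65000, 85000]})

print(df['salary'].isnull().sum())  # 1

# Fill the gap with the column mean (75000.0 here) and assign back
df['salary'] = df['salary'].fillna(df['salary'].mean())
print(df['salary'].isnull().sum())  # 0
```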

Merging and Joining

main.py
employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 2, 1, 3],
})

departments = pd.DataFrame({
    'id': [1, 2, 3],
    'dept_name': ['Engineering', 'Marketing', 'HR'],
})

# Inner join (default); overlapping column names like 'id' get _x/_y suffixes
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id')

# Left join (keep every row from employees)
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')

# Concatenate vertically (df1 and df2 are DataFrames with matching columns)
combined = pd.concat([df1, df2], ignore_index=True)
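A minimal runnable sketch of pd.concat, with df1 and df2 made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'name': ['Charlie']})

# ignore_index=True renumbers rows 0..n-1 instead of keeping duplicate indices
combined = pd.concat([df1, df2], ignore_index=True)
print(combined['name'].tolist())  # ['Alice', 'Bob', 'Charlie']
print(combined.index.tolist())    # [0, 1, 2]
```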

Real-World Example: Sales Analysis

analysis.py
import pandas as pd

df = pd.read_csv('sales.csv')  # date, product, quantity, price

# Data cleaning
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.dropna()

# Monthly revenue
monthly = df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()

# Top 5 products
top_products = (
    df.groupby('product')['revenue']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

# Export results
top_products.to_csv('top_products.csv')
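Here is the same pipeline as a self-contained sketch with a tiny made-up sales table (the columns match the CSV assumed above):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-03'],
    'product': ['widget', 'gadget', 'widget'],
    'quantity': [2, 1, 3],
    'price': [10.0, 25.0, 10.0],
})

df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']

# One revenue total per calendar month
monthly = df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()
print(monthly.tolist())  # [45.0, 30.0]  (January, February)
```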

Conclusion

Pandas is indispensable for anyone working with data in Python. Master DataFrames, selection, groupby, and merging — and you can analyze virtually any dataset. Whether you're doing business analytics, data science, or building data pipelines, pandas is always in the toolkit.


Written by Sanjeev Sharma, Full Stack Engineer · E-mopro