Pandas for Beginners - Data Analysis with Python

Introduction

Pandas is the most widely used Python library for data analysis. Whether you're cleaning messy CSVs, analyzing business data, or preparing data for machine learning, pandas is your go-to tool.

This guide takes you from zero to confident pandas user.

Installation

pip install pandas numpy openpyxl

Creating DataFrames

main.py
import pandas as pd

# From a dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

print(df)
#       name  age  salary   department
# 0    Alice   28   75000  Engineering
# 1      Bob   34   90000    Marketing
# 2  Charlie   25   65000  Engineering
# 3    Diana   31   85000           HR

# From a CSV file
df = pd.read_csv('employees.csv')

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Basic Exploration

main.py
df.head(5)          # First 5 rows
df.tail(5)          # Last 5 rows
df.shape            # (rows, columns), e.g. (4, 4)
df.dtypes           # Data types of each column
df.info()           # Summary: dtypes, null counts, memory
df.describe()       # Stats: count, mean, std, min, quartiles, max
df.columns.tolist() # ['name', 'age', 'salary', 'department']
df.isnull().sum()   # Count missing values per column

Selecting Data

main.py
# Select a column (returns Series)
df['name']
df.name  # attribute access also works, but fails for names with spaces or that clash with DataFrame attributes

# Select multiple columns (returns DataFrame)
df[['name', 'salary']]

# Select rows by index
df.iloc[0]      # First row
df.iloc[0:3]    # First 3 rows
df.iloc[[0, 2]] # Rows 0 and 2

# Select rows by label (.loc slices include both endpoints)
df.loc[0, 'name']            # Single value
df.loc[0:2, 'name':'salary'] # Rows 0-2, columns name through salary

# Boolean filtering
df[df['age'] > 30]  # Rows where age > 30
df[df['department'] == 'Engineering']

# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 70000)]
df[(df['department'] == 'Engineering') | (df['department'] == 'HR')]
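For readability, long chains of `==` and `|`/`&` can be replaced with `isin()` or `query()`. A minimal sketch using the same example data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# isin() replaces a chain of == comparisons joined with |
eng_hr = df[df['department'].isin(['Engineering', 'HR'])]
print(eng_hr['name'].tolist())  # ['Alice', 'Charlie', 'Diana']

# query() expresses the same multi-condition filter as a string
experienced = df.query('age > 25 and salary > 70000')
print(experienced['name'].tolist())  # ['Alice', 'Bob', 'Diana']
```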

Data Manipulation

main.py
# Add a new column
df['salary_k'] = df['salary'] / 1000
df['senior'] = df['age'] > 30

# Rename columns (rename returns a copy; kept separate here so the
# later examples can keep using the original 'name' and 'salary')
renamed = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})

# Drop columns
df = df.drop(columns=['salary_k'])

# Sort values (returns a sorted copy; reassign to keep the order)
df.sort_values('salary', ascending=False)
df.sort_values(['department', 'salary'])

# Apply a function to a column
df['name_upper'] = df['name'].str.upper()
df['salary_after_tax'] = df['salary'].apply(lambda x: x * 0.75)

# Map values
df['level'] = df['salary'].map(
    lambda s: 'senior' if s > 80000 else 'mid' if s > 70000 else 'junior'
)
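The lambda-based binning above can also be written with `pd.cut`, which assigns labeled ranges in one call. A minimal sketch on the same salaries (the bin edges are chosen to match the lambda: up to 70000 is junior, 70000-80000 is mid, above 80000 is senior):

```python
import pandas as pd

df = pd.DataFrame({'salary': [75000, 90000, 65000, 85000]})

# pd.cut bins a numeric column into labeled intervals
# (intervals are right-inclusive by default)
df['level'] = pd.cut(
    df['salary'],
    bins=[0, 70000, 80000, float('inf')],
    labels=['junior', 'mid', 'senior'],
)

print(df['level'].tolist())  # ['mid', 'senior', 'junior', 'senior']
```

Unlike a lambda, the bin edges live in one place, which makes them easier to tweak later.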

Groupby — Aggregation

main.py
# Average salary by department
df.groupby('department')['salary'].mean()
# department
# Engineering    70000.0
# HR             85000.0
# Marketing      90000.0

# Multiple aggregations
df.groupby('department').agg({
    'salary': ['mean', 'min', 'max', 'count'],
    'age': 'mean',
})

# Value counts
df['department'].value_counts()
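The dict-based `agg` above produces a MultiIndex on the result columns. Named aggregation gives flat, explicit column names instead; a sketch with the same example frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [75000, 90000, 65000, 85000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('name', 'count'),
).reset_index()

# Groups sort alphabetically: Engineering, HR, Marketing
print(summary.loc[0, 'avg_salary'])  # 70000.0
print(summary.loc[0, 'headcount'])   # 2
```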

Handling Missing Data

main.py
# Create data with NaN
import numpy as np
df.loc[1, 'salary'] = np.nan

# Check for missing values
df.isnull()
df.isnull().sum()  # Per column

# Drop rows with any NaN
df.dropna()

# Drop rows where specific column is NaN
df.dropna(subset=['salary'])

# Fill NaN with a value (fillna returns a copy, so reassign to keep it)
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Forward fill (use previous valid value)
df.ffill()  # fillna(method='ffill') is deprecated in pandas 2.x

Merging and Joining

main.py
employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 2, 1, 3],
})

departments = pd.DataFrame({
    'id': [1, 2, 3],
    'dept_name': ['Engineering', 'Marketing', 'HR'],
})

# Inner join (default). Both frames have an 'id' column, so the result
# gets 'id_x' (from employees) and 'id_y' (from departments)
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id')

# Left join
merged = pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')

# Concatenate vertically (df1 and df2 are any two DataFrames with matching columns)
combined = pd.concat([df1, df2], ignore_index=True)
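To see how the join type matters, here is a sketch where one employee has a department id with no match (dept_id 5 is an invented unmatched key, not part of the data above):

```python
import pandas as pd

employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 2, 1, 5],  # dept_id 5 has no matching department
})

departments = pd.DataFrame({
    'id': [1, 2, 3],
    'dept_name': ['Engineering', 'Marketing', 'HR'],
})

# Inner join drops rows without a match
inner = pd.merge(employees, departments, left_on='dept_id', right_on='id')
print(len(inner))  # 3 — Diana's dept_id has no match

# Left join keeps every employee; unmatched columns become NaN
left = pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')
print(len(left))   # 4 — Diana kept, with dept_name NaN
```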

Real-World Example: Sales Analysis

analysis.py
import pandas as pd

df = pd.read_csv('sales.csv')  # date, product, quantity, price

# Data cleaning
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.dropna()

# Monthly revenue
monthly = df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()

# Top 5 products
top_products = (
    df.groupby('product')['revenue']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

# Export results
top_products.to_csv('top_products.csv')

Conclusion

Pandas is indispensable for anyone working with data in Python. Master DataFrames, selection, groupby, and merging, and you can analyze virtually any dataset. Whether you're doing business analytics, data science, or building data pipelines, pandas belongs in the toolkit.