Reading Excel files in Python enables you to efficiently process spreadsheet data for analysis and automation. Python's libraries provide powerful tools to extract, manipulate, and transform Excel data into formats your applications can use.
This guide covers essential techniques for Excel file handling in Python, with practical examples created using Claude, an AI assistant built by Anthropic. You'll learn implementation strategies and troubleshooting approaches.
Using pandas to read Excel files

import pandas as pd
df = pd.read_excel('data.xlsx')
print(df.head())
ID Name Age City
0 1 Alice 25 New York
1 2 Bob 30 Boston
2 3 Carol 22 Chicago
3 4 David 35 Seattle
4 5 Eve 28 Denver
The pandas library simplifies Excel data handling through its powerful read_excel() function. This function automatically converts your spreadsheet into a DataFrame—a table-like structure that makes data manipulation more intuitive in Python.
The code demonstrates three key capabilities that make pandas particularly effective:

- Importing the library under the conventional pd alias
- Loading a whole worksheet into a DataFrame with a single read_excel() call
- Previewing the first rows with head() to verify the import worked correctly

The resulting DataFrame preserves both the structure and data types from your Excel file. This means numbers stay as numbers and text remains as text, eliminating common data type conversion headaches.
Building on pandas fundamentals, Python offers granular control over Excel data through sheet selection, row filtering, and the powerful openpyxl library for advanced spreadsheet operations.
Selecting sheets with pd.read_excel()
import pandas as pd
# Read a specific sheet by name
df1 = pd.read_excel('data.xlsx', sheet_name='Sheet2')
# Read a specific sheet by index (0-based)
df2 = pd.read_excel('data.xlsx', sheet_name=1)
print(f"Sheet names: {pd.ExcelFile('data.xlsx').sheet_names}")
Sheet names: ['Sheet1', 'Sheet2', 'Sheet3']
The sheet_name parameter in pd.read_excel() gives you precise control over which Excel sheets to import. You can reference sheets either by their name (like 'Sheet2') or by their zero-based index position.
You can also call pd.ExcelFile('data.xlsx').sheet_names to get a list of all available sheets in your workbook. This flexibility proves especially valuable when processing Excel files with multiple worksheets: you can selectively import only the data you need instead of loading the entire workbook.
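Beyond picking one sheet, passing sheet_name=None loads every sheet at once into a dict keyed by sheet name. A minimal sketch, using a generated three-sheet workbook in place of the article's data.xlsx:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "data.xlsx")

# Build a workbook with three sheets (sample data, not the article's file)
with pd.ExcelWriter(path) as writer:
    for name in ["Sheet1", "Sheet2", "Sheet3"]:
        pd.DataFrame({"value": [1, 2]}).to_excel(writer, sheet_name=name, index=False)

# sheet_name=None returns a dict: {sheet name -> DataFrame}
all_sheets = pd.read_excel(path, sheet_name=None)
print(list(all_sheets))  # ['Sheet1', 'Sheet2', 'Sheet3']
```

This is handy when every sheet shares a layout and you want to process them in a loop.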
import pandas as pd
# Read specific rows
df = pd.read_excel('data.xlsx', skiprows=2, nrows=3)
# Read specific columns
df_cols = pd.read_excel('data.xlsx', usecols="A,C:E")
print(df_cols.head(2))
ID Age City
0 1 25 New York
1 2 30 Boston
The pd.read_excel() function gives you precise control over which rows and columns to import from your Excel file. The skiprows parameter skips the specified number of rows from the top, while nrows determines how many rows to read after that point.
The usecols parameter selects specific columns, accepting Excel-style references like "A,C:E" to pick individual columns (A) or ranges (C through E). These parameters work together seamlessly: for example, you could skip the header row, read the next 100 rows, and select only the numerical columns, all in a single function call.
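As a sketch of that combination, the snippet below builds a ten-row file and then skips rows, limits the row count, and selects columns in one call. Note that skiprows also accepts a list of 0-based row positions, which keeps the header row intact (the data here is invented):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "data.xlsx")

# Ten rows of sample data in spreadsheet columns A, B, C
pd.DataFrame({"A": range(10), "B": range(10, 20), "C": range(20, 30)}).to_excel(
    path, index=False
)

# Keep the header (row 0), skip the first two data rows,
# read the next three, and take only spreadsheet columns A and C
df = pd.read_excel(path, skiprows=[1, 2], nrows=3, usecols="A,C")
print(df)
```

Passing a list to skiprows instead of an integer is the usual trick when you want to drop data rows but still use the first row as the header.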
Using openpyxl for Excel operations

from openpyxl import load_workbook
workbook = load_workbook(filename='data.xlsx')
sheet = workbook.active
value = sheet.cell(row=1, column=2).value
print(f"Cell B1 contains: {value}")
Cell B1 contains: Name
The openpyxl library provides direct, cell-level access to Excel files. Unlike pandas, which loads entire datasets, openpyxl lets you work with individual cells and worksheets for more granular control.
- The load_workbook() function opens your Excel file and creates a workbook object that represents the entire file
- The workbook.active property returns the currently selected worksheet
- sheet.cell(row, column).value reads a cell using 1-based indexing. For example, cell B1 is row=1, column=2

This approach works particularly well when you need to read or modify specific cells rather than whole tables, or when you need to maintain precise formatting. For large files, pass read_only=True to load_workbook() to stream the contents instead of loading the entire spreadsheet into memory.
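For large files, openpyxl's read_only mode streams rows instead of materializing the whole workbook. A small sketch, generating its own sample file first:

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "data.xlsx")

# Create a tiny workbook to read back (sample content)
wb_out = Workbook()
ws_out = wb_out.active
ws_out.append(["ID", "Name"])
ws_out.append([1, "Alice"])
wb_out.save(path)

# read_only=True streams the file; iter_rows yields tuples of cell values
wb = load_workbook(path, read_only=True)
sheet = wb.active
rows = [row for row in sheet.iter_rows(values_only=True)]
wb.close()  # read-only workbooks should be closed explicitly
print(rows)
```

iter_rows with values_only=True is usually the fastest way to scan a sheet when you don't need formatting or formulas.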
Building on the foundational Excel reading techniques, Python offers sophisticated data handling capabilities through advanced pandas configurations, formula processing, and multiple engine support for complex spreadsheet operations.
Advanced pandas options

import pandas as pd
df = pd.read_excel(
'data.xlsx',
dtype={'ID': int, 'Age': int, 'Name': str},
converters={'City': lambda x: x.strip().upper()}
)
print(df[['Name', 'City']].head(2))
Name City
0 Alice NEW YORK
1 Bob BOSTON
The pd.read_excel() function offers powerful data type control through its dtype and converters parameters. These options ensure your data imports exactly as intended without manual cleanup later.
- The dtype parameter enforces specific data types for columns. In this example, it ensures ID and Age remain integers while Name stays as text
- The converters parameter transforms data during import. Here, the City values get stripped of extra spaces and converted to uppercase through a lambda function
- The head(2) method displays the first two rows of selected columns to verify the transformations worked correctly

These configurations save time by handling data cleaning and formatting automatically during the import process instead of requiring separate steps afterward.
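To make this runnable without the article's data.xlsx, the sketch below fabricates a file with untidy City values and applies the same dtype and converters pattern:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "data.xlsx")

# Sample file with messy whitespace and casing in the City column
pd.DataFrame({"ID": [1, 2], "City": ["  new york ", " boston"]}).to_excel(
    path, index=False
)

# converters run a function on each raw cell value during import
df = pd.read_excel(
    path,
    dtype={"ID": int},
    converters={"City": lambda x: x.strip().upper()},
)
print(df["City"].tolist())  # ['NEW YORK', 'BOSTON']
```

The cleanup happens inside the read call itself, so no follow-up normalization pass is needed.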
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook('data.xlsx', data_only=False)
ws = wb.active
formula = ws['F1'].value
result = pd.read_excel('data.xlsx').iloc[0, 5]
print(f"Formula: {formula}, Result: {result}")
Formula: =SUM(C1:E1), Result: 55
Python offers two distinct approaches to handle Excel formulas. The openpyxl library with data_only=False reveals the actual formula text, while pandas shows the calculated results.
- Setting data_only=False in load_workbook() preserves formula expressions instead of their computed values
- The ws['F1'].value syntax directly accesses cell contents, including formula text
- pd.read_excel() paired with iloc retrieves the formula's calculated result

This dual capability proves valuable when auditing spreadsheet logic or debugging calculation issues. You can verify both the formula structure and its output in a single Python script.
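One caveat worth knowing: openpyxl never evaluates formulas itself, so a file that Excel has not opened and saved carries no cached results for data_only=True to return. The roundtrip below (with invented cell values) shows the formula text surviving intact:

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "formulas.xlsx")

# Write a formula cell; openpyxl stores the text, it does not compute it
wb = Workbook()
ws = wb.active
ws["A1"], ws["B1"] = 20, 35
ws["C1"] = "=SUM(A1:B1)"
wb.save(path)

# With data_only=False (the default) the formula string comes back verbatim
ws2 = load_workbook(path, data_only=False).active
print(ws2["C1"].value)  # =SUM(A1:B1)
```

To read calculated values from such a file you need a spreadsheet application (or a formula-evaluating library) to have saved it first.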
import pandas as pd
# Using xlrd engine (legacy Excel files .xls)
df_xls = pd.read_excel('legacy.xls', engine='xlrd')
# Using openpyxl engine (default for .xlsx)
df_xlsx = pd.read_excel('data.xlsx', engine='openpyxl')
print("Excel files loaded successfully with appropriate engines")
Excel files loaded successfully with appropriate engines
Python's pandas library adapts to different Excel file formats by using specialized engines. The engine parameter in read_excel() determines how pandas processes your Excel files.
- The xlrd engine handles older .xls files that many organizations still use for legacy data
- Modern .xlsx files use the openpyxl engine, which pandas selects by default

This flexibility means you can work with both legacy and modern Excel files in the same Python script. Pandas manages the complexity of the different formats while maintaining consistent DataFrame output for your analysis.
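A common convenience is to choose the engine from the file extension. The pick_engine helper below is hypothetical, not a pandas API, and the test file is generated on the fly:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "data.xlsx")
pd.DataFrame({"x": [1, 2]}).to_excel(path, index=False)

# Hypothetical helper: map the extension to the matching engine
def pick_engine(filename):
    return "xlrd" if filename.lower().endswith(".xls") else "openpyxl"

df = pd.read_excel(path, engine=pick_engine(path))
print(df["x"].tolist())  # [1, 2]
```

Reading a real .xls file this way additionally requires the xlrd package to be installed.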
Claude is an AI assistant created by Anthropic that helps developers write, understand, and debug Python code. It combines deep technical knowledge with natural conversation to provide clear, actionable guidance for your programming challenges.
Working alongside you like an experienced mentor, Claude helps you navigate Excel data handling complexities in Python. It can explain concepts like optimizing read_excel() parameters, suggest the best approach for your specific needs, or help troubleshoot data import issues.
Start accelerating your Python development today. Sign up for free at Claude.ai to get personalized, expert-level assistance with your code.
Python's Excel reading capabilities shine in real business scenarios, from automating tedious report consolidation to extracting actionable insights from financial spreadsheets.
Python's pandas library efficiently combines multiple Excel reports into a unified dataset, enabling quick analysis of data spread across separate monthly spreadsheets.
import pandas as pd
import glob
# Get all Excel files and combine them
excel_files = glob.glob('monthly_reports/*.xlsx')
all_data = [pd.read_excel(file).assign(Source=file.split('/')[-1]) for file in excel_files[:3]]
combined_df = pd.concat(all_data, ignore_index=True)
print(f"Combined data shape: {combined_df.shape}")
print(combined_df[['Source', 'Revenue']].groupby('Source').sum())
This code efficiently processes multiple Excel files in a directory. The glob module finds all Excel files matching the pattern monthly_reports/*.xlsx. A list comprehension then reads each file into a pandas DataFrame while adding a Source column containing the filename.
- The assign() method adds the Source column during DataFrame creation
- split('/') with the [-1] index extracts just the filename from the full path
- pd.concat() merges all DataFrames into one, with ignore_index=True creating fresh row indices

The final lines display the combined DataFrame's dimensions and calculate revenue sums grouped by source file. The [:3] slice limits processing to the first three files for demonstration purposes.
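A self-contained version of this workflow fabricates three small report files in a temporary folder, then combines them. It uses os.path.basename for the Source column, which behaves the same on Windows and Unix (filenames and revenue figures are invented):

```python
import glob
import os
import tempfile

import pandas as pd

folder = tempfile.mkdtemp()

# Fabricate three monthly report files (stand-ins for monthly_reports/*.xlsx)
for month, revenue in [("jan", 100), ("feb", 150), ("mar", 120)]:
    pd.DataFrame({"Revenue": [revenue]}).to_excel(
        os.path.join(folder, f"{month}.xlsx"), index=False
    )

# Same pattern as above: read each file and tag it with its source name
files = sorted(glob.glob(os.path.join(folder, "*.xlsx")))
frames = [pd.read_excel(f).assign(Source=os.path.basename(f)) for f in files]
combined = pd.concat(frames, ignore_index=True)

totals = combined.groupby("Source")["Revenue"].sum()
print(totals)
```

Sorting the glob results makes the run deterministic, since glob returns files in filesystem order.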
Analyzing financial data with groupby and apply

Python's groupby and apply functions transform raw financial spreadsheet data into meaningful business insights through aggregation and custom calculations.
import pandas as pd
# Read financial data and calculate profit metrics by department
financial_data = pd.read_excel('financial_records.xlsx')
dept_summary = financial_data.groupby('Department').agg({
'Revenue': 'sum', 'Expenses': 'sum'
})
dept_summary['Profit_Margin'] = (dept_summary['Revenue'] - dept_summary['Expenses']) / dept_summary['Revenue'] * 100
print(dept_summary.sort_values('Profit_Margin', ascending=False).head(3))
This code processes financial data from an Excel file to analyze departmental performance. The groupby('Department') function segments the data by department while agg() calculates the total revenue and expenses for each group.
- Profit_Margin is computed with a standard financial formula: (Revenue - Expenses) / Revenue * 100
- sort_values() orders departments by their profit margins in descending order
- The head(3) call displays only the top 3 most profitable departments

This analysis helps identify which departments generate the highest profit margins relative to their revenue. The approach efficiently transforms raw financial data into actionable business metrics.
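The same calculation can be checked with an in-memory stand-in for financial_records.xlsx (department names and figures below are invented):

```python
import pandas as pd

# Fabricated stand-in for the Excel file's contents
financial_data = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Revenue":  [100.0, 200.0, 300.0, 100.0],
    "Expenses": [ 80.0, 120.0, 150.0,  90.0],
})

# Aggregate totals per department, then derive the margin
dept_summary = financial_data.groupby("Department").agg(
    {"Revenue": "sum", "Expenses": "sum"}
)
dept_summary["Profit_Margin"] = (
    (dept_summary["Revenue"] - dept_summary["Expenses"])
    / dept_summary["Revenue"] * 100
)
print(dept_summary.sort_values("Profit_Margin", ascending=False))
```

With these numbers, IT totals 400 revenue against 240 expenses (a 40% margin), edging out Sales at 33.3%.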
Python's Excel reading capabilities can trigger unexpected data type issues, missing value errors, and date parsing challenges that require specific configuration to resolve.
Fixing data type issues with dtype

Excel files can store numbers in formats that pandas misinterprets as text strings. When this happens, numerical operations go wrong: sum() concatenates the string values instead of adding them, and functions like mean() raise errors. The code below demonstrates this common issue.
import pandas as pd
# Numbers might be read as strings
df = pd.read_excel('financial_data.xlsx')
print(df['Revenue'].dtype)
# If Revenue was imported as text, this concatenates strings instead of adding numbers
result = df['Revenue'].sum()
print(f"Total revenue: {result}")
When Excel stores numbers with special formatting or currency symbols, pandas defaults to importing them as text strings. This prevents mathematical operations from working correctly. The following code demonstrates the proper solution.
import pandas as pd
# Explicitly set data types
df = pd.read_excel('financial_data.xlsx', dtype={'Revenue': float})
print(df['Revenue'].dtype)
result = df['Revenue'].sum()
print(f"Total revenue: {result}")
The dtype parameter in pd.read_excel() forces numeric columns to import as their correct data type. By explicitly setting Revenue as float, you ensure pandas treats the values as numbers instead of strings. Note that if the cells contain literal text such as currency symbols or thousands separators, dtype=float raises a ValueError; a converters function that strips those characters first is the safer choice.
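When the cells really do contain text like "$1,000", a converter that strips the symbols is a sketch of one workaround (the sample file and values are fabricated):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "financial_data.xlsx")

# Revenue stored as text with currency symbols, a common real-world wrinkle
pd.DataFrame({"Revenue": ["$1,000", "$2,500"]}).to_excel(path, index=False)

# dtype=float would raise ValueError here, so strip the symbols instead
def clean(value):
    return float(str(value).replace("$", "").replace(",", ""))

df = pd.read_excel(path, converters={"Revenue": clean})
print(df["Revenue"].sum())  # 3500.0
```

The converter runs per cell during the import, so downstream code sees plain floats.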
Handling NaN values in Excel imports

Empty cells in Excel files become NaN (Not a Number) values when imported into pandas DataFrames. These missing values can distort numerical calculations and statistical operations. The code below demonstrates how NaN values affect basic operations like calculating averages.
import pandas as pd
# Empty cells become NaN by default
df = pd.read_excel('customer_data.xlsx')
# This raises error if Age column contains non-numeric values
average_age = df['Age'].mean()
print(f"Average age: {average_age}")
When NaN values mix with valid numbers in the Age column, mean() silently skips them, which can make the average misleading; any leftover text markers like 'N/A' raise a TypeError outright. The DataFrame needs proper handling of these missing values to calculate accurate statistics. The following code demonstrates an effective solution.
import pandas as pd
# Handle missing values during import
df = pd.read_excel('customer_data.xlsx', na_values=['N/A', ''], keep_default_na=True)
# Use dropna or fillna to handle NaN values
average_age = df['Age'].fillna(0).mean()
print(f"Average age: {average_age}")
The na_values parameter tells pandas which values to treat as missing data, while keep_default_na=True preserves pandas' built-in list of NA indicators. This combination catches both explicit markers like 'N/A' and implicit ones like empty cells.
Use fillna(0) to replace missing values with zeros before calculating statistics, or dropna() to exclude them entirely; note that zero-filling pulls averages down, so pick the strategy that matches your intent. The solution ensures your calculations work correctly by explicitly handling these gaps in your data instead of letting them cause errors.
Parsing dates with parse_dates

Excel date formats often confuse pandas during import, causing the read_excel() function to misinterpret date columns as integers or strings. This is most common when dates are stored as text in the spreadsheet, and it prevents standard date operations and time-based analysis from working correctly. The code below demonstrates this common challenge.
import pandas as pd
# Excel dates might be read incorrectly
df = pd.read_excel('dates.xlsx')
print(df['Date'].dtype)
# Attempting date operations can fail
next_day = df.loc[0, 'Date'] + pd.Timedelta(days=1)
Without specifying the parse_dates parameter, pandas reads dates as generic objects that lack datetime functionality. This prevents date arithmetic and time-based analysis from working properly. Let's examine the corrected implementation below.
import pandas as pd
# Explicitly parse date columns
df = pd.read_excel('dates.xlsx', parse_dates=['Date'])
print(df['Date'].dtype)
# Now date operations work correctly
next_day = df.loc[0, 'Date'] + pd.Timedelta(days=1)
The parse_dates parameter in pd.read_excel() ensures Excel dates import as proper Python datetime objects instead of integers or strings. This enables date arithmetic and time-based analysis in your code. Add any date column names to the parse_dates list to convert them automatically during import.

This approach proves more efficient than manually converting dates after import. It also handles various Excel date formats consistently across different spreadsheets.
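As a runnable sketch, the snippet below stores dates as plain text (the case where parse_dates matters most) in a generated file and confirms that date arithmetic works after import:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "dates.xlsx")

# Dates stored as plain text, which pandas would otherwise leave as strings
pd.DataFrame({"Date": ["2024-01-15", "2024-01-16"]}).to_excel(path, index=False)

df = pd.read_excel(path, parse_dates=["Date"])

# Date arithmetic now works because the column is datetime64
next_day = df.loc[0, "Date"] + pd.Timedelta(days=1)
print(df["Date"].dtype, next_day)
```

Without parse_dates, the same addition would raise a TypeError because strings cannot be combined with a Timedelta.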
Claude combines advanced language understanding with deep technical expertise to guide you through Python development challenges. As your AI programming companion, it breaks down complex coding concepts into clear, actionable steps while adapting explanations to your skill level.
Ask questions like "What's the difference between read_excel() and load_workbook()?" and Claude will compare their capabilities and use cases.

Experience personalized coding assistance today by signing up for free at Claude.ai.
For a more integrated development experience, Claude Code brings AI assistance directly to your terminal. Access Claude's capabilities while staying in your preferred development environment.