Combining Excel with Python for Powerful Data Science Workflow

Excel is a go-to tool for many data analysts because of its simplicity and flexibility. But for large, repetitive, or complex data tasks, Python provides speed, automation, and advanced analytics. By integrating Excel with Python, you can leverage the best of both worlds.

In this tutorial, we will show how to combine Excel with Python for a powerful data science workflow.

Required Tools & Setup

Before combining Excel with Python, set up the environment. This ensures your workflow is smooth and productive from the very first step.

Prerequisites:

  • Microsoft Excel: For initial data review and reporting.
  • Python 3.x: The engine for your data science workflow.
  • Python Libraries:
    • pandas for data loading and analysis.
    • numpy for numerical operations.
    • matplotlib (optionally seaborn) for plotting and visualization.
    • openpyxl for reading and writing .xlsx files.
    • scikit-learn for the machine learning example later in this tutorial.

Install Python Libraries:

pip install pandas numpy matplotlib openpyxl scikit-learn

1. Read Data into Python

You can load your data into Python using pandas, which makes it easy to manipulate and analyze tabular data.

import pandas as pd

# Read data from Excel file
df = pd.read_excel('SalesData.xlsx')

# Preview data
print(df.head()) # Show the first 5 rows of the data
print(df.info()) # Show info about columns, datatypes, and missing values
  • pd.read_excel() reads the Excel file into a pandas DataFrame.
  • df.head() displays the first five rows; it is great for a quick check.
  • df.info() shows the number of rows, the column datatypes, and the non-null counts, which reveal missing values.

You will see the first few rows of your sales data, along with a summary like:

   TransactionID       Date CustomerID ProductID ProductName     Category  Quantity  UnitPrice Region Channel SalesRep
0         100001 2024-01-02      C-100     P-101      Laptop  Electronics       2.0      800.0   East  Online    Smith
1         100002 2024-01-02      C-101     P-102     Printer  Electronics       1.0      200.0   West  Retail  Johnson
2         100003 2024-01-03      C-102     P-103       Mouse  Electronics       5.0       25.0  North  Online      Lee
3         100004 2024-01-04      C-103     P-104        Desk    Furniture       1.0      150.0  South  Retail    Brown
4         100005 2024-01-05      C-104     P-105     Monitor  Electronics       3.0      175.0    NaN  Online    Davis
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   TransactionID  63 non-null     int64
 1   Date           62 non-null     datetime64[ns]
 2   CustomerID     62 non-null     object        
 3   ProductID      61 non-null     object        
 4   ProductName    63 non-null     object        
 5   Category       61 non-null     object        
 6   Quantity       61 non-null     float64       
 7   UnitPrice      62 non-null     float64
 8   Region         62 non-null     object
 9   Channel        62 non-null     object
 10  SalesRep       62 non-null     object
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 5.5+ KB
None

2. Data Cleaning & Transformation

Raw data is rarely ready for analysis. In this step, you’ll fix missing values, convert columns to the right types, and add new calculated fields.

Remove Duplicates:

# Remove Duplicates
df = df.drop_duplicates()
  • Drops fully duplicated rows, keeping the first occurrence of each.

Check for Missing Values:

# Print count of missing values per column
print(df.isnull().sum())
  • It shows how many missing (NaN) values are in each column. If any are found, you can decide to drop or fill them.
#Output:
TransactionID    0
Date             1
CustomerID       1
ProductID        2
ProductName      0
Category         2
Quantity         2
UnitPrice        1
Region           1
Channel          1
SalesRep         1
dtype: int64
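Once you know where values are missing, you can fill or drop them. A minimal sketch on a small illustrative frame (the fill choices here, median for numbers and 'Unknown' for categories, are one reasonable convention, not the only option):

```python
import pandas as pd

# Illustrative frame with one missing Quantity and one missing Region
df = pd.DataFrame({
    'Quantity': [2.0, None, 5.0],
    'Region': ['East', None, 'North'],
})

# Fill missing Quantity with the column median; label missing Region as 'Unknown'
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())
df['Region'] = df['Region'].fillna('Unknown')

# Alternatively, drop rows still missing critical fields:
# df = df.dropna(subset=['Quantity'])

print(df.isnull().sum().sum())  # 0
```

After filling, no NaN values remain, so downstream calculations like TotalSales will not silently produce NaN results.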

Convert Data Types:

# Convert 'Date' column to pandas datetime type for easier filtering/grouping
df['Date'] = pd.to_datetime(df['Date'])
  • It converts the Date column from text to pandas datetime format for easier filtering and grouping.
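If some date entries are malformed, pd.to_datetime() raises an error by default. Passing errors='coerce' turns unparseable entries into NaT so you can treat them like any other missing value. A small illustration with made-up data:

```python
import pandas as pd

# One entry here is deliberately not a valid date
dates = pd.Series(['2024-01-02', 'not a date', '2024-03-15'])

# errors='coerce' converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # 1
```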

Create a ‘TotalSales’ Column:

# Add a new column: total value for each transaction
df['TotalSales'] = df['Quantity'] * df['UnitPrice']
  • Adds a new column showing the total value for each transaction.

Extract Month for Time Series Analysis:

df['Month'] = df['Date'].dt.to_period('M')
  • This creates a Month column to group and analyze sales by month.
  • Now use print(df.head()) to preview the cleaned data.
#Output:
TransactionID       Date CustomerID ProductID ProductName     Category  ...  UnitPrice  Region Channel SalesRep TotalSales    Month
0         100001 2024-01-02      C-100     P-101      Laptop  Electronics  ...      800.0    East  Online    Smith     1600.0  2024-01
1         100002 2024-01-02      C-101     P-102     Printer  Electronics  ...      200.0    West  Retail  Johnson      200.0  2024-01
2         100003 2024-01-03      C-102     P-103       Mouse  Electronics  ...       25.0   North  Online      Lee      125.0  2024-01
3         100004 2024-01-04      C-103     P-104        Desk    Furniture  ...      150.0   South  Retail    Brown      150.0  2024-01
4         100005 2024-01-05      C-104     P-105     Monitor  Electronics  ...      175.0     NaN  Online    Davis      525.0  2024-01

3. Analyze Your Data

With a clean dataset, you can now generate insights that drive business value. This includes aggregating sales by month, product, and region.

Total Sales by Month:

# Group by month and sum the total sales for each month
monthly_sales = df.groupby('Month')['TotalSales'].sum()
print(monthly_sales)
  • Groups the data by Month and sums the TotalSales for each month.
#Output:
Month
2024-01    9075.0
2024-02    9800.0
2024-03    9075.0
Freq: M, Name: TotalSales, dtype: float64

Top-Selling Products:

# Group by product, sum total sales, and sort from highest to lowest
product_sales = df.groupby('ProductName')['TotalSales'].sum().sort_values(ascending=False)
print(product_sales)
  • Sums sales per product, then sorts them from most to least popular.
#Output:
ProductName
Laptop     15200.0
Monitor     3850.0
Printer     3200.0
Desk        2550.0
Chair       2325.0
Mouse       1125.0
Name: TotalSales, dtype: float64
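If you only need the top few products rather than the full ranking, Series.nlargest() is a convenient shortcut. A sketch using the totals from the ranking above:

```python
import pandas as pd

# Product totals (same figures as the sorted output above)
product_sales = pd.Series(
    {'Laptop': 15200.0, 'Monitor': 3850.0, 'Printer': 3200.0,
     'Desk': 2550.0, 'Chair': 2325.0, 'Mouse': 1125.0},
    name='TotalSales',
)

top3 = product_sales.nlargest(3)  # three largest values, already sorted
print(top3.index.tolist())  # ['Laptop', 'Monitor', 'Printer']
```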

Sales by Region:

# Group by region and sum total sales per region
region_sales = df.groupby('Region')['TotalSales'].sum()
print(region_sales)
  • Aggregates total sales by each region.
#Output:
Region
East     6075.0
North    5925.0
South    8225.0
West     7500.0

4. Visualize Key Insights

Data is more powerful when visualized. Let’s create quick charts to help you and stakeholders grasp key trends at a glance.

4.1. Monthly Sales Trend

import matplotlib.pyplot as plt  # Import for plotting

# Create a bar chart of sales by month
monthly_sales.plot(
    kind='bar', 
    title='Total Sales by Month', 
    ylabel='Sales ($)', 
    xlabel='Month'
)
plt.tight_layout()           # Avoid label overlap
plt.savefig('monthly_sales.png')  # Save the figure as a PNG file
plt.show()                  # Display the chart
  • Plots monthly sales as a bar chart.
  • plt.savefig() saves the chart as an image file you can drop into reports.
  • The bar chart shows how sales change from month to month.


4.2. Sales by Region

# Pie chart of sales by region
region_sales.plot(
    kind='pie', 
    autopct='%1.1f%%', 
    title='Sales Distribution by Region'
)
plt.ylabel('')              # Remove default y-label
plt.tight_layout()
plt.savefig('region_sales.png')
plt.show()
  • Pie chart of sales by region, great for management or marketing.


5. Advanced Analysis & Modeling

Beyond basic grouping and summaries, Python enables advanced statistical analysis, pivot tables, and even machine learning, all with just a few lines of code. Let’s dig deeper into the data to unlock even more insights.

5.1. Descriptive Statistics

Descriptive statistics give you a quick summary of your dataset, showing means, standard deviations, and quantiles for numeric columns.

# Show summary statistics for numeric columns (mean, std, min, max, quartiles, etc.)
print(df.describe())
  • df.describe() quickly summarizes all numeric columns (like Quantity, UnitPrice, TotalSales).
#Output:
       TransactionID   Quantity   UnitPrice   TotalSales
count      61.000000  59.000000   60.000000    59.000000
mean   100030.180328   2.542373  262.083333   478.813559
std        17.497150   1.534905  277.339497   527.085627
min    100001.000000   1.000000   25.000000    75.000000
25%    100015.000000   1.000000   75.000000   162.500000
50%    100030.000000   2.000000  175.000000   300.000000
75%    100045.000000   3.000000  200.000000   525.000000
max    100060.000000   7.000000  800.000000  2400.000000

5.2. Pivot Tables in pandas

Pivot tables are powerful for interactive reporting in Excel, and pandas can do them too.

# Create a pivot table: sum TotalSales for each Region
pivot = df.pivot_table(index='Region', values='TotalSales', aggfunc='sum')
print(pivot)
  • pivot_table() summarizes TotalSales for each region, similar to Excel’s Pivot Table.
#Output
        TotalSales
Region
East        6075.0
North       5925.0
South       8225.0
West        7500.0
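Like Excel's Pivot Table, pivot_table() can cross two dimensions at once: passing columns='Category' breaks each region's sales down by product category. A self-contained sketch with made-up numbers, using the same column names as the sales dataset:

```python
import pandas as pd

# Illustrative transactions (values are made up for the example)
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Category': ['Electronics', 'Furniture', 'Electronics', 'Furniture'],
    'TotalSales': [1600.0, 150.0, 200.0, 300.0],
})

# Rows = Region, columns = Category, cells = summed TotalSales
pivot = df.pivot_table(index='Region', columns='Category',
                       values='TotalSales', aggfunc='sum', fill_value=0)
print(pivot)
```

fill_value=0 replaces empty cells with zero, just like unchecking "Show empty cells as blank" in Excel.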

5.3. Simple Machine Learning Example

Let’s see if we can predict total sales from just the quantity sold using a simple linear regression (machine learning) model.

from sklearn.linear_model import LinearRegression  # Import linear regression from scikit-learn

# Prepare features and target variable
X = df[['Quantity']]        # Feature: Quantity sold
y = df['TotalSales']        # Target: Total sales value

# Create and fit the regression model
model = LinearRegression()
model.fit(X, y)

# Print the regression coefficient (slope)
print('Coefficient:', model.coef_)
# Print the intercept (base value when Quantity=0)
print('Intercept:', model.intercept_)
  • Imports LinearRegression from scikit-learn.
  • Uses Quantity to predict TotalSales.
  • Fits the model and prints the coefficient (the change in TotalSales per extra unit sold). Note that the coefficient here is negative: in this dataset, high-quantity orders tend to be for cheaper items (many mice rather than one laptop), so quantity alone is a weak predictor of revenue.
#Output:
Coefficient: [-37.65294772]
Intercept: 596.8483500185391
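Once fitted, a model can score new data with model.predict(). A minimal, self-contained sketch on synthetic data (the numbers below are illustrative, not from the sales dataset):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Synthetic data where TotalSales is exactly 100 per unit sold
X = np.array([[1], [2], [3], [4]])
y = np.array([100.0, 200.0, 300.0, 400.0])

model = LinearRegression().fit(X, y)

# Predict total sales for a 5-unit order
pred = model.predict(np.array([[5]]))
print(pred[0])  # 500.0
```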

6. Export Cleaned/Analyzed Data Back to Excel

After cleaning, analyzing, and modeling your data, you can export your summary tables and insights to a multi-sheet Excel file. This keeps all your key findings together and ready for review in Excel.

# Export summary and advanced analysis tables to a multi-sheet Excel file

with pd.ExcelWriter('sales_summary.xlsx') as writer:
    # Monthly summary
    monthly_sales.to_frame().to_excel(writer, sheet_name='Monthly Sales')
    # Product summary
    product_sales.to_frame().to_excel(writer, sheet_name='Product Sales')
    # Region summary
    region_sales.to_frame().to_excel(writer, sheet_name='Region Sales')
    # Pivot table (total sales by region)
    pivot.to_excel(writer, sheet_name='Pivot Table')
    # Optionally, you can export descriptive statistics
    df.describe().to_excel(writer, sheet_name='Descriptive Stats')
  • Context Manager (with … as writer): Ensures the Excel file is properly saved and closed after writing.
  • .to_excel() for Each Table: Saves each DataFrame or summary to its own sheet for easy access.
  • Custom Sheet Names: Each sheet is named for clarity, matching your analysis steps.


  • Open sales_summary.xlsx in Excel.
  • You’ll see separate sheets for Monthly Sales, Product Sales, Region Sales, your Pivot Table, and Descriptive Statistics.
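You can also pull a multi-sheet workbook back into Python in one call: pd.read_excel() with sheet_name=None returns a dict mapping sheet names to DataFrames. A self-contained sketch that writes a small two-sheet file (the filename and figures are illustrative) and reads every sheet back:

```python
import pandas as pd

# Write a small two-sheet workbook (requires openpyxl)
with pd.ExcelWriter('demo_summary.xlsx') as writer:
    pd.DataFrame({'TotalSales': [100.0]}).to_excel(writer, sheet_name='Monthly Sales')
    pd.DataFrame({'TotalSales': [200.0]}).to_excel(writer, sheet_name='Region Sales')

# sheet_name=None reads all sheets into a dict of DataFrames
sheets = pd.read_excel('demo_summary.xlsx', sheet_name=None)
print(list(sheets.keys()))  # ['Monthly Sales', 'Region Sales']
```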


7. Automate & Scale Your Workflow

By using Python, you can automate recurring reports and analyses. Next time you get a new Excel file, just replace the file and re-run your script; all analysis and reports refresh instantly.

  • Keep all analysis code in one Python file.
  • To update your reports, replace the Excel file and run:
python Excel_to_Python.py


  • For even more power, you can schedule this as a weekly/monthly task.
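On Linux or macOS, one common way to schedule the script is a cron entry (on Windows, Task Scheduler can run the same command). The paths below are illustrative; adjust them to your Python and script locations:

```shell
# Example crontab entry (edit with: crontab -e)
# Runs the report every Monday at 8:00 AM
0 8 * * 1 /usr/bin/python3 /path/to/Excel_to_Python.py
```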

Conclusion

By combining Excel’s intuitive data entry and reporting with Python’s data science power, you can process and analyze large, messy datasets efficiently, automate repetitive reporting tasks, and unlock machine learning and advanced visualization. You don’t need to become a Python expert overnight: start with one simple task, and once that works, add one more step. Before you know it, you’ll be automating complex reports.


Shamima Sultana

Shamima Sultana, BSc in Computer Science and Engineering, East West University, Bangladesh, has been working with the ExcelDemy project for 3+ years. She has written and reviewed 1000+ articles for ExcelDemy and has led several teams in Excel VBA and content development work. Currently, she works as the Project Manager, overseeing day-to-day work, leading the services team, and allocating resources.
