"Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Qwen-Agent needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas"
Resources
3Install
npx skillscat add qwen-applications/trace2skill/xlsx Install via the SkillsCat registry.
Important Automation Guidelines
Prefer Python over VBA for Automation: When tasks request VBA macros or spreadsheet automation, implement the logic in Python using openpyxl/pandas instead. This provides better error handling, easier debugging, cross-platform compatibility, and avoids macro security issues.
When users request VBA/macros: Implement the logic directly in Python using openpyxl or pandas instead of generating actual VBA code.# Requirements for Outputs
All Excel files
Zero Formula Errors
- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)
Preserve Existing Templates (when updating templates)
- Study and EXACTLY match existing format, style, and conventions when modifying files
- Never impose standardized formatting on files with established patterns
- Existing template conventions ALWAYS override these guidelines
Financial models
Color Coding Standards
Unless otherwise stated by the user or existing template
Industry-Standard Color Conventions
- Blue text (RGB: 0,0,255): Hardcoded inputs, and numbers users will change for scenarios
- Black text (RGB: 0,0,0): ALL formulas and calculations
- Green text (RGB: 0,128,0): Links pulling from other worksheets within same workbook
- Red text (RGB: 255,0,0): External links to other files
- Yellow background (RGB: 255,255,0): Key assumptions needing attention or cells that need to be updated
Number Formatting Standards
Required Format Rules
- Years: Format as text strings (e.g., "2024" not "2,024")
- Currency: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)")
- Zeros: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-")
- Percentages: Default to 0.0% format (one decimal)
- Multiples: Format as 0.0x for valuation multiples (EV/EBITDA, P/E)
- Negative numbers: Use parentheses (123) not minus -123
Formula Construction Rules
Assumptions Placement
- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells
- Use cell references instead of hardcoded values in formulas
- Example: Use =B5*(1+$B$6) instead of =B5*1.05
Formula Error Prevention
- Verify all cell references are correct
- Check for off-by-one errors in ranges
- Ensure consistent formulas across all projection periods
- Test with edge cases (zero values, negative numbers)
- Verify no unintended circular references
Documentation Requirements for Hardcodes
- Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]"
- Examples:
- "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]"
- "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]"
- "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity"
- "Source: FactSet, 8/20/2025, Consensus Estimates Screen"
XLSX creation, editing, and analysis
Overview
A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks.
⚠️ When to Compute Directly in Python
If LibreOffice/recalc.py cannot evaluate certain formula patterns (complex array formulas, advanced INDEX/MATCH combinations), compute the logic directly in Python and write results as values:
## Multi-Range Data Processing
For spreadsheets containing multiple independent data blocks separated by blank rows, see: **[references/multi-range-data-processing.md](references/multi-range-data-processing.md)**
This covers:
- Identifying range boundaries based on data presence
- Preserving blank row separators between blocks
- Anchoring headers when sorting each independent range
- String-based sorting for mixed alphanumeric data
## New Section
## Data Transformation Patterns
### Expanding Rows with Optional Dimensions
When generating rows from multiple ID columns, handle cases where some dimensions may be missing:
- Check each dimension independently before requiring all to exist
- Generate single-dimensional output when only one field exists
- Generate full cartesian product when multiple dimensions exist
See `references/data-expansion-patterns.md` for detailed implementation.
### Parsing Complex Filter Strings
Filter expressions may contain mixed syntax formats (single equals, IN LIST, etc.)
- Use targeted regex patterns for each field type
- Search independently for each condition regardless of order
- Handle both single values and comma-separated lists
See `references/filter-parsing-patterns.md` for detailed implementation.
## Formula Patterns Reference
For detailed examples of common formula patterns used in financial modeling, see:
- [references/formula-patterns.md](references/formula-patterns.md) - Conditional logic, lookups, and aggregation patterns
## Data Processing Edge Cases
### Handle Whitespace Variations in Headers
When locating header rows, normalize strings to handle inconsistent formatting:
```python
# Good: Strip whitespace before comparison
if str(row[0]).strip() == 'Teacher ID':
header_row_idx = iPreserve Pre-Header Structure
If sheets contain rows before headers (title blocks, empty rows), preserve them separately:
before_header = data[:header_row_idx] # Save title/spacing rows
final_data = before_header + [header] + cleaned_rows # Reconstruct outputColumn-Specific Deduplication
When deduplicating by specific columns only, track those values separately:
seen_values = set()
col_a_values = []
for row in data:
col_a_value = row[0]
if col_a_value not in seen_values:
seen_values.add(col_a_value)
col_a_values.append(row) # Keep full rowAdvanced Patterns
For specialized scenarios like dynamic cross-sheet references, see references/dynamic-sheet-references.md.
Good: Direct computation when formula engine fails
df['result'] = df.apply(lambda row: calculate_value(row), axis=1)
sheet['B2'] = df['result'].values[0] # Write computed value
Use this fallback when:
- Formula returns errors after recalculation despite correct syntax
- Array formulas or complex nested lookups don't work in LibreOffice
- Performance issues with large datasets requiring formula evaluation## Important Requirements
**LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. The script automatically configures LibreOffice on first run
## Reading and analyzing data
**Warning**: pandas truncates trailing empty rows and does not support row shifting operations (insert/delete rows). For any task involving structural changes like row/column insertion or deletion, use **openpyxl** instead.
### Step 1: Explore Data Structure First
Before implementing any changes:
- Read the spreadsheet to identify available sheets, column names, and data types
- Compare raw input structure with expected output to understand required transformations
- Note differences in row counts, column arrangements, or data consolidation needs
- This prevents implementing incorrect logic and validates your solution approach### Data analysis with pandas
For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities:
```python
import pandas as pd
# Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
# Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
# Write Excel
df.to_excel('output.xlsx', index=False)Excel File Workflows
✅ When Python Calculation + Hardcoding Is Appropriate
Use Python to calculate and hardcode values when:
- Logic involves stateful dependencies across rows (e.g., tracking boundaries, cumulative counts, "count since last occurrence")
- Range endpoints depend on dynamic patterns in the data itself
- Ranges are determined by dynamic boundary markers in the data
- Excel formulas would require complex nested conditions or helper columns beyond practical complexity
- The computation requires iterative processing that Excel cannot express concisely
- Segment-based computations where boundaries vary per row
- Complex dynamic array formulas (TEXTJOIN, FILTER, AGGREGATE) repeatedly fail across spreadsheet engines
- LibreOffice Calc compatibility issues persist after multiple attempts
- Target engine lacks Excel 365 function support
- Formula complexity outweighs benefit of recalculation capability
Example scenario: Counting values between successive occurrences of a marker value in another column requires identifying boundary rows first, then processing each segment—best done in Python.
Decision Rule: If you need to track prior rows or find pattern-based boundaries before computing, use Python. Otherwise, prefer Excel formulas.## CRITICAL: Use Formulas, Not Hardcoded Values
Dynamic vs Static Output Decision
Default: Use Excel formulas instead of calculating values in Python and hardcoding them. This ensures the spreadsheet remains dynamic and updateable.
Choose Approach Based on Task Type
Dynamic Models → Use Excel Formulas
For spreadsheets users will modify or update, embed formulas so calculations auto-recalculate.
Automation Tasks → Prefer Hardcoded Values
When LibreOffice recalculation has been unreliable, compute values programmatically and write static outputs. This is more dependable for one-time data processing where spreadsheet interactivity isn't needed.
⚠️ When Hardcoded Values Are More Reliable
For complex nested formulas (AGGREGATE, INDEX/MATCH with SEARCH) that LibreOffice may fail to evaluate:
- Compute results programmatically using pandas/openpyxl
- Write static calculated values instead of formulas
- This is acceptable when formula evaluation reliability is uncertain
Exception: When user explicitly requests "hardcode", "static output", or "finalized values", write calculated integers directly to cells. Follow user intent over general best practices.
This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
Cross-sheet aggregation: Always use Excel formulas like =SUM(Sheet1!A1:A10) instead of computing sums in Python and hardcoding results.
❌ WRONG - Hardcoding Calculated Values
# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth # Hardcodes 0.15
# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg # Hardcodes 42.5✅ CORRECT - Using Excel Formulas
# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'
# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'
# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
⚠️ CRITICAL WARNING: Formula Recalculation Is Mandatory
If you write ANY formulas to an Excel file using openpyxl, you MUST run recalc.py before considering the task complete.
Formulas written via openpyxl exist only as text strings until recalculated. Without running recalc.py:
- Cells return None/empty when read with data_only=True
- Evaluation fails even if formulas are syntactically correct
- The output file is incomplete
This is non-negotiable. Do not proceed to verification or delivery until recalc.py confirms success.
Tool Selection Warning
CRITICAL: When modifying spreadsheets that contain existing formulas you need to preserve:
- ✅ Use openpyxl (
load_workbook()thensave()) - formulas remain as strings - ❌ Avoid pandas (
to_excel()) - converts formulas to static values permanently
If a user mentions "formula errors" or needs dynamic relationships between cells, always choose openpyxl.
Pre-Formula Checklist
Before writing any formulas:
- Verify data types: Load with pandas or openpyxl to confirm column content types match formula expectations (e.g., datetime vs integers)
- Test sample values: Check 2-3 source cells manually before building full formula set
- Plan cell references: Decide which parts need
$locks for copyable formulas
Step 0.5: Preserve Input File
- When modifying an existing file, copy the input to output path FIRST using
shutil.copy() - This preserves all formulas, macros, charts, conditional formatting, and unchanged data
- Only modify the designated output region after copying
Keyword and Text Pattern Matching
When searching for keywords or patterns in cell text:
- Always normalize text to lowercase and use partial substring matching (
'keyword'.lower() in cell_text.lower()) rather than exact or whole-word matching - This handles variations like "123-Core Design" matching "core design"
Read First: Examine Input File Structure
Before implementing any changes:
- List all sheets: Check available worksheets in the input file
- Compare structures: Identify columns, data types, and row counts between input and expected output
- Note transformations: Document what changed (e.g., rows consolidated, columns added/removed)
This prevents implementing incorrect logic and validates your approach.## Common Workflow - Choose tool: pandas for data, openpyxl for formulas/formatting
- Create/Load: Create new workbook or load existing file
- Modify: Add/edit data, formulas, and formatting
- Save: Write to file
- Recalculate formulas (MANDATORY IF USING FORMULAS): Use the recalc.py script
python recalc.py output.xlsx - Verify and fix any errors:
- The script returns JSON with error details
- If
statusiserrors_found, checkerror_summaryfor specific error types and locations - Fix the identified errors and recalculate again
- Common errors to fix:
#REF!: Invalid cell references#DIV/0!: Division by zero#VALUE!: Wrong data type in formula#NAME?: Unrecognized formula name
Creating new Excel files
# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])
# Add formula
sheet['B2'] = '=SUM(A1:A10)'
# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')
# Column width
sheet.column_dimensions['A'].width = 20
wb.save('output.xlsx')Clean Output Pattern for Filtering Tasks
When filtering/deleting rows based on complex criteria:
- Create a new workbook with
Workbook()instead of deleting rows in-place - Write only desired rows to the new file via cell-by-cell iteration
- This avoids residual formatting artifacts, empty rows, and broken cell references
- Ensures predictable output structure regardless of final row count
⚠️ CRITICAL: Multi-Row Deletion Safety
When deleting multiple rows, ALWAYS collect all target row indices first, then sort them in DESCENDING order before deletion:
# ✅ CORRECT - Delete in reverse order to prevent index shifting
to_delete = [11, 10, 9, 8, 7, 6] # Highest first
to_delete.sort(reverse=True)
for row_idx in to_delete:
sheet.delete_rows(row_idx)
# ❌ WRONG - Sequential deletion causes index shifting
to_delete = [6, 7, 8, 9, 10, 11] # This will delete wrong rows
for row_idx in to_delete:
sheet.delete_rows(row_idx)Deleting sequentially from top-to-bottom causes subsequent indices to shift, skipping rows or deleting wrong data.
⚠️ CRITICAL: Deleting Consecutive Rows from Top
When deleting consecutive rows starting from row 1:
# Always delete row 1 repeatedly in a loop, NOT as a range
for _ in range(num_rows_to_delete):
sheet.delete_rows(1) # Row 1 shifts down each iterationPreserve Template Structure
- Copy the input file first with
shutil.copy()before making any modifications - This preserves all formulas, macros, charts, conditional formatting, and unchanged elements automatically
- Only the designated output region needs explicit clearing and population
Preserving Other Sheets When Modifying Partially
When you only need to modify specific sheets:
from openpyxl import load_workbook
wb = load_workbook('file.xlsx')
# Modify ONLY the target sheet
sheet = wb['TargetSheet']
sheet.delete_rows(1, sheet.max_row)
for row in new_data:
sheet.append(row)
# Save without touching other sheets
wb.save('modified.xlsx')This preserves formulas, formatting, and data in all other sheets.
⚠️ CRITICAL: Deleting Multiple Rows
When deleting multiple rows, ALWAYS delete from highest row number to lowest:
rows_to_delete = [10, 15, 20] # Collect all rows first
for row in sorted(rows_to_delete, reverse=True):
sheet.delete_rows(row)Row indices shift upward when rows are deleted; bottom-up deletion preserves remaining references.
Clean Trailing Blank Rows
After filtering or row deletion, remove empty rows beyond the last data row to match expected output range:
last_row = sheet.max_row
while last_row > 1 and all(sheet.cell(row=last_row, column=c).value is None for c in range(1, sheet.max_column + 1)):
sheet.delete_rows(last_row)
last_row -= 1
```### Editing existing Excel files
```python
# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook
# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active # or wb['SheetName'] for specific sheet
# Working with multiple sheets
for sheet_name in wb.sheetnames:
sheet = wb[sheet_name]
print(f"Sheet: {sheet_name}")
# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2) # Insert row at position 2
sheet.delete_cols(3) # Delete column 3
# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')Recalculating formulas
Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided recalc.py script to recalculate formulas:
python recalc.py <excel_file> [timeout_seconds]Example:
python recalc.py output.xlsx 30The script:
- Automatically sets up LibreOffice macro on first run
- Recalculates all formulas in all sheets
- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.)
- Returns JSON with detailed error locations and counts
- Works on both Linux and macOS
Modification Validation Checklist
Before marking tasks complete:
- Read back the output file to confirm expected changes occurred
- Verify row/column counts match expectations across all sheets
- Spot-check modified cells contain correct values/formulas
- Confirm unchanged data was not accidentally modified
- Check all sheets if working with multi-sheet workbooks
- Compare output shape to expected shape (rows/columns)
- Verify all original valid data records remain intact
- Check that blank/gap rows were handled correctly per requirements
- Confirm structural properties match expectations (contiguous data, correct ranges)
This catches off-by-one errors, incomplete deletions, and unintended side effects before submission.
Multi-Range Data Processing
When working with spreadsheets containing multiple independent data sections:
- Identify range boundaries by detecting blank separator rows
- Keep headers anchored at top of each range before sorting
- Use lexicographic (string) sorting for alphanumeric data unless numeric extraction is required
- See Multi-Range Data Processing Patterns for detailed guidance
Post-Transformation Verification
- Read back target range: After any data transformation, explicitly read the exact cell range specified (e.g., C1:K6) to verify all expected data appears correctly before marking complete
- Confirm dimensions: Check that transformed data has expected row/column counts with no extra/missing entries
- Validate content: Ensure all expected data appears in correct positions
- Check boundaries: Verify no extra/missing data outside intended range
This catches off-by-one errors, incorrect ranges, or partial transformations.## Formula Verification Checklist
Quick checks to ensure formulas work correctly:
Dual verification: Read cells twice - once with
data_only=Falseto confirm formulas exist, then withdata_only=Trueto verify computed values match expectationsImmediate recalculation: Run
recalc.pyright after writing formulas, before proceeding to other tasks. Early error detection saves debugging time.Complex lookups: For multi-condition counts with cross-references, use SUMPRODUCT+INDEX/MATCH instead of COUNTIFS alone
Type consistency for comparisons: Normalize key columns before equality checks
- ID/text columns:
.astype(str).str.strip() - Numeric columns:
.astype(float) - Prevents false negatives from type mismatches (e.g., "9963547" vs 9963547)
- ID/text columns:
Inspect full structure first: Read ALL sheets and their shapes/indexes before any modifications
Analyze multi-column dependencies first: Before implementing formulas, trace how each column relates to others and identify whether operations are local (current row) or global (across all rows)
Distinguish scope clearly: Note phrases like "entire column" or "across all rows" that indicate global operations vs. row-local operations
Validate against sample data: When example input/output pairs exist, reverse-engineer the exact calculation method by testing your formula on known cases
Check methodology assumptions: Verify you're using the correct approach (e.g., FIFO vs LIFO, straight-line vs declining balance) by comparing intermediate results
Multi-cell formula check: After bulk formula assignment, verify 2-3 cells at different positions contain DIFFERENT formula strings (not identical copies)
Search all sheets: When aggregating data by pattern, enumerate ALL sheets and search each one—don't assume patterns exist only in seemingly relevant sheets
Validate item counts: Before placing summary rows, verify discovered item count matches expectations to prevent off-by-one placement errors
Verify full data range: Before writing range-based formulas, programmatically identify the last non-empty row/column to avoid excluding valid data
No hardcoded range ends: Never assume range boundaries; scan data to find actual extents
Range transformation consistency: When modifying a specified range, apply transformations to ALL cells unless explicitly instructed otherwise. Do not filter cells based on content patterns
Edge case testing: Verify transformation behavior on headers, empty strings, and unexpected values before applying broadly### Essential Verification
Test 2-3 sample references: Verify they pull correct values before building full model
Column mapping: Confirm Excel columns match (e.g., column 64 = BL, not BK)
Row offset: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6)
Task requirement parsing: Re-read task descriptions for dependency indicators (e.g., "count from entire column" vs. "for this row")
openpyxl formula anchoring: Remember openpyxl writes formulas literally—no auto-adjustment of relative references when assigning to multiple cells
Missing sheets: Data patterns may appear in unexpected sheets (EXPRO, CMSN, CSTR)—search all sheets systematically
Formula vs value confusion: Verify whether output requires
=SUM(...)formulas or computed numeric values before writing totalsData mapping validation: Count filtered/source items BEFORE writing; ensure destination range matches expected count
Task requirement verification: Confirm whether sequential filling or ID-based matching is required before implementing
LibreOffice compatibility: Avoid SUMPRODUCT with boolean arrays (e.g.,
--(A:A>0)), complex nested arrays, dynamic array functions (INDEX/SMALL/IF combinations), or Excel-specific functions that may fail in LibreOfficeMatrix axis orientation: When populating matrices by numeric dimensions (Impact/Likelihood scales), verify whether values increase or decrease along each axis before mapping. Check existing labels, headers, or partial data to confirm direction.
Cross-validate with partial data: Use any pre-existing template structure or partially filled cells as validation hints for expected cell-to-value mappings
Persistent formula errors: If recalc.py shows repeated errors despite fixes, consider computing values in Python instead
Algorithmic reset logic: Ensure you track what resets (counter, reference value, etc.) and what persists across restart conditions
Multi-condition filtering: When filtering by multiple possible values in one dimension (e.g., matching any of several users), use SUMPRODUCT with boolean multiplication instead of trying to combine SUMIFS calls
Range boundary validation: Print first/last cell of computed range to confirm correct dimensions before processing
Numeric type variations: Check for all zero representations - use
isinstance(val, (int, float)) and val == 0for numeric zeros AND'0'for string zeros. Excel may store the same conceptual value as different Python types.### Common PitfallsNaN handling: Check for null values with
pd.notna()Far-right columns: FY data often in columns 50+
Multiple matches: Search all occurrences, not just first
Division by zero: Check denominators before using
/in formulas (#DIV/0!)Wrong references: Verify all cell references point to intended cells (#REF!)
Cross-sheet references: Use correct format (Sheet1!A1) for linking sheets
Target range validation: After transformations, verify results specifically within the required answer_position range rather than assuming global correctness
Boundary checks: Edge cases may behave differently at range boundaries; explicitly validate those cells
Segment-based verification: For boundary-dependent calculations, sample across different segment types (first segment, short segments, long segments, edge cases)
Boundary marker checks: Verify all boundary positions were correctly identified before processing segments
Case-insensitive matching: Use
ISNUMBER(SEARCH("pattern",cell))instead of UPPER()/LOWER() for substring checksVerify calculated values: Load with
data_only=Trueto confirm formulas produce expected results, not just formula stringsOR logic pattern: For multiple criteria in one field, sum separate SUMIFS calls rather than array formulas
Dynamic formula patterns: For row-level operations, use f-strings to embed current cell references (e.g.,
f'=COUNTIF($A$2:A{i}, A{i})') so each row references its own positionVerify with same library: After writing with openpyxl, verify using openpyxl cell-by-cell, not pandas (different parsing behavior)
Complete row validation: For batch operations, test first few rows THEN validate every remaining row to catch off-by-one errors or late-row misses### Formula Testing Strategy
Start small: Test formulas on 2-3 cells before applying broadly
Verify dependencies: Check all cells referenced in formulas exist
Test edge cases: Include zero, negative, and very large values
Post-Recalculation Verification (REQUIRED)
After running recalc.py, ALWAYS verify the file contains calculated values:
from openpyxl import load_workbook
# Load with data_only=True to read calculated values
wb = load_workbook('output.xlsx', data_only=True)
sheet = wb.active
# Check key cells have numbers, not formula strings
print(sheet['B10'].value) # Should show 5000, not '=SUM(B2:B9)'If cell values appear as formula text (e.g., "=SUM(...)") instead of numbers, you skipped the recalculation step.
Verification After Row Removal
After removing rows based on duplicate detection or filtering:
- Load the original file's reference column separately
- Scan the output file's target column for any remaining matches
- Zero matches confirms correct deletion
This verification applies to any deduplication, exclusion, or filtering task where certain values must not persist in output.
When Errors Persist After Fixes
If recalc.py returns errors that cannot be corrected through formula adjustments:
- Identify which formulas cause repeated failures
- For those specific cells, compute values in Python instead
- Write the computed result as a value (not a formula string)
- Add a comment noting: "Computed externally due to formula engine limitation"
This preserves task completion while respecting spreadsheet engine constraints.### Interpreting recalc.py Output
The script returns JSON with error details:
{
"status": "success", // or "errors_found"
"total_errors": 0, // Total error count
"total_formulas": 42, // Number of formulas in file
"error_summary": { // Only present if errors found
"#REF!": {
"count": 2,
"locations": ["Sheet1!B5", "Sheet1!C10"]
}
}
}Advanced Formula Checks
- Modern function prefixes: All AGGREGATE/FILTER/UNIQUE/XLOOKUP use
_xlfn.prefix - Extended ranges: Dynamic formulas use ranges exceeding current data size (10x buffer)
- ArrayFormula objects: Array formulas created via
ArrayFormula()not string assignment - Edge case testing: Formulas tested with empty cells, duplicates, boundary conditions
Logical Validation Before Implementation
- Verify against task examples: If task provides numerical examples, manually calculate through at least 2 iterations to confirm formula logic matches expected output
- Distinguish accumulation types: Confirm whether column requires simple running sum (
=SUM($A$1:A1)) or compound calculation (=(prev_total+new_input)*(1+rate)) - these produce different results - Test recurrence relations: For period-to-period dependent calculations, trace through 2-3 periods manually before applying formula across entire column
- Check edge cases: Verify first row (no previous value), zero inputs, and negative values handle correctly
Post-Construction Validation
- Spot-check formulas: After writing formulas to cells, read back 2-3 cells and verify the formula string matches intent
- Recalculate immediately: Run
recalc.pyright after formula insertion to catch errors before continuing - Cross-reference values: Pick one row, manually compute what the result should be, confirm formula produces same value
Dependency Compatibility Checks
- Verify lookup sources: Ensure columns referenced by formulas don't contain _xlfn.* or other unsupported functions
- Test in target environment: Run recalc.py early to catch #NAME? errors from incompatible functions
- Check array function dependencies: UNIQUE, FILTER, SEQUENCE may fail in LibreOffice - compute values directly if needed
- Confirm output type: Verify whether task requires formulas ('=SUM(...)') or calculated values (plain numbers/text)
Output Location Verification
- Verify answer_position: Parse the full sheet name and range (e.g., 'MyResult'!A1:O20 means sheet='MyResult', range='A1:O20')
- Check sheet existence: If specified sheet doesn't exist, CREATE it before writing
- Never substitute sheets: Do not write to similarly-named sheets (e.g., 'Result' ≠ 'MyResult')
Data Ordering Requirements
- Check for merge/join operations: When combining datasets from multiple sources, verify if output must be sorted
- Identify primary key: Look for ID columns or logical sort keys in the expected output
- Sort before saving: Apply sort by primary key (ascending numerical order typically expected)
Aggregation and Grouping
- Track original row order: Add helper column with indices before
groupby() - Sort by index after aggregation: Restore expected output sequence
- Beware of reordering: Simple
drop_duplicates()orgroupby().first()may change row sequence arbitrarily
Post-Save Verification
- After completing modifications, reload the saved file and verify:
- Key cells contain expected values/formulas
- Formatting was preserved correctly
- No unintended changes to unmodified sheets
- Off-by-one errors in row/column positions
LibreOffice-Specific Limitations
- Avoid unsupported functions: Some Excel 365 dynamic array functions don't work in LibreOffice command-line mode
- Simplify complex lookups: Use INDEX/MATCH instead of complex nested array formulas when possible
- Test before scaling: Verify 2-3 cells calculate correctly with recalc.py before applying formulas broadly
- Check for empty results: If recalc.py returns no errors but cells are empty, the formula may not be supported by LibreOffice
Fallback Strategy for Persistent Errors
If recalc.py reports errors after 2+ fix attempts:
- Stop trying to fix the formula
- Calculate the value programmatically in Python
- Write the result as a direct cell value (not a formula)
- Document why you switched to hardcoded calculation
This ensures task completion when cross-platform formula incompatibilities prevent ideal approaches.
Best Practices
Library Selection
Library Selection
- pandas: Data analysis, bulk filtering/deletion by condition, statistical analysis, large datasets (10k+ rows), complex transformations (groupby, aggregate, sort)
- openpyxl: Complex formatting, formula preservation, cell-by-cell modifications, visual styling, row/column deletions
- Combine both: Use pandas for data manipulation then openpyxl for writing back to preserve sheets, formatting, and formulas
- Critical: Use openpyxl (not pandas) when cell formatting (fonts, colors, borders, number formats) must be preserved across transformations
- openpyxl mandatory for cell-level styling: When modifying specific cells AND applying formatting (alignment, fonts, fills), always use openpyxl. Pandas cannot modify individual cell properties
- Rule of thumb: Use pandas for "what data" questions, openpyxl for "how it looks" questions
Library Selection Decision Tree
New file creation with formulas/formatting: openpyxl
Data analysis/visualization: pandas
**Modifying existing file **(preserve formatting): openpyxl with load_workbook()
Bulk data export without formatting concerns: pandas
User requests VBA: Implement in Python/openpyxl instead
Error value detection: Excel errors like '#N/A', '#REF!' are stored as plain strings. Check with
isinstance(cell.value, str) and cell.value.startswith('#')rather than relying on type checksDelete rows bottom-up: When deleting multiple rows scattered through the sheet, always iterate from max_row downward (
range(max_row, 1, -1)) to avoid index shiftingPreserve special rows: Explicitly exclude protected rows (headers, blank rows) in your loop range rather than filtering afterwards
Data type validation: Before filtering or comparing values, validate data types match expected format. Check sample cell values with
cell.valueand verify type (datetime vs string). Type mismatches cause silent filter failures where no rows match criteriaFor row-based operations (e.g., identifying groups separated by blank rows), iterate cells directly and detect transitions between populated and empty rows
When marking maximum values in groups, use equality comparison against the max value to correctly handle ties (assign 'Y' to all members equal to max, not just first occurrence)### Working with openpyxl
Cell indices are 1-based (row=1, column=1 refers to cell A1)
Use
data_only=Trueto read calculated values:load_workbook('file.xlsx', data_only=True)Warning: If opened with
data_only=Trueand saved, formulas are replaced with values and permanently lostFor large files: Use
read_only=Truefor reading orwrite_only=Truefor writingFormulas are preserved but not evaluated - use recalc.py to update values
Data Transformation Pattern — For columns with mixed types (strings, dates, numbers):
- Check types before transforming:
isinstance(value, (datetime, date, int, float)) - Apply string transformations only to string values, preserve others unchanged
- Verify before/after samples for representative cases
- See references/data-transformation-patterns.md for detailed examples
Batch Transformation Guidelines
For tasks requiring N items per output row:
- Check divisibility: Count total source items before planning output rows
- Partial rows allowed: Fill remaining items in final row without padding
- No empty gaps: Complete each row with available data, don't skip cells
- Example: 22 items with 4-per-row → rows get [4,4,4,4,4,2] not [4,4,4,4,4,4,4]### Working with pandas
- Specify data types to avoid inference issues:
pd.read_excel('file.xlsx', dtype={'id': str}) - For large files, read specific columns:
pd.read_excel('file.xlsx', usecols=['A', 'C', 'E']) - Handle dates properly:
pd.read_excel('file.xlsx', parse_dates=['date_column'])
New Section
Advanced Patterns Reference
For detailed guidance on complex scenarios:
- Multi-section parsing: See
references/multi-section-parsing.mdfor handling sheets with multiple data sections separated by marker rows - Advanced formula patterns: See
references/advanced-formula-patterns.mdfor complex lookup formulas - Multi-range data processing: See
references/multi-range-data-processing.mdfor processing spreadsheets with multiple independent data blocks
Verification and Validation
Post-transformation Verification
After implementing spreadsheet changes:
- Create verification script: Re-read output file independently to validate results
- Display key outputs: Show values, formats, and conditional results from target cells
- Compare against expectations: Check format classifications, cell references, calculated values
- Run recalc.py: Verify no formula errors introduced by your changes
This catches subtle bugs like format misclassification, off-by-one errors, or unintended side effects.
Error Recovery Pattern
If recalc.py finds errors:
- Parse JSON output to identify error type and locations
- Fix root cause (wrong references, division by zero, type mismatches)
- Recalculate again until status shows success
- Document what was fixed for future reference
Output Verification Checklist
After completing spreadsheet modifications:
- Match answer_position range: Verify output sheet has correct row count and column range per task specification
- Spot-check transformations: Compare 2-3 sample rows between input and output to confirm logic applied correctly
- Sheet integrity: Confirm all original sheets preserved (if required) and new sheets created properly
Edge Case Validation Requirements
Before finalizing any numerical operation:
- Test with zero values in denominator/range
- Test with negative values if applicable
- Test with tied values (multiple cells with same result)
- Test with completely empty/null ranges
- Verify output format handles all cases gracefully (no crashes or exceptions)
Script Naming Rules
CRITICAL: Never name your Python scripts after standard library modules (e.g., inspect.py, os.py, sys.py, pandas.py).
When Python imports libraries, it searches the current directory first. A local file named inspect.py shadows the built-in module, causing circular import errors that break all subsequent imports.
Use unique names like process_data.py, spreadsheet_tool.py, etc.
Efficient Row Processing Patterns
For operations spanning multiple rows with conditional dependencies:
- Maintain state variables that persist across iterations (
current_a_value,group_active) - Update state when encountering new group markers, reset when hitting boundaries
- Avoid nested loops—single-pass iteration handles multi-row dependencies cleanly
Validation Scope
After applying transformations, verify results specifically within the required answer_position range rather than assuming global correctness.
Focused verification catches boundary-related bugs that full-scope checks might miss.
Verification After Structural Changes
After any row/column insertion or deletion:
- Verify final row count matches expected result
- Print/log remaining data and confirm retention criteria
- Check formulas still reference correct cells after shifts
- Recalculate with
recalc.pyto catch formula errors from shifted references
Value Transformation Pattern
When replacing multiple specific values:
- Use dictionary mapping with
.map()instead of if-elif chains - Example:
mapping = {'USA': 'United States', 'UK': 'United Kingdom'} df['country'] = df['country'].map(lambda x: mapping.get(x, x)) - Handles unmatched values gracefully (keeps original)
- More maintainable and scalable than conditional logic
Value-Based Number Formatting
For complex formatting requirements that Excel's conditional formatting cannot express:
# After recalc.py has computed values
for row in range(2, max_row + 1):
cell = sheet[f'C{row}']
value = cell.value # Read calculated value
# Determine format based on actual value characteristics
if value == int(value):
cell.number_format = '0'
elif len(str(value).split('.')[1]) <= 1:
cell.number_format = '0.0'
else:
cell.number_format = '0.00'This pattern generalizes to any scenario requiring format differentiation based on cell content.## Code Style Guidelines
IMPORTANT: When generating Python code for Excel operations:
- Write minimal, concise Python code without unnecessary comments
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
For Excel files themselves:
- Add comments to cells with complex formulas or important assumptions
- Document data sources for hardcoded values
- Include notes for key calculations and model sections
Advanced Data Patterns
Detecting Group Boundaries
When processing data organized in blocks separated by blank rows:
- Iterate through rows and detect transitions from populated to empty cells
- Break at empty cells to identify independent groups
- Common in financial reports where categories are visually separated
Example pattern:
# Detect group boundary by checking for empty cell
group_start = current_row
while sheet.cell(row=current_row, column=col).value is not None:
current_row += 1
group_end = current_row - 1Handling Tied Maximums
When marking highest values in a group (e.g., tournament scoring):
- Compute the maximum value for the group first
- Assign marker to ALL members whose value equals the maximum
- Do NOT use index-based selection which only captures the first occurrence
Example pattern:
max_val = max(group_values)
for idx, val in enumerate(group_values):
if val == max_val: # Equality handles ties correctly
sheet.cell(row=row_idx + idx, column=mark_col).value = 'Y'New Section
Cross-Sheet Data Matching
When matching or joining data across multiple sheets:
- Use composite keys (multiple columns) when single identifiers may have duplicates
- Pre-build lookup dictionaries/indexes for O(1) access instead of repeated searches
- Leave cells empty when source data legitimately has no match; do not force incorrect values
See references/cross-sheet-matching-patterns.md for detailed examples.