Recovering data from a corrupted Excel spreadsheet using PowerShell


27 March 2012, by

It’s the last thing anyone wants to see at 5pm after spending the day updating an important spreadsheet:

Excel found unreadable content

I’ve had this once or twice before and Excel has always managed to recover my data perfectly. But on this occasion, Excel simply crashed half way through the recovery process – and continued to do so on all subsequent attempts.

My first thought was of course to look to my backups, but I had nothing more recent than from the previous night – not much comfort after the best part of a day’s effort. Even a Google for “Excel file recovery” didn’t do much good – I downloaded a couple of free trials, but none gave satisfactory results. One did claim to have successfully recovered my data, but did so suspiciously quickly and was unable to display a preview of what it had done. I didn’t trust it sufficiently to click the “Buy Now” button. Help!

Getting inside an Excel file

Fortunately, I use Excel 2010 and the XSLX file format (or XSLM in this case, as my spreadsheet included macros). This means that the spreadsheet is in fact a series of (mostly) XML files, packaged together into a ZIP file as a wrapper. So you can have a browse around inside your Excel file simply by renaming it from .xslx to .zip, and opening it up in your ZIP viewer of preference. I chose 7-Zip – mostly because it allowed me to open the spreadsheet as an archive, extract and replace files without even needing to do the “rename as .zip” step, which was convenient although not essential.

On the whole, the internals of the file were fairly self-explanatory – most of the data is in the “xl” subfolder:

Inside an Excel file

Editing the contents

I had a strong suspicion that the problem lay on the particular sheet I had been editing when Excel first crashed, so I tried a little experiment. The “worksheets” folder contains one XML file per Excel sheet, numbered sequentially (“sheet2.xml” was the culprit in my case). So I took my day-old backup, and again using 7-Zip extracted Sheet2.xml from the backup and used it to overwrite that sheet in the corrupted file.

Opening Excel after this I still got the “Excel found unreadable content” warning, but this time there was no crash – Excel recovered the file and managed to open successfully. Not that the result was entirely satisfactory still:

  • The recovered sheet looked a complete mess. Not everything is contained in the sheet2.xml file (more on this below), so some formatting and content had gone awry.
  • This was the sheet I’d spent most of the day working on anyway! And the same trick in reverse (copy sheet2.xml from the corrupted file to the known good backup) left me back in crash territory.

However at least this gave me an easy way to access the rest of the corrupted file, without any data loss. All I needed was to be able to extract the rest of the data from the corrupted Sheet2.xml.

Enter PowerShell

It’s easy enough to get a rough feel for the structure of the sheet2.xml file by opening it in your favourite XML viewer. The bulk of the data is stored in a <sheetData> tag, on a row-by-row then cell-by-cell basis. But that doesn’t do you much good for getting the data out in a usable format – if I’d lost just one or two cells this would be an easy way to recover them, but this was a fairly large spreadsheet.

Fortunately there are many ways to operate more flexibly on a XML file, of which my favourite is Windows PowerShell. What I particularly like about it is the way you can interactively play with your XML file – for instance to quickly count the number of rows in the spreadsheet to make sure none were missing:

$sheet = 1(Get-Content .sheet2.xml)
$sheet.worksheet.sheetData.row | Measure-Object

Interpreting the XML data

The “row” elements above look something like this:

<row r="6" x14ac:dyDescent="0.25" spans="1:178">
  <c r="A6" t="str" s="1">
    <v>Important Data</v>
  </c>
  <c r="B6" t="s" s="9">
    <v>42</v>
  </c>
</row>

My first step was to convert the data into CSV format, as a starting point to getting it into Excel and stripping out the bits I don’t need. This is simple with PowerShell:

function PrintCell() {
  process {
    $_.r + "," + $_.v
  }
}

function PrintCellsInRow() {
  process {
    $_.c | PrintCell
  }
}

$sheet.worksheet.sheetData.row | PrintCellsInRow | Out-File RecoveredCells.csv

However this wasn’t quite satisfactory – in my example above, the value “42” in cell B6 shouldn’t actually have been a number, but a string. Some important data was missing!

Fortunately I was able to find this missing data in a separate file, “sharedStrings.xml” – apparently an attempt to reduce the size of the overall file by storing reused string values separately. Any cell with t=”s” had as its value a reference to this shared strings file. A little experimenting revealed the following alternative definition for PrintCell:

$strings = 1(Get-Content ./sharedStrings.xml)

function PrintCell() {
  process {
    if ($_.t -eq "s") {
      $_.r + "," + $strings.sst.si[$_.v].innerText
    } else {
      $_.r + "," + $_.v
    }
  }
}

This process could be extended further to extract function definitions and other content from the file – but the above was sufficient for my needs.

And finally… Making the data useful

A CSV file is all very well as a proof of concept, but it gained little over the raw XML in terms of putting the data into a format from which I could easily recover my work.

While I could have written a more complex PowerShell script, I realised that there’s an even easier way to do it, using nothing more than an Excel formula. Assuming your CSV formatted data is in columns A and B of sheet CSVData, the following formula will look up the correct value for its own cell:

=VLOOKUP(ADDRESS(ROW(),COLUMN(),4), CSVData!$A:$B, 2, FALSE)

The formula can be read as:

  • Take the absolute (magic number 4) address of the current cell (row, column)
  • Look it up in the table in CSVData, looking for exact matches only (constant FALSE)
  • Take the second column (2) of the lookup table and use its value

Simply fill down and across your spreadsheet, and you have a complete recreation of your original spreadsheet, laid out just like the original – lacking formatting, formulae etc, but with the data all intact.

From here it was a simple job to copy and paste the relevant data across into my backup (which still had all the formatting and formulae that I needed, because I hadn’t changed them). Several hours of work rescued, and a useful lesson in the innards of Excel for next time it happens…

Tags: ,

Categories: Technical

«
»

One Response to “Recovering data from a corrupted Excel spreadsheet using PowerShell”

  1. Bert Leen says:

    I wanted to take an opportunity to thank “Matthew Richards” originator of this blog for helping me retrieve the excel file data from my corrupt and damaged MS office database. I was so relieved and happy to hear that all my excel file data is recovered. In return I am sharing an another tips, Use Kernel for excel file repair software to recover and repair . XLSX and .XLS files from your computer system.


Leave a Reply

* Mandatory fields


+ 4 = thirteen

Submit Comment