Processing a WordPress export file with PowerShell

So I wanted to make an HTML page of just the titles of my blog items, with the titles hyperlinked. Here’s a solution in PowerShell:

[xml]$xml = get-content 'wordpress.xml'
$items = $xml.rss.channel.item | Select-Object title,link
foreach ($item in $items)
  {
  $s = '<p><a href="' + $item.link + '">' + $item.title + '</a></p>'
  echo $s
  }

I like how the XML handling is just woven into the fabric.
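
For comparison, here is a rough Python sketch of the same title-and-link extraction, using the standard library's xml.etree.ElementTree. The function name title_links and the sample feed are made up for illustration; a real export would be read from a file instead. Unlike the PowerShell version, this sketch HTML-escapes the title and link, which is my own choice rather than anything the original does.

```python
import xml.etree.ElementTree as ET
from html import escape


def title_links(xml_text):
    """Return one '<p><a href="...">title</a></p>' line per <item> element."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    lines = []
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Escape both values so '&' or '<' in a title cannot break the HTML
        lines.append('<p><a href="{}">{}</a></p>'.format(
            escape(link, quote=True), escape(title)))
    return lines


# Hypothetical two-item feed standing in for the real export file
sample = """<rss version="2.0">
  <channel>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Q &amp; A</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

for line in title_links(sample):
    print(line)
```

It is a few more lines than the PowerShell, mostly because the element lookup is explicit rather than dotted property access.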

That said, the XML file that WordPress exports is, as I just discovered to my chagrin, not actually XML. The comments contain all sorts of junk that chokes an XML parser. I couldn’t find an example of a multiline non-greedy regular expression search-and-replace in PowerShell, so I stripped out the comments using Python:

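import re

# Read the whole export at once so the pattern can match across line breaks
with open('wordpress.2007-11-09.xml') as f:
    s = f.read()
# Non-greedy match; re.DOTALL lets '.' span the newlines inside a comment
pat = re.compile(r'<wp:comment>.+?</wp:comment>', re.DOTALL)
s = pat.sub('', s)
with open('wordpress.xml', 'w') as f:
    f.write(s)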
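
To see what "multiline" and "non-greedy" each contribute, here is a small sketch on a made-up string (the sample text is hypothetical, not taken from a real export). It runs the same substitution three ways: as above, with a greedy quantifier, and without re.DOTALL.

```python
import re

# Hypothetical fragment: one comment spans two lines, one sits on a single line
sample = ("<item>\n"
          "<wp:comment>first\nline two</wp:comment>\n"
          "keep\n"
          "<wp:comment>second</wp:comment>\n"
          "</item>")

# Non-greedy + DOTALL: each comment is removed on its own
non_greedy = re.sub(r"<wp:comment>.+?</wp:comment>", "", sample, flags=re.DOTALL)

# Greedy + DOTALL: one match runs from the first opening tag to the last
# closing tag, so the "keep" line between the comments is lost too
greedy = re.sub(r"<wp:comment>.+</wp:comment>", "", sample, flags=re.DOTALL)

# Non-greedy without DOTALL: '.' stops at a newline, so only the
# single-line comment is removed
no_dotall = re.sub(r"<wp:comment>.+?</wp:comment>", "", sample)

print(repr(non_greedy))
print(repr(greedy))
print(repr(no_dotall))
```

Only the first variant keeps the text between the comments while removing both of them.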

Mapping idioms from one language to another is such an interesting problem. I’ve always imagined a kind of Rosetta Stone of patterns. It would contain patterns like multiline non-greedy regular expression search-and-replace and then you could map examples from any language into those patterns. Does any resource on the web approximate that kind of pattern vocabulary?

14 Comments

  1. Actually non-greedy matches use the same notation in PowerShell as they do in the Python example. For example:

    PS (9) > 'afoobfooc' -replace '.+?',''
    abc

    The PowerShell equivalent of the Python code would probably be something like:

    $modified = (gc wordpress.xml) -replace '<wp:comment>.+?</wp:comment>',''
    $modified | out-file -encoding ascii wordpress.xml # save as ascii

    If saving the file as unicode is acceptable, then the last line becomes
    $modified > wordpress.xml

    As an aside, PowerShell just uses .NET regular expressions. The syntax for these expressions is documented at:

    http://msdn2.microsoft.com/en-us/library/1400241x.aspx

    -bruce

    =====================================
    Bruce Payette [MSFT]
    Principal Developer, Windows PowerShell
    Microsoft Corporation.

  2. It’s a multiline match, though. I know how to do it this way in IronPython using the .NET regex system:

    import clr
    from System.Text.RegularExpressions import Regex, RegexOptions
    xml = open('wordpress.xml').read()
    re = Regex('<wp:comment>.+?</wp:comment>',RegexOptions.Singleline)
    xml = re.Replace(xml,'')

    Can you specify the regex option natively in PowerShell or would you need to construct a .NET regex object in similar fashion?

    (And…why Singleline?)

  3. The options for a .NET regular expression can be specified as part of the pattern itself. For example:
    '(?s)abd'
    specifies a pattern with SingleLine turned on. (This works in Python too, of course.) The documentation for this is at:

    http://msdn2.microsoft.com/en-us/library/yd1hzczs.aspx

    Alternatively you could do the same thing in PowerShell that you did in the Python code:

    $re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline

    However, the other thing you’re going to run into with gc (Get-Content) is that, in PowerShell V1, Get-Content always splits the file into lines. To work around this, do

    $xml = [io.file]::ReadAllText((resolve-path $path)) -replace '(?s)…pattern….',''
    $xml | out-file -enc ascii

    (We’re planning to fix this issue in the next release.)

    -bruce
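
The two points in this thread can be sketched in Python, whose re.DOTALL flag corresponds to .NET's Singleline: both make '.' match a newline as well, treating the input as if it were a single line. (Multiline, in both libraries, instead changes where ^ and $ match.) The sample string below is made up, and the per-line loop only imitates the line splitting described above; it is not PowerShell's actual behavior.

```python
import re

# Hypothetical text with one comment that spans two lines
text = "a<wp:comment>one\ntwo</wp:comment>b"

# (?s) at the start of the pattern is the inline form of re.DOTALL
inline = re.sub(r"(?s)<wp:comment>.+?</wp:comment>", "", text)
flagged = re.sub(r"<wp:comment>.+?</wp:comment>", "", text, flags=re.DOTALL)

# Applying the same pattern line by line finds nothing to replace, because
# no single line contains both the opening and the closing tag
per_line = "\n".join(
    re.sub(r"(?s)<wp:comment>.+?</wp:comment>", "", line)
    for line in text.split("\n")
)

print(repr(inline))
print(repr(flagged))
print(repr(per_line))
```

The inline flag and the option argument give the same result, and the per-line version shows why the whole file has to be read as one string before the replace.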
