Processing a WordPress export file with PowerShell

9 Nov 2007 ~ Jon Udell

So I wanted to make an HTML page of just the titles of my blog items, with the titles hyperlinked. Here’s a solution in PowerShell:

[xml]$xml = get-content 'wordpress.xml'
$items = $xml.rss.channel.item | Select-Object title,link
foreach ($item in $items)
  {
  $s = '<p><a href="' + $item.link + '">' + $item.title + '</a></p>'
  echo $s
  }

I like how the XML handling is just woven into the fabric.
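For comparison, here is a rough Python sketch of the same idea using the standard library's ElementTree. The inline sample feed and the `title_links` helper are made up for illustration; a real run would parse the cleaned export file instead.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stand-in for the export file; with a real file you would
# use ET.parse('wordpress.xml').getroot() instead of ET.fromstring().
SAMPLE = """<rss version="2.0">
  <channel>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Q &amp; A</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def title_links(xml_text):
    """Return one '<p><a href=...>title</a></p>' line per RSS item."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter('item'):
        title = item.findtext('title', default='')
        link = item.findtext('link', default='')
        # Escape both values so titles containing & or < stay valid HTML.
        lines.append('<p><a href="%s">%s</a></p>'
                     % (escape(link, quote=True), escape(title)))
    return lines

for line in title_links(SAMPLE):
    print(line)
```

Unlike the PowerShell version, this one escapes the title and link before writing them into the markup, which seemed worth the extra line.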

That said, the XML file that WordPress exports is — I just discovered to my chagrin — not actually XML. The comments contain all sorts of junk that choke an XML parser. I couldn’t find an example of a multiline non-greedy regular expression search-and-replace in PowerShell, so I stripped out the comments using Python:

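import re
# Read the raw export, which is not yet well-formed XML.
with open('wordpress.2007-11-09.xml') as f:
    s = f.read()
# DOTALL lets '.' match newlines; '+?' is non-greedy, so each comment is removed separately.
pat = re.compile(r'<wp:comment>.+?</wp:comment>', re.DOTALL)
s = pat.sub('', s)
# Write the cleaned text to the file the PowerShell script reads.
with open('wordpress.xml', 'w') as f:
    f.write(s)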
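To show why both halves of "multiline non-greedy" matter, here is a small sketch on a made-up fragment (the sample text is an assumption, not real export data). It compares the non-greedy DOTALL pattern with a greedy one and with one that omits DOTALL.

```python
import re

# Made-up fragment with two multi-line comment blocks.
text = "A<wp:comment>\nspam\n</wp:comment>B<wp:comment>\nmore\n</wp:comment>C"

# Non-greedy + DOTALL: each comment block is removed on its own.
non_greedy = re.sub(r'<wp:comment>.+?</wp:comment>', '', text, flags=re.DOTALL)

# Greedy + DOTALL: one match runs from the first open tag to the last close tag.
greedy = re.sub(r'<wp:comment>.+</wp:comment>', '', text, flags=re.DOTALL)

# Non-greedy without DOTALL: '.' stops at newlines, so nothing matches.
no_dotall = re.sub(r'<wp:comment>.+?</wp:comment>', '', text)

print(repr(non_greedy))   # 'ABC'
print(repr(greedy))       # 'AC'
print(no_dotall == text)  # True
```

The greedy version silently eats the `B` between the two comments, which in a real export would be actual post content.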

Mapping idioms from one language to another is such an interesting problem. I’ve always imagined a kind of Rosetta Stone of patterns. It would contain patterns like multiline non-greedy regular expression search-and-replace and then you could map examples from any language into those patterns. Does any resource on the web approximate that kind of pattern vocabulary?

14 thoughts on “Processing a WordPress export file with PowerShell”

Bruce Payette [MSFT] says:

9 Nov 2007 at 3:23 pm

Actually non-greedy matches use the same notation in PowerShell as they do in the Python example. For example:

PS (9) > 'a<wp:comment>foo</wp:comment>b<wp:comment>foo</wp:comment>c' -replace '<wp:comment>.+?</wp:comment>',''
abc

The PowerShell equivalent of the Python code would probably be something like:

$modified = (gc wordpress.xml) -replace '<wp:comment>.+?</wp:comment>',''
$modified | out-file -encoding ascii wordpress.xml # save as ascii

If saving the file as unicode is acceptable, then the last line becomes:
$modified > wordpress.xml

As an aside, PowerShell just uses .NET regular expressions. The syntax for these expressions is documented at:

http://msdn2.microsoft.com/en-us/library/1400241x.aspx

-bruce

=====================================
Bruce Payette [MSFT]
Principal Developer, Windows PowerShell
Microsoft Corporation.

Jon Udell says:

9 Nov 2007 at 5:22 pm

It’s a multiline match, though. I know how to do it this way in IronPython using the .NET regex system:

import clr
from System.Text.RegularExpressions import Regex, RegexOptions
xml = open('wordpress.xml').read()
re = Regex('<wp:comment>.+?</wp:comment>',RegexOptions.Singleline)
xml = re.Replace(xml,'')

Can you specify the regex option natively in PowerShell or would you need to construct a .NET regex object in similar fashion?

(And…why Singleline?)

Bruce Payette [MSFT] says:

9 Nov 2007 at 8:19 pm

The options for a .NET regular expression can be specified as part of the pattern itself. For example:
'(?s)abd'
specifies a pattern with SingleLine turned on. (This works in Python too, of course.) The documentation for this is at:

http://msdn2.microsoft.com/en-us/library/yd1hzczs.aspx
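As a minimal Python sketch of the same inline option (the sample string is made up): `(?s)` at the start of the pattern behaves like passing `re.DOTALL`, which is Python's name for what .NET calls Singleline. The multiline flag is a different thing; it only changes where `^` and `$` match.

```python
import re

# Made-up comment block that spans two lines.
text = "<wp:comment>line one\nline two</wp:comment>"

# Inline option and explicit flag give the same result.
inline = re.sub(r'(?s)<wp:comment>.+?</wp:comment>', '', text)
flagged = re.sub(r'<wp:comment>.+?</wp:comment>', '', text, flags=re.DOTALL)
print(inline == flagged == '')  # True

# MULTILINE does not let '.' cross a newline, so the text is left unchanged.
multiline = re.sub(r'<wp:comment>.+?</wp:comment>', '', text, flags=re.MULTILINE)
print(multiline == text)  # True
```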

Alternatively you could do the same thing in PowerShell that you did in the Python code:

$re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline

However, the other thing you’re going to run into with gc (Get-Content) is that, in PowerShell V1, Get-Content always splits the file into lines. To work around this, do:

$xml = [io.file]::ReadAllText((resolve-path $path)) -replace '(?s)…pattern….',''
$xml | out-file -enc ascii

(We’re planning to fix this issue in the next release.)
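The effect of splitting the file into lines can be sketched in Python on made-up sample text. A replace applied to each line separately never sees both the opening and closing tag of a multi-line comment, so nothing is removed. Applied to the whole text at once, the same pattern works.

```python
import re

PATTERN = re.compile(r'(?s)<wp:comment>.+?</wp:comment>')

# Made-up sample: one comment block spread over three lines.
text = "keep\n<wp:comment>\njunk\n</wp:comment>\nkeep too\n"

# Line at a time, roughly what happens when the input is an array of lines.
per_line = ''.join(PATTERN.sub('', line)
                   for line in text.splitlines(keepends=True))

# Whole text at once, as with ReadAllText.
whole = PATTERN.sub('', text)

print(per_line == text)  # True: no single line contains a complete match
print(repr(whole))       # 'keep\n\nkeep too\n'
```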

-bruce

Kevin Reid says:

10 Nov 2007 at 2:00 pm

“a kind of Rosetta Stone of patterns”: http://www.rosettacode.org/ . Small but growing. Your example would fit in fine, I think.

Jon Udell says:

12 Nov 2007 at 9:04 am

> $re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline

Got it. That’s nice and succinct, thanks!

Jon Udell says:

12 Nov 2007 at 9:10 am

> http://www.rosettacode.org/

Excellent!

Pingback: Multilingual idioms « Jon Udell
Pingback: The Third Bit » Blog Archive » Link Soup Redux
Оля says:

31 Jan 2009 at 5:20 pm

I wrote something similar, but you cover the topic in more depth.

Юрий says:

19 Feb 2009 at 10:56 pm

You have time to write a half-page post, but not to reply? Nice.

Андрей says:

23 Feb 2009 at 4:58 am

Well done, author :) Tell me, does this blog have an RSS feed?

Владимир says:

26 Feb 2009 at 5:28 am

Author, are you from Moscow, by any chance?

Pingback: PowerShell: Search, Replace Text in Files « House of Blog
Ophelia Westrich says:

27 Oct 2010 at 11:33 am

Intriguing. Have been trying to learn a new language for a while so this is extremely relevant! Thanks.
