So I wanted to make an HTML page of just the titles of my blog items, with the titles hyperlinked. Here’s a solution in PowerShell:
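# Load the WordPress export as an XML document
[xml]$xml = Get-Content 'wordpress.xml'
# Keep only the title and link of each item
$items = $xml.rss.channel.item | Select-Object title,link
foreach ($item in $items)
{
    # Emit one hyperlinked title per paragraph
    $s = '<p><a href="' + $item.link + '">' + $item.title + '</a></p>'
    echo $s
}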
I like how the XML handling is just woven into the fabric.
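For comparison, here is a rough Python sketch of the same title-link page, using the standard library's xml.etree.ElementTree. The inline sample feed and the `title_links` helper are made up for illustration and are not part of the original script; a real WordPress export carries many more elements. Unlike the PowerShell version, this sketch also HTML-escapes the title and link.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical miniature feed standing in for the real WordPress export.
SAMPLE = """<rss><channel>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Q &amp; A</title><link>http://example.com/2</link></item>
</channel></rss>"""

def title_links(xml_text):
    """Return one '<p><a href=...>title</a></p>' string per RSS item."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    lines = []
    for item in root.findall('./channel/item'):
        title = item.findtext('title', default='')
        link = item.findtext('link', default='')
        # Escape both values so characters like '&' stay valid in HTML.
        lines.append('<p><a href="%s">%s</a></p>'
                     % (escape(link, quote=True), escape(title)))
    return lines

for line in title_links(SAMPLE):
    print(line)
```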
That said, the XML file that WordPress exports is, as I just discovered to my chagrin, not actually XML. The comments contain all sorts of junk that chokes an XML parser. I couldn’t find an example of a multiline non-greedy regular expression search-and-replace in PowerShell, so I stripped out the comments using Python:
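import re
# Read the whole export so the pattern can span line breaks
with open('wordpress.2007-11-09.xml') as f:
    s = f.read()
# Non-greedy match; re.DOTALL lets '.' match newlines too
pat = re.compile('<wp:comment>.+?</wp:comment>', re.DOTALL)
s = pat.sub('', s)
# Write the stripped text out, closing the file so it is flushed
with open('wordpress.xml', 'w') as f:
    f.write(s)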
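To see why the non-greedy `.+?` matters here, a small sketch may help. The sample string below is invented; it only assumes, as the pattern above does, that comments are wrapped in `<wp:comment>` tags and can span lines.

```python
import re

# Made-up sample: two comments, one of them multi-line, around content to keep.
sample = (
    "<item>\n"
    "<wp:comment>first\nline two</wp:comment>\n"
    "<title>keep me</title>\n"
    "<wp:comment>second</wp:comment>\n"
    "</item>"
)

non_greedy = re.compile(r'<wp:comment>.+?</wp:comment>', re.DOTALL)
greedy = re.compile(r'<wp:comment>.+</wp:comment>', re.DOTALL)

# Non-greedy: each comment is removed on its own; the title survives.
stripped = non_greedy.sub('', sample)

# Greedy: one match runs from the first opening tag to the last closing tag,
# so the title between the two comments is removed as well.
over_stripped = greedy.sub('', sample)

print(stripped)
print(over_stripped)
```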
Mapping idioms from one language to another is such an interesting problem. I’ve always imagined a kind of Rosetta Stone of patterns. It would contain patterns like multiline non-greedy regular expression search-and-replace and then you could map examples from any language into those patterns. Does any resource on the web approximate that kind of pattern vocabulary?
November 9, 2007 at 3:23 pm
Actually, non-greedy matches use the same notation in PowerShell as they do in the Python example. For example:
PS (9) > 'afoobfooc' -replace '.+?',''
abc
The PowerShell equivalent of the Python code would probably be something like:
$modified = (gc wordpress.xml) -replace '<wp:comment>.+?</wp:comment>',''
$modified | out-file -encoding ascii wordpress.xml # save as ascii
If saving the file as Unicode is acceptable, then the last line becomes:
$modified > wordpress.xml
As an aside, PowerShell just uses .NET regular expressions. The syntax for these expressions is documented at:
http://msdn2.microsoft.com/en-us/library/1400241x.aspx
-bruce
=====================================
Bruce Payette [MSFT]
Principal Developer, Windows PowerShell
Microsoft Corporation.
November 9, 2007 at 5:22 pm
It’s a multiline match, though. I know how to do it this way in IronPython using the .NET regex system:
import clr
from System.Text.RegularExpressions import Regex, RegexOptions
xml = open('wordpress.xml').read()
re = Regex('<wp:comment>.+?</wp:comment>',RegexOptions.Singleline)
xml = re.Replace(xml,'')
Can you specify the regex option natively in PowerShell or would you need to construct a .NET regex object in similar fashion?
(And…why Singleline?)
November 9, 2007 at 8:19 pm
The options for a .NET regular expression can be specified as part of the pattern itself. For example:
'(?s)abd'
specifies a pattern with Singleline turned on. (This works in Python too, of course.) The documentation for this is at:
http://msdn2.microsoft.com/en-us/library/yd1hzczs.aspx
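As a small illustration of the inline option on the Python side, where `(?s)` turns on re.DOTALL (the counterpart of .NET's Singleline), here is a sketch with an invented sample string. It also shows why the option is needed: without it, `.` does not match a newline.

```python
import re

# Made-up text with a comment that spans a line break.
text = "<wp:comment>spans\ntwo lines</wp:comment>after"

# Without the flag, '.' stops at the newline, so nothing matches
# and the text comes back unchanged.
without_flag = re.sub(r'<wp:comment>.+?</wp:comment>', '', text)

# '(?s)' at the start of the pattern lets '.' match newlines too.
with_flag = re.sub(r'(?s)<wp:comment>.+?</wp:comment>', '', text)

print(without_flag == text)
print(with_flag)
```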
Alternatively you could do the same thing in PowerShell that you did in the Python code:
$re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline
However, the other thing you’re going to run into with gc (Get-Content) is that, in PowerShell V1, Get-Content always splits the file into lines. To work around this, do
$xml = [io.file]::ReadAllText((resolve-path $path)) -replace '(?s)…pattern….',''
$xml | out-file -enc ascii
(We’re planning to fix this issue in the next release.)
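The effect of line splitting can be sketched in Python. This is only an analogy to the Get-Content behavior described above, using invented sample text: applied one line at a time, the pattern never sees a whole multi-line comment, while applied to the full text it does.

```python
import re

pattern = re.compile(r'(?s)<wp:comment>.+?</wp:comment>')

# Made-up content in which a comment spans two lines.
text = "before\n<wp:comment>spans\ntwo lines</wp:comment>\nafter\n"

# Line at a time: no single line holds both the opening and the closing tag,
# so no replacement happens and the text is reassembled unchanged.
per_line = ''.join(
    pattern.sub('', line) for line in text.splitlines(keepends=True)
)

# Whole text at once: the pattern can match across the line break.
whole = pattern.sub('', text)

print(per_line == text)
print(repr(whole))
```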
-bruce
November 10, 2007 at 2:00 pm
“a kind of Rosetta Stone of patterns”: http://www.rosettacode.org/ . Small but growing. Your example would fit in fine, I think.
November 12, 2007 at 9:04 am
> $re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline
Got it. That’s nice and succinct, thanks!
November 12, 2007 at 9:10 am
> http://www.rosettacode.org/
Excellent!
November 12, 2007 at 10:19 am
[...] this example I used a combination of PowerShell and Python because each afforded convenient access to a familiar [...]
November 12, 2007 at 2:54 pm
[...] Udell is messing with PowerShell again. I really wish I had time to get into this, and still think a Javascript-based open source [...]
January 31, 2009 at 5:20 pm
I wrote something similar, but you cover the topic in more depth.
February 19, 2009 at 10:56 pm
You have time to write a half-page post, but not to reply? Nice.
February 23, 2009 at 4:58 am
Well done, author :) Tell me, does this blog have an RSS feed?
February 26, 2009 at 5:28 am
Author, are you by any chance from Moscow?
January 26, 2010 at 10:51 am
[...] This post was inspired by this discussion. [...]