So I wanted to make an HTML page of just the titles of my blog items, with the titles hyperlinked. Here’s a solution in PowerShell:
[xml]$xml = get-content 'wordpress.xml'
$items = $xml.rss.channel.item | Select-Object title,link
foreach ($item in $items) {
    $s = '<p><a href="' + $item.link + '">' + $item.title + '</a></p>'
    echo $s
}
I like how the XML handling is just woven into the fabric.
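For comparison, here is a rough Python sketch of the same title-and-link extraction using the standard library. The `SAMPLE` document and the `title_links` helper are made up for illustration; a real WordPress export has many more fields, and the escaping step is an addition that the PowerShell version does not perform.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical miniature of an RSS export, only for illustration.
SAMPLE = """<rss version="2.0">
  <channel>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Q &amp; A</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def title_links(xml_text):
    """Return one '<p><a href=...>title</a></p>' string per RSS item."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    lines = []
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Escape both values so titles containing & or < stay valid HTML.
        lines.append('<p><a href="%s">%s</a></p>'
                     % (escape(link, quote=True), escape(title)))
    return lines

for line in title_links(SAMPLE):
    print(line)
```

The path lookup `./channel/item` plays the role of `$xml.rss.channel.item`; it is more explicit than the PowerShell property syntax, which is part of what makes the PowerShell version feel woven in.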
That said, the XML file that WordPress exports is, I just discovered to my chagrin, not actually XML. The comments contain all sorts of junk that chokes an XML parser. I couldn’t find an example of a multiline non-greedy regular expression search-and-replace in PowerShell, so I stripped out the comments using Python:
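import re

# Read the raw export as one string.
with open('wordpress.2007-11-09.xml') as f:
    s = f.read()

# re.DOTALL lets '.' match newlines; '.+?' is non-greedy, so each comment is removed on its own.
pat = re.compile('<wp:comment>.+?</wp:comment>', re.DOTALL)
s = re.sub(pat, '', s)

# Write the cleaned file; the with-block closes it so the output is flushed.
with open('wordpress.xml', 'w') as f:
    f.write(s)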
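To see why both the non-greedy quantifier and the multiline behavior matter, here is a small sketch on an invented sample string. `SAMPLE` and `strip_comments` are hypothetical names used only for this illustration.

```python
import re

# Hypothetical sample: one comment spans several lines, one sits on a single line.
SAMPLE = (
    "<item>\n"
    "<title>Post</title>\n"
    "<wp:comment>\n"
    "junk & more junk\n"
    "</wp:comment>\n"
    "<wp:comment>second</wp:comment>\n"
    "</item>\n"
)

PATTERN = r"<wp:comment>.+?</wp:comment>"

def strip_comments(text, flags=0):
    """Remove every <wp:comment>...</wp:comment> block that the pattern can match."""
    return re.sub(PATTERN, "", text, flags=flags)

# Without DOTALL, '.' stops at newlines, so only the one-line comment is removed.
without_dotall = strip_comments(SAMPLE)

# With DOTALL, '.' also matches newlines, so the multi-line comment goes too.
with_dotall = strip_comments(SAMPLE, re.DOTALL)

print("junk" in without_dotall)
print("junk" in with_dotall)
```

Because `.+?` is non-greedy, the two comments are removed as two separate matches, which leaves any text between them intact.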
Mapping idioms from one language to another is such an interesting problem. I’ve always imagined a kind of Rosetta Stone of patterns. It would contain patterns like multiline non-greedy regular expression search-and-replace and then you could map examples from any language into those patterns. Does any resource on the web approximate that kind of pattern vocabulary?
Actually, non-greedy matches use the same notation in PowerShell as they do in the Python example. For example:
PS (9) > 'a<foo>b<foo>c' -replace '<.+?>',''
abc
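The same contrast can be sketched in Python, since the `+?` notation is shared. The input string here is an assumed example chosen for illustration.

```python
import re

text = "a<foo>b<foo>c"  # hypothetical input for illustration

# Non-greedy: each <...> tag is matched on its own.
non_greedy = re.sub(r"<.+?>", "", text)

# Greedy: one match runs from the first '<' to the last '>'.
greedy = re.sub(r"<.+>", "", text)

print(non_greedy)  # abc
print(greedy)      # ac
```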
The PowerShell equivalent of the Python code would probably be something like:
$modified = (gc wordpress.xml) -replace '<wp:comment>.+?</wp:comment>',''
$modified | out-file -encoding ascii wordpress.xml # save as ascii
If saving the file as Unicode is acceptable, then the last line becomes:
$modified > wordpress.xml
As an aside, PowerShell just uses .NET regular expressions. The syntax for these expressions is documented at:
http://msdn2.microsoft.com/en-us/library/1400241x.aspx
-bruce
=====================================
Bruce Payette [MSFT]
Principal Developer, Windows PowerShell
Microsoft Corporation.
It’s a multiline match, though. I know how to do it this way in IronPython using the .NET regex system:
import clr
from System.Text.RegularExpressions import Regex, RegexOptions
xml = open('wordpress.xml').read()
re = Regex('<wp:comment>.+?</wp:comment>',RegexOptions.Singleline)
xml = re.Replace(xml,'')
Can you specify the regex option natively in PowerShell or would you need to construct a .NET regex object in similar fashion?
(And…why Singleline?)
The options for a .NET regular expression can be specified as part of the pattern itself. For example:
'(?s)abd'
specifies a pattern with Singleline turned on. (This works in Python too, of course.) The documentation for this is at:
http://msdn2.microsoft.com/en-us/library/yd1hzczs.aspx
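As a quick check of the claim that the inline option also works in Python, this sketch compares `(?s)` in the pattern with the `re.DOTALL` flag. The sample text is invented for illustration.

```python
import re

# Hypothetical sample with a comment that spans two lines.
text = "keep<wp:comment>line one\nline two</wp:comment>this"

# Option embedded in the pattern itself.
inline = re.sub(r"(?s)<wp:comment>.+?</wp:comment>", "", text)

# Same option passed as a flag argument.
flagged = re.sub(r"<wp:comment>.+?</wp:comment>", "", text, flags=re.DOTALL)

print(inline == flagged)
```

In both engines the option makes `.` match newline characters as well, which is what lets one match span several lines.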
Alternatively you could do the same thing in PowerShell that you did in the Python code:
$re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline
However, the other thing you’re going to run into with gc (Get-Content) is that, in PowerShell V1, Get-Content always splits the file into lines. To work around this, do
$xml = [io.file]::ReadAllText((resolve-path $path)) -replace '(?s)…pattern….',''
$xml | out-file -enc ascii
(We’re planning to fix this issue in the next release.)
-bruce
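The line-splitting problem can be imitated in Python. This sketch assumes the replacement is applied to each line separately, which is how an array of lines would behave; the helper names and the sample text are hypothetical.

```python
import re

PATTERN = re.compile(r"(?s)<wp:comment>.+?</wp:comment>")

def replace_per_line(text):
    """Apply the pattern to each line independently, as with an array of lines."""
    return "\n".join(PATTERN.sub("", line) for line in text.split("\n"))

def replace_whole(text):
    """Apply the pattern to the whole text as a single string."""
    return PATTERN.sub("", text)

# Hypothetical sample: the comment starts and ends on different lines.
SAMPLE = "before\n<wp:comment>\njunk\n</wp:comment>\nafter"

print(replace_per_line(SAMPLE) == SAMPLE)
print(repr(replace_whole(SAMPLE)))
```

No single line contains both the opening and closing tag, so the per-line version leaves the text unchanged even with the single-line option on. Reading the file as one string is what makes the multi-line match possible.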
“a kind of Rosetta Stone of patterns”: http://www.rosettacode.org/ . Small but growing. Your example would fit in fine, I think.
> $re = New-Object regex '<wp:comment>.+?</wp:comment>',Singleline
Got it. That’s nice and succinct, thanks!
> http://www.rosettacode.org/
Excellent!
I wrote something similar, but you have covered the topic in more depth.
You have time to write a half-page post, but not to reply? Nice.
Well done, author :) Tell me, does this blog have an RSS feed?
Author, are you by any chance from Moscow?
Intriguing. Have been trying to learn a new language for a while so this is extremely relevant! Thanks.