<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: PowerShell data munging</title>
	<atom:link href="http://blog.jonudell.net/2007/10/31/powershell-data-munging/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/</link>
	<description>Strategies for Internet citizens</description>
	<lastBuildDate>Sat, 11 Feb 2012 19:45:11 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: cabalamat</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75963</link>
		<dc:creator><![CDATA[cabalamat]]></dc:creator>
		<pubDate>Mon, 05 Nov 2007 17:55:11 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75963</guid>
		<description><![CDATA[Regarding comment #9,

in Python, if you have (r1) as an object and set up the right arguments, you can also do:


r1.f1(a1, a2).f2(a3, a4).f3(a5, a6)
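
A minimal sketch of how such chaining can work, assuming each method returns the object itself so that the next call has something to bind to. The class Chain and the arithmetic inside f1, f2 and f3 are hypothetical, chosen only to mirror the names above:

```python
class Chain(object):
    """Hypothetical wrapper whose methods return self so calls can be chained."""

    def __init__(self, value):
        self.value = value

    def f1(self, a1, a2):
        self.value = self.value + a1 + a2
        return self  # returning self is what makes .f2(...) possible

    def f2(self, a3, a4):
        self.value = self.value * a3 * a4
        return self

    def f3(self, a5, a6):
        self.value = self.value - a5 - a6
        return self


r1 = Chain(1)
result = r1.f1(2, 3).f2(4, 5).f3(6, 7).value
```

The trade-off is that every step has to be a method on the same kind of object, whereas a pipe can hand its value to any function.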


&lt;i&gt;But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.&lt;/i&gt;

I think for Python, you&#039;re right.]]></description>
		<content:encoded><![CDATA[<p>Regarding comment #9,</p>
<p>in Python, if you have (r1) as an object and set up the right arguments, you can also do:</p>
<p>r1.f1(a1, a2).f2(a3, a4).f3(a5, a6)</p>
<p><i>But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.</i></p>
<p>I think for Python, you&#8217;re right.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: PowerShell data munging, revisited &#171; Jon Udell</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75922</link>
		<dc:creator><![CDATA[PowerShell data munging, revisited &#171; Jon Udell]]></dc:creator>
		<pubDate>Mon, 05 Nov 2007 15:43:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75922</guid>
		<description><![CDATA[[...] can be dicey to invite comparisons between programming languages, as I did last week in an entry on data munging with PowerShell. But in this case, although I didn&#8217;t at first articulate very well what I found interesting [...]]]></description>
		<content:encoded><![CDATA[<p>[...] can be dicey to invite comparisons between programming languages, as I did last week in an entry on data munging with PowerShell. But in this case, although I didn&#8217;t at first articulate very well what I found interesting [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Udell</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75423</link>
		<dc:creator><![CDATA[Jon Udell]]></dc:creator>
		<pubDate>Sun, 04 Nov 2007 16:28:45 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75423</guid>
		<description><![CDATA[Posting this for Wai Yip Tung:

&lt;code&gt;
import csv
import re
from operator import itemgetter as select
from itertools import groupby
from pprint import pprint
def get_showhits(filename):
&#160;&#160;regex = re.compile(&#039;/hanselminutes_(\d+).*&#039;)
&#160;&#160;csv_file = csv.reader(open(filename))
&#160;&#160;header = csv_file.next()
&#160;&#160;shows = map( select(0,1), csv_file ) 
&#160;&#160;shows = [(File, int(Hits), regex.match(File).group(1)) for File,Hits in shows]
&#160;&#160;shows.sort(key=select(2))
&#160;&#160;shows_groups = groupby(shows, select(2))
&#160;&#160;showOutput = [(k, sum(map(select(1), g))) for k, g in shows_groups]
&#160;&#160;return showOutput
showOutput = get_showhits(&quot;stats.csv&quot;)
showOutput = sorted(showOutput, key=select(1), reverse=True)
pprint(showOutput)
&lt;/code&gt;

He adds:

I think it mimics the PowerShell script fairly closely. On the other hand,
I can cut 1 or 2 lines out if I just go with the Python way. Of course,
Python does not have a Table data type, so it cannot refer to a column by name.
I think it is good enough to refer to a column by number, such as select(1),
etc.
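
That said, the standard library's csv.DictReader yields each row as a dict keyed by the header, so itemgetter can pick columns by name as well as by position. A small sketch, assuming a File,Hits header as in the script above; the sample rows are made up for illustration and are not from the real stats.csv:

```python
import csv
import io
from operator import itemgetter as select

# Hypothetical sample data; the real stats.csv is not shown in this thread.
sample = io.StringIO(
    "File,Hits\n"
    "/hanselminutes_0026_lo.wma,10\n"
    "/hanselminutes_0026.mp3,5\n"
)

rows = list(csv.DictReader(sample))  # each row is a dict keyed by the header
pairs = [select("File", "Hits")(row) for row in rows]  # select columns by name
total_hits = sum(int(hits) for _, hits in pairs)
```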

PowerShell&#039;s &quot;select&quot; keyword is an inspiration. The term is so much more
intuitive than Python&#039;s little-known operator.itemgetter function.

About the pipe syntax: right now every Python expression statement produces a
value, which is just thrown away by the next statement. Perhaps
we can make some use of it. In fact, in interactive mode, the last
result is bound to the identifier &quot;_&quot;. So we can do something in the
spirit of a pipe using _, like below:

&gt;&gt;&gt; get_showhits(&quot;log.csv&quot;)
&gt;&gt;&gt; sorted(_, key=select(1) ,reverse=True)
&gt;&gt;&gt; pprint(_)
[(&#039;0026&#039;, 78173),
 (&#039;0075&#039;, 25814),
 (&#039;0076&#039;, 24626),
 (&#039;0077&#039;, 17204),
 (&#039;0076&#039;, 15796),
 (&#039;0078&#039;, 14832),
 (&#039;0078&#039;, 11058)]
&gt;&gt;&gt;

But again, I&#039;m not sure if this is a common enough code pattern.]]></description>
		<content:encoded><![CDATA[<p>Posting this for Wai Yip Tung:</p>
<p><code><br />
import csv<br />
import re<br />
from operator import itemgetter as select<br />
from itertools import groupby<br />
from pprint import pprint<br />
def get_showhits(filename):<br />
&nbsp;&nbsp;regex = re.compile('/hanselminutes_(\d+).*')<br />
&nbsp;&nbsp;csv_file = csv.reader(open(filename))<br />
&nbsp;&nbsp;header = csv_file.next()<br />
&nbsp;&nbsp;shows = map( select(0,1), csv_file )<br />
&nbsp;&nbsp;shows = [(File, int(Hits), regex.match(File).group(1)) for File,Hits in shows]<br />
&nbsp;&nbsp;shows.sort(key=select(2))<br />
&nbsp;&nbsp;shows_groups = groupby(shows, select(2))<br />
&nbsp;&nbsp;showOutput = [(k, sum(map(select(1), g))) for k, g in shows_groups]<br />
&nbsp;&nbsp;return showOutput<br />
showOutput = get_showhits("stats.csv")<br />
showOutput = sorted(showOutput, key=select(1), reverse=True)<br />
pprint(showOutput)<br />
</code></p>
<p>He adds:</p>
<p>I think it mimics the PowerShell script fairly closely. On the other hand,<br />
I can cut 1 or 2 lines out if I just go with the Python way. Of course,<br />
Python does not have a Table data type, so it cannot refer to a column by name.<br />
I think it is good enough to refer to a column by number, such as select(1),<br />
etc.</p>
<p>PowerShell&#8217;s &#8220;select&#8221; keyword is an inspiration. The term is so much more<br />
intuitive than Python&#8217;s little-known operator.itemgetter function.</p>
<p>About the pipe syntax: right now every Python expression statement produces a<br />
value, which is just thrown away by the next statement. Perhaps<br />
we can make some use of it. In fact, in interactive mode, the last<br />
result is bound to the identifier &#8220;_&#8221;. So we can do something in the<br />
spirit of a pipe using _, like below:</p>
<p>&gt;&gt;&gt; get_showhits("log.csv")<br />
&gt;&gt;&gt; sorted(_, key=select(1) ,reverse=True)<br />
&gt;&gt;&gt; pprint(_)<br />
[('0026', 78173),<br />
 ('0075', 25814),<br />
 ('0076', 24626),<br />
 ('0077', 17204),<br />
 ('0076', 15796),<br />
 ('0078', 14832),<br />
 ('0078', 11058)]<br />
&gt;&gt;&gt;</p>
<p>But again, I&#8217;m not sure if this is a common enough code pattern.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wai Yip Tung</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75202</link>
		<dc:creator><![CDATA[Wai Yip Tung]]></dc:creator>
		<pubDate>Sat, 03 Nov 2007 19:41:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-75202</guid>
		<description><![CDATA[From my understanding, I think a pipe can be defined as

expr &#124; f(a1,a2,...)  ---&gt;  f(expr, a1, a2, ...)

Sometimes we have a series of functions

r1 = f1(a1, a2)
r2 = f2(r1, b1, b2)
r3 = f3(r2, c1, c2)

This can be compacted in one line as

f3( f2( f1(a1, a2), b1, b2), c1, c2)

It looks complicated here. But we have all used this pattern and it can be very clear when used right.

With a pipe syntax, this becomes

f1(a1, a2) &#124; f2(b1, b2) &#124; f3(c1, c2)
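
Python has no such syntax built in, but the definition above can be approximated with operator overloading. A rough sketch, assuming a hypothetical Pipe helper (not part of any library) and made-up bodies for f1, f2 and f3:

```python
class Pipe(object):
    """Hypothetical helper: Pipe(f)(a1, a2) builds a stage, and
    `expr | stage` evaluates f(expr, a1, a2)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __call__(self, *args):
        # Bind the trailing arguments; the piped value is supplied later.
        return Pipe(self.func, *args)

    def __ror__(self, value):
        # Invoked for `value | stage` when value's own type
        # does not handle `|` with a Pipe on the right-hand side.
        return self.func(value, *self.args)


def f1(a1, a2):
    return a1 + a2

def f2(r1, b1, b2):
    return r1 * b1 + b2

def f3(r2, c1, c2):
    return (r2 - c1) * c2

piped = f1(1, 2) | Pipe(f2)(3, 4) | Pipe(f3)(5, 6)
nested = f3(f2(f1(1, 2), 3, 4), 5, 6)  # same computation, written inside out
```

Both expressions compute the same value; the piped form simply reads left to right in the order the functions run.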

I think this is interesting. But I&#039;m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.]]></description>
		<content:encoded><![CDATA[<p>From my understanding, I think a pipe can be defined as</p>
<p>expr | f(a1,a2,...)  ---&gt;  f(expr, a1, a2, ...)</p>
<p>Sometimes we have a series of functions</p>
<p>r1 = f1(a1, a2)<br />
r2 = f2(r1, b1, b2)<br />
r3 = f3(r2, c1, c2)</p>
<p>This can be compacted in one line as</p>
<p>f3( f2( f1(a1, a2), b1, b2), c1, c2)</p>
<p>It looks complicated here. But we have all used this pattern and it can be very clear when used right.</p>
<p>With a pipe syntax, this becomes</p>
<p>f1(a1, a2) | f2(b1, b2) | f3(c1, c2)</p>
<p>I think this is interesting. But I&#8217;m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Udell</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74833</link>
		<dc:creator><![CDATA[Jon Udell]]></dc:creator>
		<pubDate>Fri, 02 Nov 2007 21:48:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74833</guid>
		<description><![CDATA[Thanks for these examples! 

I see I wasn&#039;t at all clear about what grabbed me in Scott&#039;s example. It&#039;s not the one-linerness of his original, vs the more unpacked version that Lee showed. Rather, it&#039;s the way in which the operations of selection, grouping, and summarization are composed. It feels interestingly different to the style of Python and Ruby. But I guess I&#039;m not sure whether I think that&#039;s an essential or a superficial difference in style.]]></description>
		<content:encoded><![CDATA[<p>Thanks for these examples! </p>
<p>I see I wasn&#8217;t at all clear about what grabbed me in Scott&#8217;s example. It&#8217;s not the one-linerness of his original, vs the more unpacked version that Lee showed. Rather, it&#8217;s the way in which the operations of selection, grouping, and summarization are composed. It feels interestingly different to the style of Python and Ruby. But I guess I&#8217;m not sure whether I think that&#8217;s an essential or a superficial difference in style.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lee</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74466</link>
		<dc:creator><![CDATA[Lee]]></dc:creator>
		<pubDate>Fri, 02 Nov 2007 05:29:30 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74466</guid>
		<description><![CDATA[Jim Culbert: PowerShell lets you whittle scripts down to line noise, too :)

http://www.leeholmes.com/blog/WillItPipeBrevityAndReadability.aspx]]></description>
		<content:encoded><![CDATA[<p>Jim Culbert: PowerShell lets you whittle scripts down to line noise, too :)</p>
<p><a href="http://www.leeholmes.com/blog/WillItPipeBrevityAndReadability.aspx" rel="nofollow">http://www.leeholmes.com/blog/WillItPipeBrevityAndReadability.aspx</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jim Baker</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74435</link>
		<dc:creator><![CDATA[Jim Baker]]></dc:creator>
		<pubDate>Fri, 02 Nov 2007 04:20:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74435</guid>
		<description><![CDATA[Besides the whitespace being butchered by WordPress, I noticed that it removed the less-than, show, greater-than in the regex, which is used in match.group(&#039;show&#039;). Feel free to contact me for the original source code.]]></description>
		<content:encoded><![CDATA[<p>Besides the whitespace being butchered by WordPress, I noticed that it removed the less-than, show, greater-than in the regex, which is used in match.group('show'). Feel free to contact me for the original source code.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jim Baker</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74422</link>
		<dc:creator><![CDATA[Jim Baker]]></dc:creator>
		<pubDate>Fri, 02 Nov 2007 04:09:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74422</guid>
		<description><![CDATA[I suspect this is not in the spirit of Scott&#039;s example - one-liners are just not terribly Pythonic. (And rather hard to do in Python too! Sorry, golfers.) But it&#039;s clear, concise, and robust:



#!/usr/bin/env python -u
import collections, csv, re, sys

reader = csv.reader(sys.stdin)  # use python -u on windows to ensure
writer = csv.writer(sys.stdout) # that binary mode is used
show_hits = collections.defaultdict(int) # just like a perl auto-vivify
header = reader.next() # consume the first row into the header
show_re = re.compile(r&#039;&#039;&#039;
   .*?_ # non-greedy match up to the show code - simplifies the regex
   (?P&lt;show&gt;\d{4}) # which is a 4-digit show code (or relax as desired)
&#039;&#039;&#039;, re.VERBOSE)
for row in reader:
    data = dict(zip(header, row)) # re-express each row as a record dict
    match = show_re.search(data[&#039;File&#039;])
    if match: # and if no match, perhaps throw an exception...
        show_hits[match.group(&#039;show&#039;)] += int(data[&#039;Hits&#039;])
writer.writerow((&#039;Show&#039;, &#039;Hits&#039;))
for show, hits in sorted(show_hits.iteritems()):
    writer.writerow((show, hits))

]]></description>
		<content:encoded><![CDATA[<p>I suspect this is not in the spirit of Scott&#8217;s example &#8211; one-liners are just not terribly Pythonic. (And rather hard to do in Python too! Sorry, golfers.) But it&#8217;s clear, concise, and robust:</p>
<p>#!/usr/bin/env python -u<br />
import collections, csv, re, sys</p>
<p>reader = csv.reader(sys.stdin)  # use python -u on windows to ensure<br />
writer = csv.writer(sys.stdout) # that binary mode is used<br />
show_hits = collections.defaultdict(int) # just like a perl auto-vivify<br />
header = reader.next() # consume the first row into the header<br />
show_re = re.compile(r'''<br />
   .*?_ # non-greedy match up to the show code - simplifies the regex<br />
   (?P&lt;show&gt;\d{4}) # which is a 4-digit show code (or relax as desired)<br />
''', re.VERBOSE)<br />
for row in reader:<br />
    data = dict(zip(header, row)) # re-express each row as a record dict<br />
    match = show_re.search(data['File'])<br />
    if match: # and if no match, perhaps throw an exception...<br />
        show_hits[match.group('show')] += int(data['Hits'])<br />
writer.writerow(('Show', 'Hits'))<br />
for show, hits in sorted(show_hits.iteritems()):<br />
    writer.writerow((show, hits))</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74403</link>
		<dc:creator><![CDATA[Tim]]></dc:creator>
		<pubDate>Fri, 02 Nov 2007 02:44:17 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74403</guid>
		<description><![CDATA[You ought to reference or at least check out Tim Bray&#039;s &quot;Wide Finder&quot; project (at www.ongoing.com) which is looking at parallelized implementations of log analysis. Many different languages have had samples submitted.

As he said - this isn&#039;t major &quot;Application Architecture&quot; stuff but it&#039;s the kind of thing people have to do every day.]]></description>
		<content:encoded><![CDATA[<p>You ought to reference or at least check out Tim Bray&#8217;s &#8220;Wide Finder&#8221; project (at <a href="http://www.ongoing.com" rel="nofollow">http://www.ongoing.com</a>) which is looking at parallelized implementations of log analysis. Many different languages have had samples submitted.</p>
<p>As he said &#8211; this isn&#8217;t major &#8220;Application Architecture&#8221; stuff but it&#8217;s the kind of thing people have to do every day.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jim Culbert</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74187</link>
		<dc:creator><![CDATA[Jim Culbert]]></dc:creator>
		<pubDate>Thu, 01 Nov 2007 16:54:17 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74187</guid>
		<description><![CDATA[These types of problems are like mosquito bites for me - I can&#039;t stop scratching at them. Here is a better Ruby one-liner. It doesn&#039;t print a header, but that&#039;s not a big deal...

CSV.read(&quot;test.csv&quot;).inject(Hash.new(0)) {&#124;h,row&#124; h[row[0][/\d{4}/].to_i] += row[1].to_i;h}.sort.each {&#124;i&#124; puts(&quot;#{i[0]}\t#{i[1]}&quot;)}]]></description>
		<content:encoded><![CDATA[<p>These types of problems are like mosquito bites for me &#8211; I can&#8217;t stop scratching at them. Here is a better Ruby one-liner. It doesn&#8217;t print a header, but that&#8217;s not a big deal&#8230;</p>
<p>CSV.read("test.csv").inject(Hash.new(0)) {|h,row| h[row[0][/\d{4}/].to_i] += row[1].to_i;h}.sort.each {|i| puts("#{i[0]}\t#{i[1]}")}</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jim Culbert</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74133</link>
		<dc:creator><![CDATA[Jim Culbert]]></dc:creator>
		<pubDate>Thu, 01 Nov 2007 14:28:19 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-74133</guid>
		<description><![CDATA[I could probably put this all on one line if you&#039;d like.

foo = Hash.new(0)
CSV.open(&quot;test.csv&quot;, &quot;r&quot;) { &#124;row&#124; foo[row[0][/\d{4}/].to_i] += row[1].to_i}
puts &quot;Name\t\tHits&quot;
foo.sort.each { &#124;i&#124; puts &quot;#{i[0]}\t\t#{i[1]}&quot; }

Translation...
Create a hash with a default value of integer 0.

For each row in the CSV file, pull the four-digit string from the file name (first column), convert it to an integer, and use it to index the hash. Add the (integer-converted) value in the second column to the value at that hash index.

Do some pretty printing stuff.

Doesn&#039;t quite have the same pipeliney flavor as the PowerShell example...]]></description>
		<content:encoded><![CDATA[<p>I could probably put this all on one line if you&#8217;d like.</p>
<p>foo = Hash.new(0)<br />
CSV.open("test.csv", "r") { |row| foo[row[0][/\d{4}/].to_i] += row[1].to_i}<br />
puts "Name\t\tHits"<br />
foo.sort.each { |i| puts "#{i[0]}\t\t#{i[1]}" }</p>
<p>Translation&#8230;<br />
Create a hash with a default value of integer 0.</p>
<p>For each row in the CSV file, pull the four-digit string from the file name (first column), convert it to an integer, and use it to index the hash. Add the (integer-converted) value in the second column to the value at that hash index.</p>
<p>Do some pretty printing stuff.</p>
<p>Doesn&#8217;t quite have the same pipeliney flavor as the PowerShell example&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rick Morrison</title>
		<link>http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-73948</link>
		<dc:creator><![CDATA[Rick Morrison]]></dc:creator>
		<pubDate>Wed, 31 Oct 2007 22:57:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.jonudell.net/2007/10/31/powershell-data-munging/#comment-73948</guid>
		<description><![CDATA[Hey Jon 

You may want to have a look at IPython, specifically the ipipe module, which can do some of these tricks:
   http://ipython.scipy.org/moin/UsingIPipe

What&#039;s more, adding support for ipipe to your own classes is relatively easy (not sure what&#039;s involved in Monad)
   http://ipython.scipy.org/moin/SupportingIPipe

Rick]]></description>
		<content:encoded><![CDATA[<p>Hey Jon </p>
<p>You may want to have a look at IPython, specifically the ipipe module, which can do some of these tricks:<br />
   <a href="http://ipython.scipy.org/moin/UsingIPipe" rel="nofollow">http://ipython.scipy.org/moin/UsingIPipe</a></p>
<p>What&#8217;s more, adding support for ipipe to your own classes is relatively easy (not sure what&#8217;s involved in Monad)<br />
   <a href="http://ipython.scipy.org/moin/SupportingIPipe" rel="nofollow">http://ipython.scipy.org/moin/SupportingIPipe</a></p>
<p>Rick</p>
]]></content:encoded>
	</item>
</channel>
</rss>

