PowerShell data munging

31 Oct 200710 Jun 2009 ~ Jon Udell

I first wrote about PowerShell back in 2004, when it was called Monad and/or MSH. What most intrigued me was the way .NET objects could flow through its pipeline. I wouldn’t have thought then, or now, of reaching for PowerShell just to do some basic logfile processing. But Scott Hanselman’s log analysis example today got my attention.

My assumption was that Python would offer easier and more natural ways to absorb, consolidate, sort, and emit the kind of simple log data shown in Scott’s example. But when I tried to recreate that example in Python, I developed a new appreciation for PowerShell’s data munging chops. The selection, grouping, summing, and sorting capabilities — and the pervasive use of the pipeline — are potent ways to manipulate .NET objects. But they’re equally potent when you’re just munging a CSV file.

True, Python has a csv module that’s roughly equivalent to PowerShell’s Import-CSV, so you can easily slurp the data into a dictionary, taking fieldnames from the first row. But how do you perform the selection, grouping, summing, and sorting as succintly and — for lack of a better word — as composably? I didn’t find obvious answers. Of course, just because I failed doesn’t mean it can’t be done. I’d be curious to see solutions in Python (or Ruby, Perl, etc.) that people think capture the spirit of Scott’s example. I suspect there may be interesting lessons to be learned going in both directions.

PS: When I ran the script that Lee Holmes wrote based on Scott’s one-liner, I was initially puzzled to see a one-column display of names, not a two-column display of names and counts. Then I realized it was a formatting problem — the second column was running off the edge of my display. So I changed:

Get-ShowHits | Sort -Desc Hits

To:

Get-ShowHits | Sort -Desc Hits | Format-Table Name,Hits -auto

Published by Jon Udell

View all posts by Jon Udell

12 thoughts on “PowerShell data munging”

Rick Morrison says:

31 Oct 2007 at 5:57 pm

Hey Jon

You may want to have a look at Ipython, specifically the ipipe module, which can do some of these tricks:
http://ipython.scipy.org/moin/UsingIPipe

What’s more, adding support for Ipipe to your own classes is relatively easy (not sure what’s involved in Modad)
http://ipython.scipy.org/moin/SupportingIPipe

Rick

Loading...

Reply
Jim Culbert says:

1 Nov 2007 at 9:28 am

I could probably put this all on one line if you’d like.

foo = Hash.new(0)
CSV.open(“test.csv”, “r”) { |row| foo[row[0][/\d{4}/].to_i] += row[1].to_i}
puts “Name\t\tHits”
foo.sort.each { |i| puts “#{i[0]}\t\t#{i[1]}” }

Translation…
Create a hash with default value of integer=0

Foreach row in csv file pull the four digit string from the file name (first column), convert to an integer and use to index hash. Add (integer converted) value in second colum to value at hash index.

Do some pretty printing stuff.

Doesn’t quite have the same pipeliney flavor of the powershell example…

Loading...

Reply
Jim Culbert says:

1 Nov 2007 at 11:54 am

These types of problems are like mosquito bites for me – I can’t stop itching at them. A better ruby one-liner. Doesn’t print a header but not a big deal…

CSV.read(“test.csv”).inject(Hash.new(0)) {|h,row| h[row[0][/\d{4}/].to_i] += row[1].to_i;h}.sort.each {|i| puts(“#{i[0]}\t#{i[1]}”)}

Loading...

Reply
Tim says:

1 Nov 2007 at 9:44 pm

You ought to reference or at least check out Tim Bray’s “Wide Finder” project (at http://www.ongoing.com) which is looking at parallelized implementations of log analysis. Many different languages have had samples submitted.

As he said – this isn’t major “Application Architecture” stuff but it’s the kind of thing people have to do every day.

Loading...

Reply
Jim Baker says:

1 Nov 2007 at 11:09 pm

I suspect this is not in the spirit of Scott’s example – one liners are just not terribly Pythonic. (And rather hard to do in Python too! Sorry, golfers.) But it’s clear, concise, and robust:

#!/usr/bin/env python -u
import collections, csv, re, sys

reader = csv.reader(sys.stdin) # use python -u on windows to ensure
writer = csv.writer(sys.stdout) # that binary mode is used
show_hits = collections.defaultdict(int) # just like a perl auto-vivify
header = reader.next() # consume the first row into the header
show_re = re.compile(r”’
.*?_ # non-greedy match upto the show code – simplifies the regex
(?P\d{4}) # which is a 4-digit show code (or relax as desired)
”’, re.VERBOSE)
for row in reader:
data = dict(zip(header, row)) # re-express each row as a record dict
match = show_re.search(data[‘File’])
if match: # and if no match, perhaps throw an exception…
show_hits[match.group(‘show’)] += int(data[‘Hits’])
writer.writerow((‘Show’, ‘Hits’))
for show, hits in sorted(show_hits.iteritems()):
writer.writerow((show, hits))

Loading...

Reply
Jim Baker says:

1 Nov 2007 at 11:20 pm

Besides the whitespace being butchered by wordpress, I noticed that it removed the less-than,show,greater-than in the regex, which is used in match.group(‘show’). Feel free to contact me for the original source code.

Loading...

Reply
Lee says:

2 Nov 2007 at 12:29 am

Jim Culbert: PowerShell lets you whittle scripts down to line noise, too :)

http://www.leeholmes.com/blog/WillItPipeBrevityAndReadability.aspx

Loading...

Reply
Jon Udell says:

2 Nov 2007 at 4:48 pm

Thanks for these examples!

I see I wasn’t at all clear about what grabbed me in Scott’s example. It’s not the one-linerness of his original, vs the more unpacked version that Lee showed. Rather, it’s the way in which the operations of selection, grouping, and summarization are composed. It feels interestingly different to the style of Python and Ruby. But I guess I’m not sure whether I think that’s an essential or a superficial difference in style.

Loading...

Reply
Wai Yip Tung says:

3 Nov 2007 at 2:41 pm

rom my understanding, I think pipe can be define as

expr | f(a1,a2,…) —> f(expr, a1, a2, …)

Sometimes we have a series of functions

r1 = f1(a1, a2)
r2 = f2(r1, b1, b2)
r3 = f3(r2, c1, c2)

This can be compacted in one line as

f3( f2( f1(a1, a2), b1, b2), c1, c2)

It looks complicated here. But we have all used this pattern and it can be very clear when used right.

With a pipe syntax, this becomes

f1(a1, a2) | f2(b1, b2) | f3(c1, c2)

I think this is interesting. But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.

Loading...

Reply
Jon Udell says:

4 Nov 2007 at 11:28 am

Posting this for Wai Yip Tung:

import csv import re from operator import itemgetter as select from itertools import groupby from pprint import pprint def get_showhits(filename): regex = re.compile('/hanselminutes_(\d+).*') csv_file = csv.reader(open(filename)) header = csv_file.next() shows = map( select(0,1), csv_file ) shows = [(File, int(Hits), regex.match(File).group(1)) for File,Hits in shows] shows.sort(key=select(2)) shows_groups = groupby(shows, select(2)) showOutput = [(k, sum(map(select(1), g))) for k, g in shows_groups] return showOutput showOutput = get_showhits("stats.csv") showOutput = sorted(showOutput, key=select(1), reverse=True) pprint(showOutput)

He adds:

I think it mimics the PowerShell script fairly closely. On the other hand
I can cut 1 or 2 lines out if I just go with the Python way. Of course
Python do not have a Table data type so it cannot refer a column by name.
I think it is good enough to refer to column by number such as select(1),
etc.

PowerShell’s “select” keyword is an inspiration. The term is so much more
intuitive than Python’s little known operator.itemgetter method.

About the pipe syntax, right now every Python statement produce an
expression output. They are just throw away in the next statement. Perhaps
we can make some use of it. In fact in the interactive mode, the last
result is bound to the identifier “_”. So we can do something in the
spirit of pipe using _ like below:

>>> get_showhits(“log.csv”)
>>> sorted(_, key=select(1) ,reverse=True)
>>> pprint(_)
[(‘0026’, 78173),
(‘0075’, 25814),
(‘0076’, 24626),
(‘0077’, 17204),
(‘0076’, 15796),
(‘0078’, 14832),
(‘0078’, 11058)]
>>>

But again, I’m not sure if this is a common enough code pattern.

Loading...

Reply
Pingback: PowerShell data munging, revisited « Jon Udell
cabalamat says:

5 Nov 2007 at 12:55 pm

Regarding comment #9,

in Python, if you have (r1) as an object ans set up the right arguments, you can also do:

r1.f1(a1, a2).f2(a3, a4).f3(a5, a6)

But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.

I think for Python, you’re right.

Loading...

Reply