Comments on: PowerShell data munging

By: cabalamat

cabalamat — Mon, 05 Nov 2007 17:55:11 +0000

Regarding comment #9,

in Python, if r1 is an object and you set up its methods so that each call returns an object supporting the next one, you can also do:

r1.f1(a1, a2).f2(a3, a4).f3(a5, a6)

But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.

I think for Python, you’re right.
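
A minimal sketch of that chaining style, assuming a hypothetical wrapper class (Hits, where, sort_by and take are names made up for illustration, not from any library). Each method returns a new wrapper, which is what lets the calls be strung together with dots:

```python
class Hits:
    """Hypothetical wrapper around a list of (show, hits) pairs."""

    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, predicate):
        # Keep only the rows for which predicate(row) is true.
        return Hits(r for r in self.rows if predicate(r))

    def sort_by(self, key, reverse=False):
        # Return a new wrapper with the rows sorted by key.
        return Hits(sorted(self.rows, key=key, reverse=reverse))

    def take(self, n):
        # Return a new wrapper holding the first n rows.
        return Hits(self.rows[:n])


# Illustrative data only.
data = Hits([("0075", 25814), ("0026", 78173), ("0077", 17204)])
top = (
    data.where(lambda r: r[1] > 20000)
        .sort_by(lambda r: r[1], reverse=True)
        .take(1)
)
print(top.rows)
```

The cost is that every operation has to live on the wrapper class, whereas a pipe can feed any value to any function.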

By: PowerShell data munging, revisited « Jon Udell

PowerShell data munging, revisited « Jon Udell — Mon, 05 Nov 2007 15:43:38 +0000

[…] can be dicey to invite comparisons between programming languages, as I did last week in an entry on data munging with PowerShell. But in this case, although I didn’t at first articulate very well what I found interesting […]

By: Jon Udell

Jon Udell — Sun, 04 Nov 2007 16:28:45 +0000

Posting this for Wai Yip Tung:

import csv
import re
from operator import itemgetter as select
from itertools import groupby
from pprint import pprint

def get_showhits(filename):
    regex = re.compile('/hanselminutes_(\d+).*')
    csv_file = csv.reader(open(filename))
    header = csv_file.next()
    shows = map(select(0, 1), csv_file)
    shows = [(File, int(Hits), regex.match(File).group(1)) for File, Hits in shows]
    shows.sort(key=select(2))
    shows_groups = groupby(shows, select(2))
    showOutput = [(k, sum(map(select(1), g))) for k, g in shows_groups]
    return showOutput

showOutput = get_showhits("stats.csv")
showOutput = sorted(showOutput, key=select(1), reverse=True)
pprint(showOutput)

He adds:

I think it mimics the PowerShell script fairly closely. On the other hand,
I could cut 1 or 2 lines out if I just went the Python way. Of course
Python does not have a Table data type, so it cannot refer to a column by name.
I think it is good enough to refer to a column by number, such as select(1),
etc.

PowerShell’s “select” command is an inspiration. The term is so much more
intuitive than Python’s little-known operator.itemgetter function.

About the pipe syntax: right now every Python expression statement produces
a value, which is simply thrown away before the next statement. Perhaps
we can make some use of it. In fact, in interactive mode the last
result is bound to the identifier “_”. So we can do something in the
spirit of a pipe using _, like below:

>>> get_showhits("log.csv")
>>> sorted(_, key=select(1), reverse=True)
>>> pprint(_)
[('0026', 78173),
('0075', 25814),
('0076', 24626),
('0077', 17204),
('0076', 15796),
('0078', 14832),
('0078', 11058)]
>>>

But again, I’m not sure if this is a common enough code pattern.
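
Outside the interactive prompt there is no “_”, but the same idea of handing each result to the next step can be sketched with a small helper. This is only an illustration: pipeline is a made-up name, not a standard function, and the rows are sample data.

```python
from functools import reduce
from operator import itemgetter


def pipeline(value, *steps):
    """Feed value through each step in turn, as `_` does by hand at the prompt."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Illustrative data only.
rows = [("0075", 25814), ("0026", 78173), ("0077", 17204)]

ranked = pipeline(
    rows,
    lambda r: sorted(r, key=itemgetter(1), reverse=True),  # highest hits first
    lambda r: r[:2],                                        # keep the top two
)
print(ranked)
```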

By: Wai Yip Tung

Wai Yip Tung — Sat, 03 Nov 2007 19:41:23 +0000

From my understanding, I think a pipe can be defined as

expr | f(a1, a2, ...) -> f(expr, a1, a2, ...)

Sometimes we have a series of functions

r1 = f1(a1, a2)
r2 = f2(r1, b1, b2)
r3 = f3(r2, c1, c2)

This can be compacted into one line as

f3( f2( f1(a1, a2), b1, b2), c1, c2)

It looks complicated here. But we have all used this pattern and it can be very clear when used right.

With a pipe syntax, this becomes

f1(a1, a2) | f2(b1, b2) | f3(c1, c2)

I think this is interesting. But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.
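
For what it is worth, the definition above can be approximated in today's Python without new syntax by overloading the | operator. The following is a rough sketch under that assumption; Pipe, add and scale are hypothetical names invented for the example.

```python
class Pipe:
    """Wrap a function so that `value | Pipe(f, a1, a2)` calls f(value, a1, a2)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __ror__(self, value):
        # Called for `value | self` when value's own type does not handle `|`
        # with a Pipe on the right-hand side.
        return self.func(value, *self.args, **self.kwargs)


def add(x, n):
    return x + n


def scale(x, factor):
    return x * factor


# Reads left to right: add(3, 4), then scale(<that result>, 10).
result = 3 | Pipe(add, 4) | Pipe(scale, 10)
print(result)
```

The trick relies on the left-hand value declining the | operation itself, so a type that defines its own | for arbitrary operands would bypass the wrapper. That fragility is one argument for real syntax over a library workaround.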

By: Jon Udell

Jon Udell — Fri, 02 Nov 2007 21:48:31 +0000

Thanks for these examples!

I see I wasn’t at all clear about what grabbed me in Scott’s example. It’s not the one-linerness of his original, vs the more unpacked version that Lee showed. Rather, it’s the way in which the operations of selection, grouping, and summarization are composed. It feels interestingly different to the style of Python and Ruby. But I guess I’m not sure whether I think that’s an essential or a superficial difference in style.

By: Lee

Lee — Fri, 02 Nov 2007 05:29:30 +0000

Jim Culbert: PowerShell lets you whittle scripts down to line noise, too :)

http://www.leeholmes.com/blog/WillItPipeBrevityAndReadability.aspx

By: Jim Baker

Jim Baker — Fri, 02 Nov 2007 04:20:16 +0000

Besides the whitespace being butchered by wordpress, I noticed that it removed the less-than,show,greater-than in the regex, which is used in match.group(‘show’). Feel free to contact me for the original source code.

By: Jim Baker

Jim Baker — Fri, 02 Nov 2007 04:09:40 +0000

I suspect this is not in the spirit of Scott’s example – one liners are just not terribly Pythonic. (And rather hard to do in Python too! Sorry, golfers.) But it’s clear, concise, and robust:

#!/usr/bin/env python -u
import collections, csv, re, sys

reader = csv.reader(sys.stdin)   # use python -u on windows to ensure
writer = csv.writer(sys.stdout)  # that binary mode is used
show_hits = collections.defaultdict(int)  # just like a perl auto-vivify
header = reader.next()  # consume the first row into the header
show_re = re.compile(r'''
    .*?_             # non-greedy match up to the show code - simplifies the regex
    (?P<show>\d{4})  # which is a 4-digit show code (or relax as desired)
    ''', re.VERBOSE)
for row in reader:
    data = dict(zip(header, row))  # re-express each row as a record dict
    match = show_re.search(data['File'])
    if match:  # and if no match, perhaps throw an exception...
        show_hits[match.group('show')] += int(data['Hits'])
writer.writerow(('Show', 'Hits'))
for show, hits in sorted(show_hits.iteritems()):
    writer.writerow((show, hits))

By: Tim

Tim — Fri, 02 Nov 2007 02:44:17 +0000

You ought to reference or at least check out Tim Bray’s “Wide Finder” project (at http://www.ongoing.com) which is looking at parallelized implementations of log analysis. Many different languages have had samples submitted.

As he said – this isn’t major “Application Architecture” stuff but it’s the kind of thing people have to do every day.

By: Jim Culbert

Jim Culbert — Thu, 01 Nov 2007 16:54:17 +0000

These types of problems are like mosquito bites for me – I can’t stop itching at them. A better ruby one-liner. Doesn’t print a header but not a big deal…

CSV.read("test.csv").inject(Hash.new(0)) {|h,row| h[row[0][/\d{4}/].to_i] += row[1].to_i;h}.sort.each {|i| puts("#{i[0]}\t#{i[1]}")}

By: Jim Culbert

Jim Culbert — Thu, 01 Nov 2007 14:28:19 +0000

I could probably put this all on one line if you’d like.

foo = Hash.new(0)
CSV.open("test.csv", "r") { |row| foo[row[0][/\d{4}/].to_i] += row[1].to_i }
puts "Name\t\tHits"
foo.sort.each { |i| puts "#{i[0]}\t\t#{i[1]}" }

Translation…
Create a hash with default value of integer=0

For each row in the CSV file, pull the four-digit string from the file name (first column), convert it to an integer, and use it to index the hash. Add the (integer-converted) value in the second column to the value at that hash index.

Do some pretty printing stuff.

Doesn’t quite have the same pipeliney flavor as the PowerShell example…
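
For comparison, the steps in the translation above can be sketched in Python as follows. This is an illustration only: hits_by_show is a made-up name, the sample rows and file names are invented, and rows without a four-digit number or a numeric hit count (such as the header) are simply skipped.

```python
import csv
import re
from collections import defaultdict


def hits_by_show(lines):
    """Sum the second column per four-digit show number found in the first column."""
    totals = defaultdict(int)  # missing keys start at 0, like Hash.new(0)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        match = re.search(r"\d{4}", row[0])
        if match and row[1].isdigit():
            totals[int(match.group())] += int(row[1])
    return sorted(totals.items())


# Invented sample input; a real run would pass an open file instead.
sample = [
    "File,Hits",
    "/hanselminutes_0026_a.mp3,10",
    "/hanselminutes_0075.mp3,5",
    "/hanselminutes_0026_b.wma,7",
]

for show, hits in hits_by_show(sample):
    print(f"{show}\t{hits}")
```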

By: Rick Morrison

Rick Morrison — Wed, 31 Oct 2007 22:57:23 +0000

Hey Jon

You may want to have a look at IPython, specifically the ipipe module, which can do some of these tricks:
http://ipython.scipy.org/moin/UsingIPipe

What’s more, adding support for ipipe to your own classes is relatively easy (not sure what’s involved in Monad)
http://ipython.scipy.org/moin/SupportingIPipe

Rick