PowerShell data munging

I first wrote about PowerShell back in 2004, when it was called Monad and/or MSH. What most intrigued me was the way .NET objects could flow through its pipeline. I wouldn’t have thought then, or now, of reaching for PowerShell just to do some basic logfile processing. But Scott Hanselman’s log analysis example today got my attention.

My assumption was that Python would offer easier and more natural ways to absorb, consolidate, sort, and emit the kind of simple log data shown in Scott’s example. But when I tried to recreate that example in Python, I developed a new appreciation for PowerShell’s data munging chops. The selection, grouping, summing, and sorting capabilities — and the pervasive use of the pipeline — are potent ways to manipulate .NET objects. But they’re equally potent when you’re just munging a CSV file.

True, Python has a csv module that’s roughly equivalent to PowerShell’s Import-CSV, so you can easily slurp the data into a dictionary, taking fieldnames from the first row. But how do you perform the selection, grouping, summing, and sorting as succintly and — for lack of a better word — as composably? I didn’t find obvious answers. Of course, just because I failed doesn’t mean it can’t be done. I’d be curious to see solutions in Python (or Ruby, Perl, etc.) that people think capture the spirit of Scott’s example. I suspect there may be interesting lessons to be learned going in both directions.

PS: When I ran the script that Lee Holmes wrote based on Scott’s one-liner, I was initially puzzled to see a one-column display of names, not a two-column display of names and counts. Then I realized it was a formatting problem — the second column was running off the edge of my display. So I changed:

Get-ShowHits | Sort -Desc Hits

To:

Get-ShowHits | Sort -Desc Hits | Format-Table Name,Hits -auto

Posted in .

12 thoughts on “PowerShell data munging

  1. I could probably put this all on one line if you’d like.

    foo = Hash.new(0)
    CSV.open(“test.csv”, “r”) { |row| foo[row[0][/\d{4}/].to_i] += row[1].to_i}
    puts “Name\t\tHits”
    foo.sort.each { |i| puts “#{i[0]}\t\t#{i[1]}” }

    Translation…
    Create a hash with default value of integer=0

    Foreach row in csv file pull the four digit string from the file name (first column), convert to an integer and use to index hash. Add (integer converted) value in second colum to value at hash index.

    Do some pretty printing stuff.

    Doesn’t quite have the same pipeliney flavor of the powershell example…

  2. These types of problems are like mosquito bites for me – I can’t stop itching at them. A better ruby one-liner. Doesn’t print a header but not a big deal…

    CSV.read(“test.csv”).inject(Hash.new(0)) {|h,row| h[row[0][/\d{4}/].to_i] += row[1].to_i;h}.sort.each {|i| puts(“#{i[0]}\t#{i[1]}”)}

  3. You ought to reference or at least check out Tim Bray’s “Wide Finder” project (at http://www.ongoing.com) which is looking at parallelized implementations of log analysis. Many different languages have had samples submitted.

    As he said – this isn’t major “Application Architecture” stuff but it’s the kind of thing people have to do every day.

  4. I suspect this is not in the spirit of Scott’s example – one liners are just not terribly Pythonic. (And rather hard to do in Python too! Sorry, golfers.) But it’s clear, concise, and robust:

    #!/usr/bin/env python -u
    import collections, csv, re, sys

    reader = csv.reader(sys.stdin) # use python -u on windows to ensure
    writer = csv.writer(sys.stdout) # that binary mode is used
    show_hits = collections.defaultdict(int) # just like a perl auto-vivify
    header = reader.next() # consume the first row into the header
    show_re = re.compile(r”’
    .*?_ # non-greedy match upto the show code – simplifies the regex
    (?P\d{4}) # which is a 4-digit show code (or relax as desired)
    ”’, re.VERBOSE)
    for row in reader:
    data = dict(zip(header, row)) # re-express each row as a record dict
    match = show_re.search(data[‘File’])
    if match: # and if no match, perhaps throw an exception…
    show_hits[match.group(‘show’)] += int(data[‘Hits’])
    writer.writerow((‘Show’, ‘Hits’))
    for show, hits in sorted(show_hits.iteritems()):
    writer.writerow((show, hits))

  5. Besides the whitespace being butchered by wordpress, I noticed that it removed the less-than,show,greater-than in the regex, which is used in match.group(‘show’). Feel free to contact me for the original source code.

  6. Thanks for these examples!

    I see I wasn’t at all clear about what grabbed me in Scott’s example. It’s not the one-linerness of his original, vs the more unpacked version that Lee showed. Rather, it’s the way in which the operations of selection, grouping, and summarization are composed. It feels interestingly different to the style of Python and Ruby. But I guess I’m not sure whether I think that’s an essential or a superficial difference in style.

  7. rom my understanding, I think pipe can be define as

    expr | f(a1,a2,…) —> f(expr, a1, a2, …)

    Sometimes we have a series of functions

    r1 = f1(a1, a2)
    r2 = f2(r1, b1, b2)
    r3 = f3(r2, c1, c2)

    This can be compacted in one line as

    f3( f2( f1(a1, a2), b1, b2), c1, c2)

    It looks complicated here. But we have all used this pattern and it can be very clear when used right.

    With a pipe syntax, this becomes

    f1(a1, a2) | f2(b1, b2) | f3(c1, c2)

    I think this is interesting. But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.

  8. Posting this for Wai Yip Tung:


    import csv
    import re
    from operator import itemgetter as select
    from itertools import groupby
    from pprint import pprint
    def get_showhits(filename):
      regex = re.compile('/hanselminutes_(\d+).*')
      csv_file = csv.reader(open(filename))
      header = csv_file.next()
      shows = map( select(0,1), csv_file )
      shows = [(File, int(Hits), regex.match(File).group(1)) for File,Hits in shows]
      shows.sort(key=select(2))
      shows_groups = groupby(shows, select(2))
      showOutput = [(k, sum(map(select(1), g))) for k, g in shows_groups]
      return showOutput
    showOutput = get_showhits("stats.csv")
    showOutput = sorted(showOutput, key=select(1), reverse=True)
    pprint(showOutput)

    He adds:

    I think it mimics the PowerShell script fairly closely. On the other hand
    I can cut 1 or 2 lines out if I just go with the Python way. Of course
    Python do not have a Table data type so it cannot refer a column by name.
    I think it is good enough to refer to column by number such as select(1),
    etc.

    PowerShell’s “select” keyword is an inspiration. The term is so much more
    intuitive than Python’s little known operator.itemgetter method.

    About the pipe syntax, right now every Python statement produce an
    expression output. They are just throw away in the next statement. Perhaps
    we can make some use of it. In fact in the interactive mode, the last
    result is bound to the identifier “_”. So we can do something in the
    spirit of pipe using _ like below:

    >>> get_showhits(“log.csv”)
    >>> sorted(_, key=select(1) ,reverse=True)
    >>> pprint(_)
    [(‘0026’, 78173),
    (‘0075’, 25814),
    (‘0076’, 24626),
    (‘0077’, 17204),
    (‘0076’, 15796),
    (‘0078’, 14832),
    (‘0078’, 11058)]
    >>>

    But again, I’m not sure if this is a common enough code pattern.

  9. Regarding comment #9,

    in Python, if you have (r1) as an object ans set up the right arguments, you can also do:

    r1.f1(a1, a2).f2(a3, a4).f3(a5, a6)

    But I’m not sure if this code pattern is prevalent enough to justify introducing a new syntax into a language.

    I think for Python, you’re right.

Leave a Reply