Why LLM-assisted table transformation is a big deal

Last week I had to convert a table in a Google Doc to a JSON structure that will render as an HTML page. This is the sort of mundane task that burns staggering amounts of information workers’ time, effort, and attention. Until recently there were two ways to do it. For most people, it’s a mind-numbing manual exercise. For some (like me), it’s sometimes possible to write a script to do the transformation, but that’s another mind-numbing exercise, and it’s always tough to decide in advance whether the time and effort saved will be worth the trouble.

Tabular data is famously complex. Recently, I wrote a Steampipe plugin to find tables in web pages and transform them into Postgres tables. Then, naively, I tried to do the same for PDF files. It had been a while since I used any of the available libraries for parsing PDFs. Last time it hadn’t gone well; I thought maybe things had improved since then, but no, extracting tables from PDFs remains an unsolved problem. That’s tragic for many reasons, not least because the corpus of scientific knowledge lives in PDF files, and the tabular info they contain remains unavailable to machine processing. That’s not just my opinion. I consulted Peter Murray-Rust, who has spent decades working on this problem, and he confirmed that as yet there’s no general solution.

An eye-opening outcome

So you can imagine my surprise and delight when ChatGPT-4 accomplished the task. As usual, I first tried the Simplest Thing That Could Possibly Work. Rather than exporting the GDoc to one of the available formats, I just copied from the screen, pasted into the chat, and asked for a Markdown table. What you get, when you paste a copied GDoc table, is completely unformatted: one item per line, with no indication of what’s a table header, or a section header, or a normal row. There’s no apparent structure at all. And yet the tool was able to find the structure in that undifferentiated stream of items, and faithfully recreate the table.
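To make that concrete, here’s a hypothetical miniature of the effect; the real table was much larger, and these items are made up:

```
What the paste looks like, one cell per line:

Item
Status
Login
Done
Search
In progress

What comes back:

| Item   | Status      |
|--------|-------------|
| Login  | Done        |
| Search | In progress |
```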

I then presented an example of the JSON format, and asked for a transformation of the Markdown table to that format. It was a big table with a dozen sections, each with a handful of rows, and that didn’t go well. What I’m learning, though, is that the same strategy we use as programmers — decompose big tasks into smaller chunks — works well with LLMs. That’s why I opted for Markdown as an intermediate format, instead of asking to transform the raw data directly to the target JSON format.
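In this case ChatGPT did the translation, but to show the shape of the per-section step, here’s a minimal sketch in Python, assuming a hypothetical target format (the real one isn’t reproduced here):

```python
# A minimal sketch of the per-section translation, with an assumed target
# shape; in practice ChatGPT did this step, and the real format differed.
def section_to_json(name: str, markdown_table: str) -> dict:
    lines = [l.strip() for l in markdown_table.strip().splitlines() if l.strip()]
    headers = [h.strip() for h in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(headers, cells)))
    return {"section": name, "rows": rows}
```

The point isn’t the code; it’s that each section is a small, self-contained unit the model can handle reliably.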

The next part was especially interesting. I asked for a list of the section names, which it reported correctly. Then I asked it to translate each section, by name, into the target format. That worked. It still required a dozen manual steps, because I had to append each generated JSON chunk to the final output, but that took a fraction of the time and effort that would otherwise have been required.
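Assembling the chunks is mechanical, too. Here’s a sketch of the step I did by hand, assuming each chunk was saved to a file (the names and wrapper shape are hypothetical):

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON chunk per section, saved from the chat
# as chunks/section_01.json, chunks/section_02.json, and so on.
sections = []
for path in sorted(Path("chunks").glob("section_*.json")):
    sections.append(json.loads(path.read_text()))

# Wrap the per-section chunks in a final structure (the shape is assumed).
Path("table.json").write_text(json.dumps({"sections": sections}, indent=2))
```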

We are on the cusp of a new era of cognitive surplus. People shouldn’t have to laboriously transpose document formats, yet millions (maybe billions) of information workers spend minutes (maybe hours) doing exactly that every day. These are tedious chores; it’s impossible to overstate how taxing they are, or what a relief it is to outsource them to the machine. That said…

A few caveats

Beginner’s luck?: I was unable to reproduce the behavior exactly as described here. This seems typical so far with LLMs, and I sometimes wonder if some of the best outcomes are just beginner’s luck. But actually, I think it may have more to do with interaction boundaries. In ChatGPT (and friends), New Chat (or equivalent) doesn’t seem to reset in a way that enables me to reproduce a result. Maybe there is no way?

Context window: I tried prompting with docx and html exports of the GDoc, but they were too large. My guess is that more context headroom wouldn’t necessarily have enabled me to do the whole job in one go, versus multiple interactions, because the bigger the prompt, the less reliably the model follows the path to a desired transformation. Again, who knows? We’re in a realm of experimentation that’s full of such unknowns.

Verification: My first rule for effective use of an LLM to perform some well-defined task has been: It’s quick and easy to verify the result. Right away, simple bash and Python scripts met that criterion. A script that combines a bunch of isomorphic CSV files into one big file, while deduplicating the headers, either works or it doesn’t. That’s easy to verify. And it’s quick, too, even if you wind up iterating a few times. The table transformation described here falls into the same category. It did require a proofread, and I made a couple of tiny tweaks, but verification was quick and easy. Had it been 10x larger, though, the burden of verification would have sucked some of the value out of the LLM solution.
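For the record, here’s roughly what that CSV-combining script looks like; the paths are hypothetical, but the header-deduplicating logic is the easy-to-verify part:

```python
import csv
import glob

# Combine isomorphic CSV files (hypothetical paths) into one file,
# writing the shared header row exactly once.
writer = None
with open("combined.csv", "w", newline="") as out:
    for path in sorted(glob.glob("data/*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)          # every input repeats this header
            if writer is None:
                writer = csv.writer(out)
                writer.writerow(header)    # keep it once
            writer.writerows(reader)       # append the remaining data rows
```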

Other LLMs: None of the others did well on this task. I expect they’ll all improve, though, and as they do, it’ll become more and more feasible to ask for multiple results in parallel, and look for consensus among them.
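A sketch of what that consensus check might look like, with `models` standing in for hypothetical wrappers around the various LLM APIs:

```python
from collections import Counter
from typing import Callable

# `models` stands in for hypothetical wrappers around different LLM APIs,
# each mapping a prompt to a string answer.
def consensus(prompt: str, models: list[Callable[[str], str]]) -> str | None:
    answers = [model(prompt) for model in models]
    best, count = Counter(answers).most_common(1)[0]
    # Accept a result only when a strict majority of models agree on it.
    return best if count > len(answers) // 2 else None
```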

PDF tables in the scientific record

It’s my understanding that arxiv.org is in the Common Crawl, so the LLMs are seeing zillions of tables in scientific papers. Tables are used there for layout, for data, or for a mixture of both purposes. Nowadays, when a conversation turns to the brokenness of scholarly publishing, and the perpetual dominance of data-unfriendly PDF formats, my eyes start to bleed. Semantic markup, smoothly implemented from inception to collaborative editing to HTML publication, was always the dream. Maybe it’ll never happen that way; instead, we’ll just use anything to write and edit, we’ll publish to PDF (but please not two-column!), and then we’ll extract document structure at the end of the toolchain. I never thought that might be possible, and I’m not sure it’s the right answer, but this is a good time to be open to unconventional possibilities. Whatever unlocks the data in the world’s corpus of scholarly PDFs is worth doing.
