Undertaker

Undertaker is a new-school unix utility[1] that helps you structure your projects for sharing from the start.

Undertaker blog is being added gradually. I'll put in a left sitenav or something soonish.

"Ideas rot if you don't do something with them. I used to try to hoard them, but they rotted. Now I just blog them or tell people about them. Sometimes they still rot, but sometimes someone finds them useful in one way or another"
— Edd Dumbill [2]

I cannot emphasize enough that this is a draft of absolutely everything here. I'm not likely to change my mind on the fundamentals described below, but I'm going to reserve the right to, as this is my first foray into formalizing/systematizing my thoughts about these issues.

This whole project is, in a lot of ways, reactionary. It's my attempt to patch over a bunch of shortcomings and pitfalls I've run into over the last 15 years. A lot of my reasoning is grounded in anecdotes, many of them fuzzy memories. I like to believe that I'm not cargo-culting from my experiences, but it may very well be that I am.

As an aside of an aside, one of the people who's making me think about all this is Keith Fenner, of Turn Wright Machine Works. He's a guy who runs a machine shop on Cape Cod, and "vlogs" (fuck I hate that word) the work he does on the Turn Wright Machine Works YouTube channel. He's spending a little bit more time doing his work, and making good-if-rough videos of the process for others to see.

The idea that code is something you write and then eventually publish.
The idea that code and documents about that code are different-class citizens.
The idea that it's okay to release half-assed code but half-assed documentation isn't allowable.
The idea that either 1) blogs are permanent or 2) your exposition of your code is ephemeral.
The idea that "publishing" anything that involves server-side scripts to view is actually publishing.
The idea that a project's infrastructure suffices for documentation.
The idea that tarballs are obsolete, and we should all be pulling from HEAD all the time.

The inspiration for this was a sudden realization: "Y'know, I could put 90% of everything on my computer on the web, and it wouldn't cause any harm." It became abundantly clear that I wasn't publishing enough on the web. So, I started thinking about how to make sharing my default, instead of something I did later. The current results of that are here, though they are very much a work-in-progress.

A very short history of code publication

The most traditional unit of publication of software is the venerable tarball. This is just a compressed copy of the directory at a given point in time. Tarballs predate the web; their contents reflect the concerns of a different age. A tarball contained all of the code, and, if you were lucky, some documentation on how to actually use it. It also typically contained a ChangeLog and a TODO file, which were nods to the past and future, respectively. But to get to all of this content, you had to break open the eggshell of the compressed file. When the web came along, that changed, sort of.

Once we had the web, people started putting up homepages for their software packages. These were one step forward: you could view much of the documentation for a project without having to download its tarball, then manually grovel around in the files to find what you were looking for (thanks for Hypertext, Ted Nelson!). At the same time, it was two big steps back: people put their code up on FTP sites (not on the web), and the documentation started moving out of the tarball and onto the web.

READMEs began to contain the fateful line "For all the details, see the homepage at http://www.geocities.com/...". Of course, nobody put these websites into their source code tarballs, so now there's a bunch of code floating around the internet whose documentation was lost when the author's webhost went out of business and nobody bothered uploading the webpage to a new home. Similarly, there's a bunch of websites containing documentation for software that was only stored on FTP servers that have long since shut down.

As software got more complex, people started collaborating more. As part of that, we introduced revision control. The first widely used networked revision control system was CVS. This is the technology Sourceforge was based around: SF made it easy for people to set up a new project and get an internet-accessible CVS repository. This was a big step forward for collaboration, especially as they built out additional services, like bug tracking, collaboratively edited webpages, and other niceties.

This unification was very nice on one level: you only needed to go to one place to get all the artifacts for a given project. That is why Sourceforge was so popular and well-received. However, it also meant that all these projects and all their artifacts only existed so long as Sourceforge did.

This centralization creates a single point of failure (SPOF): when Sourceforge was down, thousands and thousands of open source projects were essentially unavailable. To their credit, SF has suffered very little down-time, but the SPOF persists in a more insidious fashion. When Sourceforge eventually runs out of money and shuts down (as almost all commercial concerns do), all of these projects suddenly disappear.

Many projects have transitioned from SF to other services over the years, notably bitbucket and github. These services are invaluable, and fill important needs around collaboration. However, we shouldn't think of them as publication platforms. Instead, we should be using them to create software, then distributing that software ourselves. It's this separation of creation and publication that I'm driving at. (For a comparison of these services, see http://en.wikipedia.org/wiki/Comparison_of_open_source_software_hosting_facilities )

git and a bit of inside-out thinking let us fix a lot of this

Things like wikis and bug tracking still belong on github et al, but we need to take back our artifacts.

"Publishing is like your parents coming over -- you have to clean it up and make it presentable"
— Danny O'Brien [3]

Most of the project is still very much WIP, so the best documentation is in the README file and the source code. If you want to use this right now, you'll need to be willing to do a fair bit of hacking, kludging, and otherwise fixing up to get around the stuff I have hard-coded for my own settings right now.

Download

You can git clone the code from this URL (right-click and "copy link address"). If that's too much effort for you, don't worry: the code is not in a state that you're going to want to cope with just yet.

Why not github?

There's an argument to be made for just using github (or bitbucket or sourceforge or...) for the source code storage, and then just putting everything up on a blog for the sharing of content. There are a few problems with this approach: durability, archivability, and portability.

Durability

"Durability" here means "How long a given publication will last." Of all the services mentioned above, sourceforge has by far the longest track record. It's been up and running for over a decade at this point. However, for the last five years, most of us have had serious concerns for SF's longer-term viability. Not very many people are putting new content up on SF, because it's not clear how much longer they're going to be around. Any content you put into a closed-system service will only exist as long as that service does.

Archivability

"Archivability" is not a word that I just made up, I swear. It refers to how easy it is to create an archive of a work. There are many people trying to create archives of the whole internet for posterity. The two most visible are the Internet Archive's Wayback Machine and Archive Team, a "loose collective of rogue archivists." These nice people are trying to make copies of your stuff so that people a hundred years from now can see it. The least we can do is make their lives easier, by creating structures that are amenable to crawling. We also want to avoid running any code on the webserver: if we can publish as a big chunk of static content, it can be archived trivially. On the other hand, if your site navigation is predicated on form submissions, search boxes, and such, it becomes a part of the "deep web," inaccessible to crawlers and archivists. Your site is also not archivable by yourself: if your homepage was built using PHP3, there's a good chance it won't run on recent PHP releases[4]. This means that your carefully-crafted dynamic homepage has, effectively, disappeared.

Portability

The last major concern is portability. How much work is it to move your project from one service to another? If your software has its whole homepage in github's wiki tool, and github decides to stop supporting that feature, how hard is it going to be to move that content off to another service? Whatever format our final "publication" takes, it should be trivial to move it from one service to another. This lines up nicely with the concerns of "archivability." If we publish a ball of static content, it can easily be crawled, and it's also trivial to deploy to another location. You simply need to scp it to another webserver, and maybe repoint DNS at it.
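One way to convince yourself that a ball of static content really "can easily be crawled" is to crawl it yourself. The following is a minimal sketch, not part of undertaker: the function names `reachable_files` and `unreachable_files`, and the assumption that the site starts at `index.html`, are mine. It follows plain relative links from the start page and reports any file a crawler could never find that way.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def reachable_files(site_root, start="index.html"):
    """Return the files under site_root reachable by following relative links.

    `start` is an assumed entry point; adjust it for your own layout.
    """
    root = Path(site_root).resolve()
    seen = set()
    queue = [root / start]
    while queue:
        page = queue.pop()
        if page in seen or not page.is_file():
            continue
        seen.add(page)
        if page.suffix.lower() not in (".html", ".htm"):
            continue  # non-HTML files are leaves: nothing to follow
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for href in parser.links:
            href, _fragment = urldefrag(href)
            parsed = urlparse(href)
            if not href or parsed.scheme or parsed.netloc:
                continue  # skip in-page anchors and links to other hosts
            target = (page.parent / unquote(parsed.path)).resolve()
            if root in target.parents:  # stay inside the site directory
                queue.append(target)
    return seen


def unreachable_files(site_root, start="index.html"):
    """Return files that exist on disk but that no chain of links leads to."""
    root = Path(site_root).resolve()
    everything = {p for p in root.rglob("*") if p.is_file()}
    return everything - reachable_files(root, start)
```

If `unreachable_files("public")` comes back empty, everything in the (hypothetical) `public` directory can be found by a crawler that only follows ordinary links, with no forms or search boxes in the way.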

Undertaker is strongly informed by these three concerns. It attempts to be as simple as possible, both in the structure of its output (a directory you can upload to literally any webserver), as well as its input (a directory of files, following an orderly pattern). This makes it easy for you to save copies of your work for backups and archiving, as well as making it easy to share your work widely using very basic, very widespread tools.
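To make the "directory in, directory out" shape concrete, here is a rough sketch of the general idea. It is not undertaker's actual code: `build_static_site` is a hypothetical name, and the generated `index.html` listing is my own assumption about what a minimal output might contain. It copies the input files verbatim and adds one static page linking to each of them.

```python
import html
import shutil
from pathlib import Path
from urllib.parse import quote


def build_static_site(source_dir, output_dir):
    """Copy source_dir to output_dir and write an index.html linking to every file.

    Hypothetical sketch: any index.html at the top of the source is overwritten,
    and output_dir is assumed not to live inside source_dir.
    """
    source = Path(source_dir)
    output = Path(output_dir)

    # Copy the input untouched, so the output is still a usable archive.
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(source, output, dirs_exist_ok=True)

    # List every copied file by its path relative to the output root.
    files = sorted(
        p.relative_to(output).as_posix()
        for p in output.rglob("*")
        if p.is_file() and p.relative_to(output).as_posix() != "index.html"
    )

    items = "\n".join(
        f'<li><a href="{html.escape(quote(name))}">{html.escape(name)}</a></li>'
        for name in files
    )
    page = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Index</title></head>\n<body><ul>\n"
        f"{items}\n</ul></body></html>\n"
    )
    (output / "index.html").write_text(page, encoding="utf-8")
    return files
```

The result is nothing but files and plain links, so moving it elsewhere is a matter of copying the output directory to another webserver.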

Simplicity

It also tries to keep its processing of your input to a minimum, so that an undertaker content repository is a decent archive itself, without requiring that anyone actually have undertaker to make sense of it[5]. Locking your information up in proprietary formats is like hiring someone to write it all down in Navajo, Icelandic, or Esperanto. While you're paying her to do this, it's easy to store and retrieve your information yourself. But, at some point, she's going to go away. Sure, there are a few people around the world who will be able to get your content out for you, but why not keep your data in a human-readable format in the first place?

Footnotes

1. ^ New-school unix utilities are utilities made up of subcommands. The best-known examples are git and rails. To do anything, you run 'git foo', and behind the scenes, git is actually just running 'git-foo' for you. Your whole interface with a non-trivial world of software comes via one meta-command, which just multiplexes itself out to subcommands. For a better description (with a fair bit of distraction about accounting software), see Joey Hess's post about hledger. ^return^

2. ^ From Cory Doctorow's notes on Danny O'Brien's talk on Life Hacks (http://www.craphound.com/lifehacks2.txt) ^return^

3. ^ From Cory Doctorow's notes on Danny O'Brien's talk on Life Hacks (http://www.craphound.com/lifehacks2.txt), with small edit made per Danny in correspondence ^return^

4. ^ This is something I am acutely worried about. The PHP3 example described comes directly from my experience working at ibiblio, an internet library that provides hosting to nonprofits. We had to upgrade to PHP4 for security reasons, after trying not to for years. We set up a test server, and let all our users know about the pending switch. All the maintained projects had a few weeks to figure out how to get their code to run under PHP4, and then we threw the switch. As soon as we did, large swathes of information that had been available the day before disappeared from the internet. ^return^

5. ^ This is another example from personal experience. One of my first jobs in tech was working on old accounting systems, helping customers upgrade to newer systems. A necessary part of this is importing the old data into the new system. None of these systems had "export as CSV," and many of them were orphaned, with no company offering support. So, I was given a few giant blobs of data, and had to reverse engineer the database formats to build custom data export scripts. This isn't as bad as it sounds, since most of them were written in COBOL, which has blissfully consistent/stupid record packing routines. Better examples of "dead" file formats can be found at Archive Team's File Format Problem work. ^return^
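As a companion to footnote 1, the "one meta-command multiplexing out to subcommands" pattern can be sketched in a few lines. This is only an illustration of the PATH-lookup idea, under my own assumptions: `resolve_subcommand` and `dispatch` are hypothetical names, and neither git nor undertaker is claimed to be implemented this way.

```python
import shutil
import subprocess
import sys


def resolve_subcommand(tool, args):
    """Turn ('git', ['foo', '--bar']) into the argv ['git-foo', '--bar']."""
    if not args:
        raise ValueError(f"usage: {tool} <subcommand> [args...]")
    subcommand, rest = args[0], args[1:]
    return [f"{tool}-{subcommand}", *rest]


def dispatch(tool, args):
    """Run the executable named '<tool>-<subcommand>' found on PATH.

    Returns the subcommand's exit status, or 127 if no such executable exists.
    """
    argv = resolve_subcommand(tool, args)
    executable = shutil.which(argv[0])
    if executable is None:
        print(f"{tool}: '{args[0]}' is not a {tool} command", file=sys.stderr)
        return 127
    return subprocess.run([executable, *argv[1:]]).returncode
```

A wrapper script would call something like `sys.exit(dispatch("undertaker", sys.argv[1:]))`, so that adding a new subcommand only means dropping an executable named `undertaker-something` (a hypothetical name) onto your PATH.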