GitHub Office of the CTO

Flat Data

Flat explores how to make it easy to work with data in git and GitHub. It builds on the “git scraping” approach pioneered by Simon Willison to offer a simple pattern for bringing working datasets into your repositories and versioning them, because developing against local datasets is faster and easier than working with data over the wire.

What's it for?
Bring working sets of data to your repositories
Flat Data eschews the complexity of many tools in the data/ETL space in favor of something simple and flexible enough for many workloads, but which requires no user-maintained infrastructure. While Flat has a ton of utility for developers, we want to make it easier for scientists, journalists, and other developer-adjacent audiences to develop lightweight, data-driven apps.

The Flat Data project incorporates three different pieces: the Flat Action, which fetches data and commits it to a repository; the Flat Editor, a VS Code extension for authoring Flat workflows; and the Flat Viewer, a web app for browsing the resulting data.

Flat is an experiment from the Developer Experience team in GitHub's Office of the CTO and is not an official GitHub product. We are publishing it (as well as all documentation and examples) under the terms of the MIT license.

How to use Flat Data

Currently, Flat knows how to fetch data from two kinds of sources: any HTTP endpoint, or a SQL query against a supported datastore. Let's start with the simpler of the two, an HTTP fetch!

To get started with Flat Data, all you need is an endpoint or file that you want to download into a GitHub repository. In this case, we’ll be fetching the current Bitcoin price from this link every five minutes and storing that data in a GitHub repository. The complete example with a detailed tutorial can be found here.
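As a sketch, a workflow along these lines could live in `.github/workflows/flat.yml` (the endpoint URL below is a hypothetical placeholder, and the action version is an assumption; check the tutorial repository for exact values):

```yaml
name: Fetch Bitcoin price
on:
  schedule:
    - cron: '*/5 * * * *'   # run every five minutes
  workflow_dispatch: {}      # allow manual runs from the Actions tab
jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v2
      - name: Fetch data
        uses: githubocto/flat@v3
        with:
          http_url: https://example.com/btc-price.json  # hypothetical endpoint
          downloaded_filename: btc-price.json           # where to store it in the repo
```

On each run, the Flat Action fetches the URL and commits `btc-price.json` only if its contents changed.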

Part I: Get data into your GitHub repository

We hope you’re already starting to see how the Flat Action can be useful. It allows you to download any kind of data (JSON, CSV, images, text files, zip archives, etc.) into your repositories on a repeatable schedule. As long as what you’re fetching has a URL, it can be downloaded into a GitHub repo.

At this point you’ve used Flat Data to fetch data on a schedule and keep it up to date in your repo. Congrats!

But Flat Data can do a bit more. Read the optional parts II and III to see how you can add an extra step to process and view your data in more advanced ways.

Part II (optional): Make the data your own

What if you want to process or change the data in some way before it gets added to your repository? We’ve built in a way for you to run these types of tasks using the postprocess parameter in the Action. Let’s take a look at how to do that.
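For illustration, here is the kind of trimming a postprocessing step might do. The Flat Action runs postprocess scripts with Deno, passing the downloaded filename as an argument; only the pure transformation is sketched below, and the response shape is hypothetical:

```typescript
// Hypothetical shape of the raw API response the Flat Action fetched.
interface RawPrice {
  time: { updatedISO: string };
  bpi: { USD: { rate_float: number } };
}

// Keep only the two fields we actually want to commit to the repo.
// In a real Flat workflow, a postprocess script would read the
// downloaded file, apply a transformation like this, and write the
// trimmed result back out before the Action commits it.
export function trim(raw: RawPrice): { time: string; usd: number } {
  return { time: raw.time.updatedISO, usd: raw.bpi.USD.rate_float };
}
```

Committing the trimmed file instead of the raw response keeps diffs small and makes the commit history easier to read.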

Part III (optional): Visualizing our data for easy sharing

Now that we have our data, what can we do with it? There are tons of options (throw it in a database, create a dashboard that uses it), but we just want to visualize and share it in a simple way.

For that we built a simple tool that takes any public GitHub repo and returns a nice GUI for viewing data. To use it, just prepend flat to any GitHub repo's URL. For example, https://flatgithub.com/the-pudding/data.
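In other words, the transformation is just a hostname swap. A small helper (illustrative only, not part of Flat) makes the rule concrete:

```typescript
// Turn a github.com repository URL into its Flat Viewer equivalent
// by replacing the host with flatgithub.com.
export function flatViewerUrl(repoUrl: string): string {
  const u = new URL(repoUrl);
  if (u.hostname !== "github.com") {
    throw new Error("expected a github.com URL");
  }
  u.hostname = "flatgithub.com";
  return u.toString();
}
```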

You can use Flat Viewer to take a closer look at any tabular, non-nested CSV or JSON file that lives in a GitHub repo, like these ones from The Pudding.

Once your Action has run a few times, you can see how the data has changed over time by clicking through the Commits dropdown. The table will highlight new, modified, and deleted lines for each commit.

Examples

As we saw in our walkthrough, the basics are very simple. You can get a Flat Action up and running in a few minutes. But Flat Data can be very powerful!

Here are a few examples for your reference, and to show the breadth of what Flat Data can help you with. Even more examples can be found here.

Why Flat Data?

There’s a certain kind of language that dominates the discourse on data today. Big data, bigger data, biggest data. A million rows aren't cool. You know what's cool? A billion rows. Distributed data systems that slip the surly bonds of any one machine to touch the face of scale. Techniques for sampling and transforming data while it moves instead of operating on it after it has come to rest. Strategies for contending with a deluge of events from chatty devices. Directing those data tributaries into undifferentiated data lakes so that we may pose different queries to the data someday. Beyond the pain of actually doing this work, it has become impossible to talk about this stuff without abusing metaphors beyond their safe design limits.

As a developer, all those bits occupying the proverbial lake/warehouse/refinery are as immediately useful as a grape seed is to a winery. Locality of data isn't some abstract concept when you're trying to build things on top of that data — it's the leading term of developer experiences. If I have the data, I can load it, and get to work. If I don't have the data, then it doesn't matter if the data is cleaned, filtered, and sorted. If I don't have the data, then either I alter my application logic to work with data at a distance, or I figure out how to bring a working set to my local environment.

As a result, there's an entire industry of tools that accomplish the chore of getting data to the right place, in the right format, at the right time. Data architectures vary wildly, so these solutions have a wide range of ambition (what they attempt to do) and complexity (what mess they attempt to conceal). This is in contrast to the application/compute space, where pithy, prescriptive manifestos like 12 Factor offer a great lens through which to think: "Do it this way, and thou shalt scale." There is no equivalent for data because there are so many different approaches; one can only nail so many theses to the church doors before running out of room to describe best practices.

It's easy to get dazzled by the complexity and diversity of data tooling, but we're not the first profession to invent hyper-specialized implements. Surgeons have trays full of weird tweezer-like things whose shapes differ but whose ultimate purpose is the same: grabbing hold of stuff that might be difficult to grab with our fingers. Surgery on data also requires specialized tools that offer unusual capabilities or guarantee certain behaviors! However, most medicine is not surgery, and we do ourselves a disservice if we accept that sort of complexity into every situation involving data.

Flat Data aims to simplify everyday data acquisition and cleanup tasks. It runs on GitHub Actions, so there's no infrastructure to provision and monitor. Each Flat workflow fetches the data you specify, and optionally executes a postprocessing script on the fetched data. The resulting data is committed to your repository only if it differs from what's already there, with a commit message summarizing the changes. Flat workflows usually run on a periodic timer, but can also fire on other events, like pushes to your code or manual runs. That's it! No complicated job dependency graphs or orchestrators. No dependencies, libraries, or package managers. No new mental model to learn and incorporate. Just evergreen data, right in your repo.

[Diagram: multiple data sources feed a Flat workflow in a GitHub repo — data.json is fetched, process.js postprocesses it into processed data.json — and the committed results flow on to Flat Viewer, your app, or your database.]

Flat stands squarely on the shoulders of the git scraping pattern first articulated by Simon Willison. It makes this pattern dead simple for developers and accessible to other audiences that are already using GitHub to share and work with data, like journalists and scientists. With Codespaces, you can author Flat workflows and develop applications that use that data without ever leaving your browser, and without that data ever reaching your machine. If you have a static webapp, you can power it with data that is kept evergreen by Flat (we built a simple example using GitHub Pages). You can use Flat to monitor systems in an auditable fashion, using git commits to track changes. Flat works for both textual and binary data, and is great for keeping test fixtures up to date.
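As a sketch of the static-webapp case, a page could load the Flat-maintained file at runtime from the repository's raw-content URL (the repo name, branch, and path below are hypothetical):

```typescript
// Build the raw-content URL for a file that Flat keeps evergreen.
// Repo, branch, and path are illustrative, not a real project.
export function rawUrl(repo: string, path: string, branch = "main"): string {
  return `https://raw.githubusercontent.com/${repo}/${branch}/${path}`;
}

// A static page could then fetch the latest committed data on load.
// (Cast via globalThis so the sketch compiles without DOM typings.)
export async function loadLatest(repo: string, path: string): Promise<unknown> {
  const res = await (globalThis as any).fetch(rawUrl(repo, path));
  if (!res.ok) throw new Error(`failed to load ${path}: ${res.status}`);
  return res.json();
}
```

Because Flat only commits when the data changes, each page load sees the most recent distinct snapshot, and the repo history doubles as an audit log.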

To abuse the first law of thermodynamics, you can't eliminate the need to fetch data in apps, but you can shift that responsibility around. Instead of assembling data at runtime, we try to think of the ideal dataset that we want to use, and let Flat materialize that dataset for us.

We're excited to see what you do with it!

Feedback

We'd love to hear how you're using (or would like to use) Flat Data! Tweet us at @githubOCTO or start a thread in Flat Data discussions.

Right now, you can configure Flat to fetch data from either an HTTP endpoint or a SQL database, but we would love help extending that list! Click that fork button, add support for a different backend, and send us a PR!

✌️ ❤️

GitHub OCTO
Developer Experience Team
