Analyzing Advent of Code

Posted on Jan 5, 2024

2023 came with another edition of Advent of Code, an advent calendar with programming challenges.

In many ways - e.g. the assessment of both performance of and correctness on official leaderboards - the programming challenges are similar to regular programming competitions. Yet, there is also a large part of the community approaching Advent of Code more casually: people playing around with funky programming languages (e.g. Nix or Nim) or people streaming their daily programming sessions.

Given that lots of data around Advent of Code exists online, I sought out to answer theses questions:

How does the number of successful leaderboard submissions evolve over advent days and years?
Where do participants come from?
Which programming languages do people use?

1. Number of successful submissions

The Advent of Code website indicates the number of successful submissions per advent day, for each year. Importantly, there are usually two parts to a problem. We’re looking at the submissions which were successful in both parts.

Results

The left-hand side plots show the absolute number of successful submissions - once on a linear scale (top) and once on a log scale (bottom). The right-hand side plots indicate the number of successful submissions on a given day, divided by the successful submissions of the first day of that year. Again, we see a linear scale (top) and a log scale (bottom).

Given that the curves look fairly linear with a log scale, it seems fair to say that the number of successful submissions is exponentially decreasing from day to day. I’d be curious to better understand the mechanisms at work here. Rumor has it that the challenges are meant to become harder over time. Yet, the trend could also be caused by inherent funnel shapes of sequences - many of those who start, stop at some point; few join late.

Yet, we also see that the trend is not monotone - days 5 and 6 of 2023 are a counterexample.

2. Locations

The Advent of Code website has an official leaderboard of the 100 people who have accrued the highest scores by consistently submitting fast solutions fast. Some of these users have linked GitHub account to their Advent of Code profile. Some of these GitHub accounts indicate a location. Importantly, these location indications are not only optional but also free text. Hence they could be anything from “Lyon” to “United States”.

Results

Plotting the geo cordinates of these locations on a map yields the following:

Importantly, we are not truly answering the original question of where the best participants come from. Rather, we answer a proxy question of where the successful participants who feel like sharing their location claim that they are located.

Methodology

First, I fetched all GitHub usernames from the leaderboards of 2021, 2022 and 2023. Then, I extracted the locations - if available - from their GitHub profiles. In order to turn the location text into geo coordinates, I used the Google Maps Geocoding API. Given a text query, it responds with geo coordinates it thinks are most likely to best represent the query. These coordinates then flowed into a plot created with geopandas.

3. Languages

Since Advent of Code doesn’t confine the space of programming languages I was curious to look into which programming languages have been used by the larger community. Hence, I looked for public GitHub repositories which seemed associated with Advent of Code. Since GitHub provides handy metadata on the programming languages used within a repository, I just needed to extract those.

Results

I can’t quite decide whether I feel surprised or not.

I would also be interested in seeing whether the empirical distribution over progamming languages used by the people ranking high on the leaderboard - in stark contrast to the overall community - are indeed what I would expect: a single bar at C++.

Methodology

Clearly, there are countless ways to go about this. Mine was to simply “search” for “Advent of Code” with GitHub’s repository search API. Some alternatives that came to mind included:

trying to further specify a year as part of the query to capture recent development
filtering repositories for first commit date to be no earlier than December of the year of interest
adding additional keywords such as “aoc”
using GitHub’s topics

I didn’t notice any clear trade-offs between these approaches and therefore stuck with the appraoch described above. I’m sure investigating this further could improve the quality of the conclusion.

Moreover, GitHub often provides several languages used in a repository, quantified in bytes of code. I arbitrarily chose to only regard the top-most used repository language. Moreover - in lign with Occam’s razor since I didn’t have a prior convincing me otherwise - I decided not to weigh repositories, e.g. by their amount of lines of code, commits or other.

Importantly, there is a GitHub rate limit of 5000 requests per hour. Since I did this on a whim I actually limited myself to these 5000 requests.

All code for this analysis can be found here.