If you want to discover package.json files for JavaScript projects that are not NPM libraries, how would you do it?
That's the problem we set out to solve when we created a large simulation of the StackAid platform using real-world projects. Let's have a look at the problems and the ways around them.
To start, most of the package.json files we’re looking for will be on GitHub, so let’s start with GitHub Search. We could search for package.json files and iterate over the results. Let’s try it:
Wow, 460M results. That number seems high, but that doesn't matter for our purposes.
Unfortunately, GitHub Search returns package-lock.json files in the search results, as well as package.json files found in node_modules directories. Also, as shown in the screenshots, the search results aren't stable, and sometimes we don't get a full page of results. On top of that, GitHub Advanced Search doesn't expose the query operations that would allow us to narrow our search results. For more details, take a look at GitHub's search code documentation.
What now?
GitHub is not the only company that indexes GitHub repositories. Sourcegraph offers a powerful search engine for source code and a free service for searching public repositories on GitHub.
Let’s give that a shot. Sourcegraph supports many search operators. In our query, we’re looking for package.json files but excluding any found in node_modules directories.
Here's the Sourcegraph query:
The query does a number of things to help narrow down the results:
- Excludes package.json files found in dot/hidden directories.
- Excludes files found in directories such as node_modules, tests, and examples.
- Excludes archived and forked repositories.
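The path rules in the list above can be sketched as a plain predicate. This is not the Sourcegraph query itself, only a minimal Python illustration of the same exclusions applied to a file path; the function name `keep_path` and the `EXCLUDED_DIRS` set are hypothetical, and the archived/forked rule is not covered because it depends on repository metadata rather than the path.

```python
from pathlib import PurePosixPath

# Directory names to skip. node_modules, tests and examples come from the
# article; treating them as an exact-match set is an illustrative assumption.
EXCLUDED_DIRS = {"node_modules", "tests", "examples"}


def keep_path(path: str) -> bool:
    """Return True if `path` is a package.json outside excluded directories."""
    parts = PurePosixPath(path).parts
    # Only files named exactly package.json qualify (rejects package-lock.json).
    if not parts or parts[-1] != "package.json":
        return False
    for directory in parts[:-1]:
        # Skip dot/hidden directories and the explicitly excluded ones.
        if directory.startswith(".") or directory in EXCLUDED_DIRS:
            return False
    return True
```

A predicate like this can also serve as a sanity check on downloaded results, in case any unwanted paths slip through the search.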
Here are the results of the query on Sourcegraph.com:
Several things to note about the results. The search results are stable and sorted by popularity of the repositories. When specifying count:all, Sourcegraph does an exhaustive search and returns the total match count, 1.3M results, from their search index. The exhaustive search on Sourcegraph takes approximately 25 seconds. Executing and downloading the results from Sourcegraph using their CLI takes only 36 seconds for me.
Before moving on, there's one other thing to note about the Sourcegraph results. On the right side of the results page is a module that allows grouping results. The default is a grouping by repository. Grouping this way made it obvious to us that a number of repositories contain many package.json files, which is something to consider when sampling from the result set.
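To make the sampling concern concrete, here is a minimal sketch that groups results by repository and picks one package.json per repository, so that repositories with many manifests do not dominate a sample. The `(repository, path)` tuple shape and both function names are assumptions for illustration, not the format of the Sourcegraph export.

```python
import random
from collections import defaultdict


def group_by_repository(results):
    """Map each repository to the list of package.json paths found in it."""
    grouped = defaultdict(list)
    for repository, path in results:
        grouped[repository].append(path)
    return dict(grouped)


def sample_one_per_repository(results, seed=0):
    """Pick one path per repository using a seeded, repeatable choice."""
    rng = random.Random(seed)
    grouped = group_by_repository(results)
    # Sorting the paths first keeps the choice independent of input order.
    return {repo: rng.choice(sorted(paths)) for repo, paths in grouped.items()}
```

Whether one manifest per repository is the right unit depends on the question being asked; counting every workspace package in a monorepo may be just as valid.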
Finding non-NPM repositories
Every result from Sourcegraph includes the containing repository. Thankfully, we also have an exported list of GitHub repositories for all NPM packages. Our task is to whittle the Sourcegraph results down to the repositories that are not in the exported list of NPM package repositories.
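The filtering described here is essentially a set difference. A minimal sketch, assuming each search result is a dict with a `repository` key and that repository references may appear in slightly different forms (with or without a scheme or a `.git` suffix); the field name and both function names are hypothetical.

```python
def normalize_repository(name: str) -> str:
    """Reduce a repository reference to a lowercase 'owner/repo' key."""
    name = name.strip().lower()
    marker = "github.com/"
    index = name.find(marker)
    if index != -1:
        # Drop the scheme and host, keeping only what follows github.com/.
        name = name[index + len(marker):]
    name = name.rstrip("/")
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


def remove_npm_repositories(search_results, npm_repositories):
    """Keep only results whose repository is not a known NPM package repo."""
    npm_keys = {normalize_repository(repo) for repo in npm_repositories}
    return [
        result
        for result in search_results
        if normalize_repository(result["repository"]) not in npm_keys
    ]
```

Normalizing both sides before comparing matters because the same repository can be written in several ways across the two lists.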
Putting it all together
Our tools:
- Taskfile
- Benthos
- jq
- nsq
- sqlite_utils
- Docker Compose
The steps:
- Search: Query Sourcegraph and save the results.
- Filter: Remove NPM package repositories from the Sourcegraph results.
- Fetch: Benthos consumes the search results, fetches package.json files from GitHub, and publishes them to another NSQ topic.
- Persist: Save package.json files to disk.
- Transform: Create a SQLite DB from filtered search results and fetched package.json files.
The collection pipeline is a series of shell commands executed in order. Our preferred tool is Taskfile instead of bash scripts or Makefiles. The whole pipeline can be executed via Docker Compose to save you the trouble of installing a bunch of utilities.
While there are many frameworks for crawling, the majority of the work bookended the collection step, and so using anything more than Benthos and NSQ felt like overkill. Your mileage may vary for your use case.
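For readers who want to see the shape of the Fetch step without Benthos and NSQ, here is a rough Python stand-in, not the pipeline's actual configuration. It assumes repository names look like `github.com/owner/repo`, that files are read from raw.githubusercontent.com at the default branch via `HEAD`, and that a fixed delay between requests is an acceptable way to throttle; `build_raw_url`, `Throttle` and `fetch_package_json` are hypothetical names.

```python
import time
import urllib.request
from urllib.parse import quote


def build_raw_url(repository: str, path: str, ref: str = "HEAD") -> str:
    """Build a raw.githubusercontent.com URL for a file in a GitHub repo."""
    owner_repo = repository.removeprefix("github.com/")
    return f"https://raw.githubusercontent.com/{owner_repo}/{ref}/{quote(path)}"


class Throttle:
    """Space calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def fetch_package_json(repository, path, throttle):
    """Fetch one package.json as text, waiting on the throttle first."""
    throttle.wait()
    url = build_raw_url(repository, path)
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```

The clock and sleep functions are injectable so the throttling logic can be exercised without real delays or network access.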
Our project for querying Sourcegraph and collecting the package.json files from GitHub is open source and available at github.com/stackaid/non-npm-projects.
The results
We rate limit ourselves to be kind to GitHub, and so the full collection takes about 12 hours. To save you the hassle, we're making a snapshot available using Datasette on Fly.io. You can download the entire database or query the collection.
On the packages database overview page, you will see 3 tables.
- npm_package_repositories: A snapshot of all GitHub repositories for NPM packages
- non_npm_packages: All the collected package.json files and where they came from.
- dependencies: Dependencies and dev dependencies extracted from non_npm_packages to make querying across packages easier.
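As an illustration of how a dependencies table like this could be derived, here is a minimal sketch that flattens the `dependencies` and `devDependencies` fields of a package.json into rows. The row shape `(name, constraint, kind)` and the `prod`/`dev` labels are assumptions, not the snapshot's actual schema.

```python
import json


def extract_dependencies(package_json_text: str):
    """Return (name, constraint, kind) rows from a package.json document."""
    try:
        manifest = json.loads(package_json_text)
    except json.JSONDecodeError:
        return []  # real-world manifests are not always valid JSON
    if not isinstance(manifest, dict):
        return []
    rows = []
    for field, kind in (("dependencies", "prod"), ("devDependencies", "dev")):
        section = manifest.get(field)
        if not isinstance(section, dict):
            continue  # missing or malformed section
        for name, constraint in section.items():
            rows.append((name, str(constraint), kind))
    return rows
```

The defensive checks are there because manifests collected in the wild can be malformed in ways a published NPM package rarely is.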
At the bottom of the overview page is a link to download the entire 3.3GB database.
Here's a preview of the non_npm_packages table:
So what can you do with this data?
- Count of projects with and without tests
- Percentage of projects using Grunt
- How often is NextJS used with Axios
- Most popular dependency and semver constraint combinations
- Most popular co-dependencies for React
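As a starting point for the co-dependency question, here is a sketch of a self-join over a dependencies table using Python's built-in sqlite3 module. The column names (`package_id`, `name`, `version`) and the demo rows are assumptions made for this example; check the real schema in the downloaded database before adapting the query.

```python
import sqlite3

# Hypothetical schema: dependencies(package_id, name, version).
CO_DEPENDENCY_SQL = """
SELECT other.name AS co_dependency,
       COUNT(DISTINCT other.package_id) AS projects
FROM dependencies AS target
JOIN dependencies AS other
  ON other.package_id = target.package_id
 AND other.name != target.name
WHERE target.name = ?
GROUP BY other.name
ORDER BY projects DESC, co_dependency ASC
LIMIT ?
"""


def top_co_dependencies(connection, name, limit=10):
    """Return (dependency, project_count) pairs used alongside `name`."""
    return connection.execute(CO_DEPENDENCY_SQL, (name, limit)).fetchall()


def build_demo_database():
    """Create a tiny in-memory database with made-up rows for testing."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE dependencies (package_id INTEGER, name TEXT, version TEXT)"
    )
    connection.executemany(
        "INSERT INTO dependencies VALUES (?, ?, ?)",
        [
            (1, "react", "^18.0.0"),
            (1, "react-dom", "^18.0.0"),
            (1, "axios", "^1.0.0"),
            (2, "react", "^17.0.0"),
            (2, "react-dom", "^17.0.0"),
            (3, "express", "^4.18.0"),
            (3, "axios", "^1.0.0"),
        ],
    )
    return connection
```

The same query text can be pasted into Datasette's SQL editor once the table and column names are adjusted to match the snapshot.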
We hope the data is useful to you. If you dig into the data, let us know what you find! Big thanks to Sourcegraph and all the other utilities that we used to create our collection pipeline. As always, we fund all projects used here with StackAid.