Background Tasks as Simple Web Services

deadoralive is a broken link checker service that works with any site that implements its simple API. ckanext-deadoralive is a CKAN plugin that implements the API for CKAN sites. Together they demonstrate an approach to implementing background tasks as simple web services.

When designing Dead or Alive we wanted a way to implement a background task that would:

Be easy to install and maintain
Be easy to implement
Provide the background task as a service that can also be used by non-CKAN sites

(Thanks to Ross for the idea)

Examples of background tasks that you might want for CKAN include:

Checking a site for broken links (ckanext-qa)
Automatically rating the datasets on a site according to Tim Berners-Lee’s five stars of openness (ckanext-qa again)
Downloading archive copies of all the resource files a site links to (ckanext-archiver)
“Harvesting” datasets from one CKAN site into another (ckanext-harvest)
Parsing a site’s tabular data files and pushing the data into CKAN’s DataStore (DataPusher)
Analytics
Sending email notifications

Previous approaches to background tasks in CKAN include:

Extensions that use Celery. This requires running a Celery daemon in development and setting up Supervisor to run Celery in production, and then monitoring it over time.
A paster command meant to be run as a cron job. The background job has to run on the same server as CKAN and has to be setup separately for each CKAN site.
CKAN Service Provider apps. These are quite heavy to implement and maintain, each app having its own user accounts and authorization, database, job status tracking, etc.

Simple Install

Our approach minimizes both setup work for the user and implementation work for the developer. It uses two parts:

A CKAN plugin
A simple, stateless Python script, designed to be run on the command-line or by cron, that provides link checking as a service

(The plan is to turn the link checker into an actual web service, but for now a cron job is much simpler and gets us 90% of the way there.)

To install the CKAN plugin, you do:

$ pip install ckanext-deadoralive

To install the link checker:

$ pip install deadoralive

Then add a couple of settings to your CKAN config file and you’re done, there’s no extra setup steps or non-Python dependencies. (See the README for full instructions.)

The link checker can be installed on the same machine that CKAN is on or on a different machine. Since it does all its communication with CKAN over the API, it doesn’t matter. (You can even have the link checker running automatically on the server, but also install it on your laptop and sometimes run it manually from there.)

Not just for CKAN

The link checker service is not CKAN specific. Any site that implements the simple API that the service requires can use it and receive link check results. ckanext-deadoralive provides a reference implementation.

The plugin and service communicate with each other over HTTP, using this protocol:

Link checker service                                            Client site

deadoralive                                                     ckanext-deadoralive
                - Give me up to 50 resources to check ------->
                <-------------------- [<id_1>, <id_2>, ... ] -  (Gets resource IDs from db)

                - What's the URL for resource <id_1>? ------->
                <---- The URL for resource <id_1> is <url_1> -  (Gets URL from db)
(Checks <url_1>) - Resource <id_1>'s URL is broken ----------->  (Saves result)

                - What's the URL for resource <id_2>? ------->
                <---- The URL for resource <id_2> is <url_2> -  (Gets URL from db)
(Checks <url_2>) - Resource <id_2>'s URL is working --------->  (Saves result)

            ...

deadoralive continues the loop of asking for the URL for the next resource ID, checking the URL, then posting the result until it has checked all 50 URLs.

The next time deadoralive is run (either manually or by cron) it starts all over again by asking for another 50 resources to check.

Once all of the site’s resources have been checked, deadoralive will start getting empty lists back when it asks for more resources to check. After 24 hours (configurable) ckanext-deadoralive will start handing out the already-checked resources again to be re-checked.

ckanext-deadoralive then implements some logic on top of the link checker results it has saved in the db: if a link has been checked and found to be broken at least three times consecutively over a period of at least three days, the link is marked as broken. Various broken link reports are added to CKAN based on this.

ckanext-deadoralive is smart about which links it hands out to be checked next:

Older unchecked resources are checked before newer ones.
Checked resources are re-checked after 24 hours.
Resources that haven’t been checked in a long time are re-checked before ones that have been checked more recently.
New resources are checked before re-checking old ones.

The link checker service is as dumb as possible: it just keeps asking for more links and checking them.

We’ll Call You

When you think of a web service you probably think of something that provides an API, and you write code that calls that API. Dead or Alive works the other way round - you implement an API that it specifies, and it calls you. This keeps everything much simpler. If we had the client site calling the service rather than the other way round, then:

The CKAN plugin would need to call the service hourly to have links be re-checked every hour. So we’d need some sort of task queue or scheduler on the CKAN end, adding setup.
The CKAN plugin would need to know which link checker service it gave each link check task to (in our implementation CKAN doesn’t need to know anything about the link checkers).
The service would need to implement authentication (e.g. with user accounts and API keys) duplicating something that CKAN (or any client site using the service) already implements.

When the service makes requests to CKAN it sends along a CKAN API key (you configure it with the API key of an authorized CKAN user), so it’s just using CKAN’s existing authorization code.

Separate State and Asynchronousness

A key aspect of the design is keeping state and asynchronousness separate:

All of the state is in the CKAN plugin. CKAN already provides a database that plugins can use without any extra setup from the user, so we use the existing CKAN database for all storage. This keeps the link checker service simple - no state, no storage, no database to setup.

All of the asynchronousness and periodicity is in the service. Running tasks asynchronously or at periodic time intervals is what CKAN (or a CKAN plugin) can’t do without extra setup, so we avoid any asynchronous or periodic work on the CKAN end.

We keep the link checker service as simple as possible. It does the periodic and asynchronous tasks and nothing else. It leans on the CKAN plugin for everything else.

Failure Resistance not Task Statuses

The protocol is resistant to service failures. CKAN just hands out a resource ID to be checked and expects to be called back with the result. If it doesn’t get a call back after a while, it will just hand the resource ID out to another link checker request when one comes along.

We don’t need task statuses. There’s no API for the client site to query the status of a task (queued, running, succeeded, failed…), for example. To implement that API we’d need a periodic task on the CKAN end to call the API and get updates about the statuses of tasks. And on the service end we’d have to store the statuses of tasks to be able to report them. That would break the rule of keeping all the asynchronous and periodic stuff in the service, and all the state in CKAN. It would complicate both ends.

Distributed Task Processing

The protocol supports distributed tasks: a single client site can be checked by multiple link checkers, and a single link checker can check multiple sites. Each link checker just needs to know about the site(s) it should check. The sites don’t need to know about the link checkers, and the link checkers don’t need to know about each other.

This video shows CKAN running in the top-left, and three link checker processes running in the other three corners. Each process requests 50 links to check from CKAN, and together they check 150 (giant gif warning):

Distributed link checking

In this case all three link checker processes and CKAN are running on the same machine. But it would work exactly the same if they were all on different machines.

ckanext-deadoralive hands out different resource IDs to different link checkers without needing to know how many link checkers there are or needing to be able to distinguish one link checker from another. Once it’s handed out a resource ID to a link checker request and is expecting a result back, it won’t hand out that same resource ID to another request until enough time passes that it decides a result probably isn’t coming. (And if it does end up getting two results back for the same link, this doesn’t do any harm.)

The link checker itself has no state, so there’s no reason not to run more than one of it.

Conclusion

This approach should be applicable to all kinds of background services for CKAN:

Implement an asynchronous and/or periodic task as a simple, standalone web service. Not CKAN-specific.
Implement a CKAN plugin to integrate CKAN sites with the service.
Put all the state in the CKAN plugin, and do all the asynchronous and periodic work in the service.