Script day: persistent memoize in bash

One type of task I often find myself implementing as a bash script is to periodically generate some data and display or operate on it – maybe through a cron job, watch, or simply a loop. Sometimes part of the process is an expensive computation (it could be network based, IO intensive, or simply subject to throttling by another entity). The way to deal with this in modern programming languages is a caching technique known as “memoization” (based on the word “memorandum”), in which the result of an expensive call is retained in memory after the first time and returned for future calls instead of re-running the expensive calculation. We also need to clear the cache every once in a while, but that’s a separate issue.
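For illustration, here is a minimal in-memory sketch of the technique (the function names are my own, hypothetical ones); it caches results in a bash associative array, which only lives as long as the shell process:

```shell
#!/usr/bin/env bash
# Minimal in-memory memoization sketch (bash 4+ for associative arrays).
# slow_square stands in for any expensive computation.
declare -A memo

slow_square() {
    sleep 0.1               # pretend this is expensive
    echo $(( $1 * $1 ))
}

memoized_square() {
    local key=$1
    if [ -z "${memo[$key]+set}" ]; then
        memo[$key]=$(slow_square "$key")   # compute once
    fi
    echo "${memo[$key]}"                   # cheap on every later call
}

memoized_square 7   # computes: 49
memoized_square 7   # cached:   49
```

This is the classic form of memoization, but the cache dies with the process – which leads to the next point.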

So, how do we implement this in bash?

One of the main problems with just putting the data in a local variable and returning it is that often the script is run to generate the data and then terminate, only to be started again a second later. So we want the cache to be persistent – so that the next process can take advantage of it.

Normally that type of cache is stored on disk, but that has some disadvantages – you need to prepare a directory for the file, make sure permissions are correct, and handle cleanup. Instead we’ll use the default memory file system on Linux, which is available at /dev/shm. The content of that directory gets cleared on reboot as it isn’t really stored anywhere, but it is persistent enough for our use.
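As a quick sanity check (my own addition, assuming a Linux system with /proc mounted), you can verify that /dev/shm really is a memory-backed tmpfs before relying on it:

```shell
# Confirm /dev/shm is a tmpfs mount; fall back to /tmp otherwise.
# (Assumption: Linux with /proc/mounts available.)
if [ -d /dev/shm ] && grep -q ' /dev/shm tmpfs ' /proc/mounts; then
    cache_dir=/dev/shm
else
    cache_dir=/tmp      # disk-backed, but still works
fi
echo "using $cache_dir for cache files"
```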

So how would the code look? It’s pretty simple – assume we have a function that generates some output (to standard output):

function get_data() {
    expensive_process
}

So instead of running our expensive process each time, we can cache the results very simply:

function get_data() {
    local cache_file=/dev/shm/get-data.cache
    if [ -f "$cache_file" ]; then
        cat "$cache_file"
    else
        expensive_process | tee "$cache_file"
    fi
}

So now get_data will run the expensive process once, and then return the cached results for every call after that. You probably want to recompute the expensive result every once in a while without rebooting the system, so you may want some logic to delete the cache file from time to time.
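One way to do that externally (my own sketch, not from the post – it assumes cache files are named with a .cache suffix) is a periodic find(1) invocation, e.g. from cron; the demo below uses a temporary directory in place of /dev/shm:

```shell
# Delete cache files not modified within the last minute, so the next
# get_data call recomputes. A temp dir stands in for /dev/shm here.
cache_dir=$(mktemp -d)
touch -d '10 minutes ago' "$cache_dir/get-data.cache"   # a stale cache
touch "$cache_dir/fresh.cache"                          # a fresh cache
find "$cache_dir" -maxdepth 1 -name '*.cache' -mmin +1 -delete
ls "$cache_dir"     # only fresh.cache remains
```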

Another way to clear the cache is to have get_data clear it itself after some time has passed, by looking at the modification time of the cache file:

function get_data() {
    local cache_file=/dev/shm/get-data.cache
    if [ -f "$cache_file" ] && [ "$(date +%s -r "$cache_file")" -gt "$(( $(date +%s) - 60 ))" ]; then
        cat "$cache_file"
    else
        expensive_process | tee "$cache_file"
    fi
}

So now the expensive process will run at most once every 60 seconds, which is basically what we wanted.
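The pattern generalizes; here is a sketch of a reusable wrapper (my own, not from the post) that memoizes any command’s stdout under a caller-supplied name and TTL, using GNU date’s -r option to read the cache file’s modification time:

```shell
# memoize TTL NAME CMD [ARGS...] -- cache CMD's stdout in /dev/shm
# for TTL seconds under NAME.cache. Assumes GNU date, whose -r option
# prints a file's modification time (here as an epoch timestamp).
memoize() {
    local ttl=$1 name=$2; shift 2
    local cache_file="/dev/shm/$name.cache"
    if [ -f "$cache_file" ] &&
       [ "$(date +%s -r "$cache_file")" -gt "$(( $(date +%s) - ttl ))" ]; then
        cat "$cache_file"           # fresh enough: serve the cache
    else
        "$@" | tee "$cache_file"    # recompute and refresh the cache
    fi
}

# usage: memoize 60 get-data expensive_process
```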
