ekidd
I've built three distributed job systems at this point. A handy rule of thumb which I have promoted for years is "build for 10x your current scale."

If you need to handle 70 requests/second, design for 700. If you need to handle 20 servers running batch jobs, design for 200 servers. If you're in a startup that grows 100% per year, you'll be at 8x scale in 3 years. So you'll have time to rewrite as you grow!

Of the three job systems I built, the first one tried to avoid SQL for "scalability." Then we hit a bunch of edge cases that needed transactional integrity, and we were in a world of pain.

My two more recent distributed job systems use PostgreSQL as a coordinator. They're literally built around SELECT FOR UPDATE SKIP LOCKED. One of them routinely controls 350 workers and the other does elaborate prioritization for thousands of jobs. Both of them will continue to run just fine until they're earning millions of dollars a year—for example, the one controlling 350 workers should scale to about 2,000 CPUs with a little work.

Hyperscale technologies are fairly cheap right up until you discover you need transactions. At that point, faking transactional semantics on top of an eventually consistent data store becomes an engineering nightmare.

So sit down, and do the math. If your company was earning $100 million/year, how big would your distributed system need to be? Can you easily get a PostgreSQL instance big enough to handle that load? (Or could you just shard per client?) If so, strongly consider using PostgreSQL. It makes a hundred things trivial.

aduffy
For several projects I’ve opted for the even dumber approach that works out of the box with every ORM/query-DSL framework in every language: using a normal table with SELECT FOR UPDATE SKIP LOCKED.

https://www.pgcasts.com/episodes/the-skip-locked-feature-in-...

It’s not “web scale” but it easily extends to several thousand background jobs in my experience
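
In case it's useful, the whole trick is small enough to sketch with plain psycopg2 (table and column names are made up; recovering jobs stuck in 'running' after a worker crash is left out):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id      bigserial PRIMARY KEY,
            payload jsonb NOT NULL,
            status  text NOT NULL DEFAULT 'ready',
            created timestamptz NOT NULL DEFAULT now()
        );
    """)
    conn.commit()

    # Claim one job: concurrent workers skip rows someone else has
    # locked instead of blocking on them.
    cur.execute("""
        UPDATE jobs SET status = 'running'
        WHERE id = (
            SELECT id FROM jobs
            WHERE status = 'ready'
            ORDER BY created
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id, payload;
    """)
    job = cur.fetchone()  # None when the queue is empty
    conn.commit()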

ritzaco
I often see the "engineers copy FAANG infrastructure because they want to be cool, even though their needs are completely different" take as a kind of attack on engineers.

But I think a lot of it is also about knowledge and documentation. If I want to copy FAANG or another startup and set up an infinitely scalable queue-based architecture, I can find dozens of high-quality guides, tutorials, white papers, etc. showing me exactly how to do it. Yes, maintenance is higher, but I can get set up with Redis, SQS, or any of the 'scalable' solutions within a few hours of copy-pasting commands, code, and configuration from a reputable source.

If I want to use NOTIFY in Postgres? I google "SQLAlchemy notify listen postgres" and find a few unanswered Stack Overflow questions and a GitHub gist that has some code but no context.
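
(For what it's worth, the gist boiled down to roughly this - untested by me, and the channel name is made up:)

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app user=app")
    # Autocommit, so the LISTEN isn't stuck inside an idle transaction.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN new_job;")  # made-up channel name

    while True:
        # Wait until the connection's socket is readable (5s timeout),
        # then drain whatever notifications have arrived.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timeout, loop again
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("got NOTIFY:", notify.pid, notify.channel, notify.payload)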

I would honestly love to use this approach for a side project, but I don't have 2-3 days to figure it out on my own. The direct choice for me might seem to be

* simple, but not scalable (ie just use postgres)

* complex, but scalable (ie redis, sqs, whatever)

and then it's a tradeoff, and the argument goes that I am blinded by cool tech and FAANG and I'm choosing complex but scalable, even though I don't need scalable.

But taking into account guides and other resources, the choice for me is actually

* complex and not scalable (this, because I don't know how to implement it and I can't predict what pitfalls I might face if I try)

* simple and scalable (what everyone actually does)

and that makes the engineer's choice to follow FAANG look a lot more reasonable.

jpgvm
A few things.

1. The main downside to using PostgreSQL as a pub/sub bus with LISTEN/NOTIFY is that LISTEN is a session feature, making it incompatible with statement-level connection pooling.

2. If you are going to do this, use advisory locks [0]. Other forms of explicit locking put more pressure on the database, while advisory locks are deliberately very lightweight.
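
The session-level API is tiny - a bare sketch (que's real implementation is more elaborate):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()

    def try_process(job_id):
        # Advisory locks are keyed by a 64-bit int and held in memory by
        # the session - no row lock, no disk write.
        cur.execute("SELECT pg_try_advisory_lock(%s);", (job_id,))
        (got_it,) = cur.fetchone()
        if not got_it:
            return False  # another worker holds it; try a different job
        try:
            # ... do the work for job_id ...
            return True
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s);", (job_id,))
            conn.commit()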

My favorite example implementation is que [1] which is ported to several languages.

[0] https://www.postgresql.org/docs/current/explicit-locking.htm...

[1] https://github.com/que-rb/que

mianos
Skype used Postgres as a queue, with a small plugin, to process all their CDRs many years ago. I have no idea if it's used these days, but it was 'web scale' 10 years ago. Just working, while people on the internet argued that using a database as a queue is an anti-pattern.

Having transactions is quite handy.

https://wiki.postgresql.org/wiki/SkyTools

I did a few talks on this at Sydpy as I used it at work quite a bit. It's handy when you already have postgresql running well and supported.

This said, I'd use a dedicated queue these days. Anything but RabbitMQ.

5id
One of the biggest benefits, imo, of using Postgres as your application queue is that any async work you schedule benefits from transactionality.

That is, say you have a relatively complex backend mutation that needs to schedule some async work (e.g. sending an email after signup). With a Postgres queue, if you insert the job to send the email and then, in a later part of the transaction, something fails and the transaction rolls back, the email is never queued to be sent.
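
A sketch of that flow with psycopg2 (hypothetical signup tables; `with conn` commits on success and rolls back on any exception):

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=app user=app")

    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (email) VALUES (%s) RETURNING id;",
                ("new@example.com",),
            )
            (user_id,) = cur.fetchone()
            # The job row rides along in the same transaction...
            cur.execute(
                "INSERT INTO jobs (payload) VALUES (%s);",
                (Json({"type": "welcome_email", "user_id": user_id}),),
            )
            # ... so if anything later in the mutation raises, BOTH
            # inserts roll back and no worker ever sees the job.
    except Exception:
        pass  # rolled back: no user row, no queued email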

Ozzie_osman
One thing I love about Kafka is... It's just an append-only log, and a client is essentially just holding an offset. This is conceptually very simple to reason about. It's also persistent and pretty fault-tolerant (you can just go back and read any offset).

Unfortunately, Kafka carries with it enough complexity (due to the distributed nature) that it ends up not being worth it for most use-cases.

Personally I'd love something similar that's easier to operate. You'd probably be able to handle hundreds (if not thousands) of events per second on a single node, and without distributed complexity it'd be really nice.

And yes, in theory you could still use postgres for this (and just never delete rows). And maybe that's the answer.
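
A rough sketch of that shape (made-up names; note bigserial values can commit out of order under concurrency, so this is a toy, not a drop-in Kafka replacement):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS event_log (
            offset_id bigserial PRIMARY KEY,
            payload   jsonb NOT NULL
        );
        CREATE TABLE IF NOT EXISTS consumer_offsets (
            consumer  text PRIMARY KEY,
            offset_id bigint NOT NULL DEFAULT 0
        );
        INSERT INTO consumer_offsets (consumer) VALUES ('mailer')
        ON CONFLICT DO NOTHING;
    """)
    conn.commit()

    # A consumer just remembers how far it has read; rows are never
    # deleted, so it can always go back and re-read any offset.
    cur.execute("""
        SELECT offset_id, payload FROM event_log
        WHERE offset_id > (SELECT offset_id FROM consumer_offsets
                           WHERE consumer = %s)
        ORDER BY offset_id LIMIT 100;
    """, ("mailer",))
    events = cur.fetchall()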

rubenfiszel
We use exactly this for Windmill (OSS Retool alternative + modern Airflow) and run benchmarks every day. On a modest GitHub CI instance where one Windmill worker and Postgres run as containers, our benchmarks run at 1,200 jobs/s. Workers can be added, and it will scale gracefully up to 5,000 jobs/s. We are exploring using Citus to cross the 5,000 jobs/s barrier on our multi-tenant instance.

https://github.com/windmill-labs/windmill/tree/benchmarks

vhiremath4
We used Postgres for some of our queues back when we were at ~10 msg/s. It scaled quite a bit, but honestly, setting up SQS or some other queue stack in AWS, GCP, or Azure is so simple and purpose-built for the task (with dead-letter queues and the like built in) that I don't know why you wouldn't just go that route and not have to worry about that system shitting the bed and affecting the rest of the DB's health.

It seems foolish. I am a big fan of “use the dumbest tool”, but sometimes engineers take it too far and you’re left with the dumbest tool with caveats that don’t seem worth it given the mainstream alternative is relatively cheap and simple.

scraplab
Yep, we process hundreds of thousands and sometimes a few million jobs daily inside Postgres, using Oban in Elixir.

Having transactional semantics around background jobs is incredibly convenient for things like scheduling email only if the transaction is successful, and so on.

You do need to do a little bit of autovacuum tuning, but once sorted it’s been great for us.

dools
I'm always surprised that when I see people talk about queues I never see anyone mention beanstalkd. I've been using it for basically everything for 10 years and it's solid as a rock, incredibly simple and requires basically no maintenance. It Just Works™
MattIPv4
We process around 1 million events a day using a queue like this in Postgres, and have processed over 400 million events since the system this is used in went live. The only issue we've had was slow queries due to the table size, as we keep an archive of all the events processed, but scheduled vacuums every so often kept that under control.
xdanger
I do enjoy using https://github.com/graphile/worker for my postgresql queuing needs. Very scalable, the next release 0.14 even more so, and easy to use.
levkk
Running this exact implementation with 47M jobs processed and counting. SKIP LOCKED is great for VACUUM, and having durable storage with indexes makes otherwise expensive patterns like delayed jobs, retries, status updates, "at least once", etc. really easy to implement.
zacksiri
During my tenure as CTO at a fintech company, I built a banking engine on a Postgres-backed queue system using Elixir / Phoenix. It's still in use today. The company processed large volumes of transactions, and we were able to do things in real time when it came to payments. Our system reached a point where I realized we could scale almost infinitely with just a 2-tier architecture (Elixir / Phoenix / Oban and PostgreSQL).

The industry standard for real-time was anything under 30s end-to-end. We hit many scaling limits as our platform grew, but we worked on improving the system instead of treating PostgreSQL as the problem. We measured and measured, and PostgreSQL was never the bottleneck. Most of the problems were due to design decisions in the system or technical debt that needed to be solved. When we solved those problems, the system flew.

There was one incident where our PostgreSQL instance's CPU usage went up to 100% during peak hours and caused problems in the system. We wondered why, and there was a lot of debate around the topic: PostgreSQL doesn't scale, we need a larger instance, we should move to a dedicated queue system, etc.

I didn't believe the hypothesis that PostgreSQL didn't scale, so I conducted a performance audit on the system myself and discovered that one of the engineers had forgotten to add an index on one of the columns of a major table with > 60m records. There was a lot of back and forth between engineers about whether we should add the index. I simply ran a small experiment, and EXPLAIN ANALYZE clearly showed the missing index on one of the key tables. After we added that single line of code and deployed the index, DB CPU usage during peak hours wouldn't even exceed 20%; we had to scale down our DB to save some $$.

A PostgreSQL-backed queue system does work.

I've been building distributed systems at large and small scales for over a decade. Before building systems with Elixir / Phoenix, I worked with Rails, where the default, as mentioned in this article, is to just adopt Redis. While Redis does add overhead in managing the infrastructure, it also worked well.

Having been through both, I'd say a simple PostgreSQL-backed solution is the clear winner when it comes to not having to manage a third dependency on top of an already complex system. There is no 3rd or 4th piece of the puzzle to theorize, hypothesize, or experiment with. You have just 2 pieces: the problem is either your application or your database. That simplicity saved us a lot of time on debates and experiments, and it squashed dogmatic opinions that just didn't hold any water.

nicoburns
For running queues on Postgres with Node.js backend(s), I highly recommend https://github.com/timgit/pg-boss. I'm sure it has its scale limits, but if you're one of the 90% of apps that never need more scale than a modern server can easily handle, it's fantastic. You get transactional queueing of jobs, and it automatically handles syncing across multiple job-processing servers using Postgres locks.
andrewstuart
You don't even need a database to make a message queue. The Linux file system makes a perfectly good basis for a message queue since file moves are atomic.

My guess is that many people are implementing queuing mechanisms just for sending email.

You can see how this works in Arnie SMTP buffer server, a super simple queue just for emails, no database at all, just the file system.

https://github.com/bootrino/arniesmtpbufferserver
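
The core trick fits in a few lines (a rough illustration, not how Arnie specifically does it; both directories must be on the same filesystem for rename() to be atomic):

    import os
    import uuid

    QUEUE_DIR = "/var/spool/myqueue/ready"        # made-up paths
    CLAIM_DIR = "/var/spool/myqueue/in-progress"

    def enqueue(body: bytes) -> None:
        # Write under a dot-prefixed temp name, then rename into place:
        # readers never observe a partially written file.
        tmp = os.path.join(QUEUE_DIR, ".tmp-" + uuid.uuid4().hex)
        with open(tmp, "wb") as f:
            f.write(body)
        os.rename(tmp, os.path.join(QUEUE_DIR, uuid.uuid4().hex))

    def claim():
        # rename() is atomic, so at most one worker wins each file;
        # the losers get an OSError and just try the next one.
        for name in sorted(os.listdir(QUEUE_DIR)):
            if name.startswith("."):
                continue  # skip in-flight temp files
            try:
                dst = os.path.join(CLAIM_DIR, name)
                os.rename(os.path.join(QUEUE_DIR, name), dst)
                return dst  # this worker owns the job now
            except OSError:
                continue  # another worker beat us to it
        return None  # queue empty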

paulddraper
USE. ADVISORY. LOCKS.

Do not use SKIP LOCKED unless it is a toy/low-throughput system.

Row locks require transactions and disk writes.

Advisory locks require neither. (However, you do have to stay inside the configurable memory budget.)

pphysch
Another point is that task queue technology is highly fungible. There's nothing stopping you from starting with cron, adding in Postgres/Redis, then graduating to Kafka or something as need arises. And running all three in parallel, with different jobs in each. I would be surprised if the average Kafka shop didn't also have a bunch of random cron jobs doing things that could be implemented on Kafka or vice versa.

At some point you may want to refactor things to reduce tech debt, but it really is a "and" rather than "or" decision.

codeadict
This is exactly what the Oban (https://getoban.pro/) Elixir library uses, and combining Postgres plus actors for queues scales great for 90% of the needs out there. I have used it at my last few jobs at pretty decent scale, and I would take it over my 10 years of using Celery to manage queues + supervisord, or setting up RabbitMQ or Redis. It's so simple: you only need Elixir and Postgres, not 3 or 4 infrastructure pieces, to manage a queue.
tiagod
Temporal, which AFAIK was made by the Uber Cadence team (who were also involved in SQS), can use Postgres as a backend.

I used it for a web automation system for an accounting client (automatically read files from a network share, look up the clients in a database, submit the documents to government websites using headless browsers, and put the resulting files back in the directory). It allows for completely effortless deterministic programs that call workers running the non-deterministic code, with built-in configurable retries (react to certain exception types, exponential backoff), so you can write code almost as if there were no issues with API connections, the filesystem, etc.

This code has been running for 5 or more years with barely any maintenance, and 0 issues so far. It keeps everything in Postgres, so even full reboots and crashes have no impact; it will just move the work back to the queue, and it will run when there's an available worker.

ukd1
I maintain QueueClassic (https://github.com/QueueClassic/queue_classic) for Rails/Ruby folks, which is basically what you're talking about - a queuing system for Postgres. A bonus reason, and why I originally wanted this, was the ability to use transactions fully - i.e. I can start one, do some stuff, add a job into the queue (to send an email), .....and either commit or roll back, avoiding sending the email. With resque, I found that sometimes either you can't see the record (the transaction is still doing other stuff and hasn't committed) or it's not there (rollback) - so either way you had to deal with it.

QC (and equivs) use the same db, and same connection, so same transaction. Saves quite a bit of cruft.

KronisLV
> For example, this Hacker News comment stated that using Postgres this way is “hacky” and the commenter received no pushback. I found the comment to be a load of BS and straw man arguments. This thinking seems to be “the prevailing wisdom” of the industry – if you want to talk about queue technology in public, it better not be a relational database.

I don't think there's anything wrong with using a database as a queue; however, I think there probably could have been better ways to get the idea across than just dismissing an honest opinion as BS. I don't necessarily agree with all of what was said there, but at the same time I can see why those arguments would be reasonable: https://news.ycombinator.com/item?id=20022572

For example:

> Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly.

To me, this makes perfect sense if you're using the same database instance for the typical RDBMS use case AND for the queue. Then again, that could be avoided by having separate database instances/clusters and treating those as separate services: prod-app-database and prod-queue-database.

That said, using something like RabbitMQ or another specialized queue solution might also have the additional benefit of plenty of tutorials, libraries, and other resources, which is pretty much the case whenever you compare a well-known technology with a more niche one, even when the latter might be in some ways better! After all, there is a reason why many would use Sidekiq, resque, rq, Hangfire, asynq, and the other libraries that were mentioned, which already have lots of content around them.

Though whether the inherent complexity of the system or the complexity of your code that's needed to integrate with it is more important, is probably highly situational.

andrelaszlo
One issue with Redis as a queue backend seems to be that persistence is quite expensive, at least for managed Redis instances. Using PG seems like it could be much cheaper, especially if you already have an instance with room to spare.

I thought it was an interesting article, and I'd love to hear more from people using PG for queues in production (my intuition would say you'd get a lot of table bloat and/or vacuum latency, but I haven't tested it myself), but when it comes to the conclusion - "choosing boring technology should be one’s default choice" - I can't think of anything more boring (in a good sense, mostly) than Sidekiq + Redis for a Rails app.

afhammad
There are a few mentions of Oban [1] here. Most people don't realise that Oban in fact uses SKIP LOCKED [2] as well.

Oban's been great, especially if you pay for Web UI and Pro for the extra features [3]

The main issue we've noticed, though, is that because of its simple lock-based fetching mechanism, jobs aren't distributed evenly across your workers: the `SELECT...LIMIT X` is greedy [2].

If you have long-running and/or resource-intensive jobs, this can be problematic. Let's say you have 3 workers with a local limit of 10 per node. If there are only 10 jobs in the queue, the first node to fetch available jobs will grab and lock all 10, with the other 2 nodes sitting idle.

[1] https://github.com/sorentwo/oban [2] https://github.com/sorentwo/oban/blob/main/lib/oban/engines/... [3] https://getoban.pro/#feature-comparison

figassis
The way I implement my queues (usually as part of my monolith application) is as goroutines. Each instance of the app launches with a unique id and a role: it can be a worker or the app itself. When the app generates a queue item, it simply adds it to a table as pending. A worker will then, in a transaction, update a set of items to add its instance id as well as an expiration for the lock. If that succeeds, no other worker will pull a queue item whose instance id is non-null and different from its own until that lock expires. The worker can then start processing and update item status accordingly. If it crashes, another worker will just repeat the process after the lock expires.

The code that does this is maybe 100 lines at most. It's very effective, especially if you deploy your app on Kubernetes, where you can expect instances to be ephemeral. It's one of the components of my apps that has never needed any updates since I first wrote it circa 2017.
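
The claim step boils down to roughly this SQL (an illustrative sketch in Python, not my actual Go code):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()
    instance_id = "worker-1"  # hypothetical unique id per app instance

    # Stamp a batch of unowned-or-expired items with this instance's id
    # and a lease deadline, all in one atomic statement.
    cur.execute("""
        UPDATE queue_items
        SET claimed_by = %(me)s,
            lease_until = now() + interval '5 minutes'
        WHERE id IN (
            SELECT id FROM queue_items
            WHERE status = 'pending'
              AND (claimed_by IS NULL OR lease_until < now())
            ORDER BY created
            LIMIT 10
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id, payload;
    """, {"me": instance_id})
    batch = cur.fetchall()
    conn.commit()  # if we crash later, the leases simply expire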

jjice
I've implemented queues with tables in RDBMSs a few times and it's always great and usually all you need. Worried about future scale? Make a class to wrap the queue with a decent interface and swap it for RabbitMQ or whatever you want down the road. The implementation stays opaque and you have an easy upgrade path later on.
LennyWhiteJr
On my Amazon team, we use PostgreSQL as a queue, with SKIP LOCKED, to implement the transactional outbox pattern for our database inserts. People commenting 'just use a queue' are totally missing the need for transactional consistency. I agree with the author; it's an amazing tool and scales quite well.
jarofgreen
> I’d love to see more neoq-like libraries for languages other than Go.

Python has Celery, but maybe the author is looking for more choice between brokers. https://docs.celeryq.dev/en/stable/index.html

Xeoncross
> As an industry, we’ve become absolutely obsessed with “scale”.

I wish the industry was even half as concerned with efficiency as it was with scale. Bitcoin? Electron? 5MB web pages? 5/10/25GB downloads to run 20yr old CD-sized games on modern software?

AlexITC
I have been involved in a few projects using Postgres-backed queues for a few years; scale hasn't been a problem so far.

On the other hand, I have done a few experiments with Postgres LISTEN/NOTIFY, and while the feature seems nice at first glance, I concluded that it wasn't worth it for our use cases. Maybe it is different in other languages, but on the JVM you have to dedicate a thread to polling for these notifications, which also keeps a connection busy.

What I ended up doing is leveraging akka-stream to stream the queue data directly from the db, which makes it simple to define throttling rules. This is super simple and effective.

simonw
One of my favourite pieces of writing about worker queues is this by Brandur Leach:

Transactionally Staged Job Drains in Postgres - https://brandur.org/job-drain

It's about the challenge of matching up transactions with queues - where you want a queue to be populated reliably if a transaction completes, and also reliably NOT be populated if it doesn't.

Brandur's pattern is to have an outgoing queue in a database table that gets updated as part of that transaction, and can then be separately drained to whatever queue system you like.
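
A rough sketch of the drain side (my shorthand, not Brandur's exact code; publish_to_real_queue stands in for whatever client you use):

    import psycopg2

    def publish_to_real_queue(payload):
        ...  # stand-in for an SQS/Sidekiq/Kafka/etc. client call

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()

    # Move staged jobs to the real queue. The DELETE only commits after
    # the hand-off, so a crash re-delivers rather than drops (i.e. you
    # get at-least-once, not exactly-once).
    cur.execute("""
        DELETE FROM staged_jobs
        WHERE id IN (
            SELECT id FROM staged_jobs
            ORDER BY id
            LIMIT 100
            FOR UPDATE SKIP LOCKED
        )
        RETURNING payload;
    """)
    for (payload,) in cur.fetchall():
        publish_to_real_queue(payload)
    conn.commit()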

nijave
If you're on a cloud provider, I'd say just use their offering. For small/medium amounts of messages (single digit millions a day) the cost will be trivial.

A lot of frameworks already have queue/job libraries with adapters (so you're not really locked in) and cloud providers are highly scalable and fault tolerant.

It seems silly to try to build into Postgres something that is already cheap and readily available, unless you find yourself in a situation where standing up additional infra is hard (embedded, certain on-premise environments).

welder
When I wrote my own background task queue I looked at Postgres, because it was already in use in the stack. Postgres would work for a simple queue, but supporting queue priorities, delayed/eta tasks, and broadcast tasks was too complicated. I decided on Redis, and it's scaled very well over the last year:

https://github.com/wakatime/wakaq

We currently process ~20 million tasks per day, and I don't have to worry about running VACUUM on my queue ;)

jack_squat
I've done the Postgres SKIP LOCKED thing at least three times and I'm currently doing it, but IMO it is actually more maintenance and overhead, not less - at least when compared with the queues made available by the major cloud providers. Compared with Pub/Sub or SQS you need to handle:

* Metrics, monitoring, alarming on depth, message age

* Autoscaling on your custom metrics

* Managing retries, dead lettering, backoff

* Managing the DB workload: it's update-heavy and may easily be more intensive than the rest of your app. You may need to repeatedly scale your tiny, startup-scale DB, causing customer disruptions, because of your homemade queue.

The arguments for it are either avoidance of lock-in, or the hand-wavy one in this article/many comments: "we're a small startup, we need to minimize complexity, and make things as simple as possible, maintenance will kill us!".

Lock-in makes sense (though other queue alternatives should still be considered), but the latter argument gets used to justify all kinds of harebrained, superficially-simple but actually-complicated engineering schemes. I generally put the PG skip locked approach in that bucket, particularly when the alternative on hand is Pubsub or SQS. If it's between a Postgres table and ActiveMQ I might feel more conflicted.

insanitybit
Something that always bothers me about "Use Postgres as a queue" (something I would suggest, even) is that there are shockingly few people publishing numbers around this. How am I supposed to know what scale I can handle with this solution? I've seen so, so few benchmarks, and maybe no recent benchmarks - a problem since performance has significantly changed across versions.
sgarland
I just want to commend OP - if they’re here - for choosing an int64 for job IDs, and MD5 for hashing the payload in Neoq, the job library linked [0] from the article.

Especially given the emphasis on YAGNI, you don’t need a UUID primary key, with all the problems it brings for B+trees (that thing an RDBMS is built on), nor do you need the collision resistance of SHA256 - the odds of creating a duplicate job hash with MD5 are vanishingly small.

As to the actual topic, it’s fine IFF you carefully monitor for accumulating dead tuples, and adjust auto-vacuum for that table as necessary. While not something you’d run into at the start, at a modest scale you may start to see issues. May. You may also opt to switch to Redis or something else before that point anyway.

EDIT: if you choose ULID, UUIDv7, or some other k-sortable key, the problem isn’t nearly as bad, but you still don’t need it in this situation. Save yourself 8 bytes per key.

[0]: https://github.com/acaloiaro/neoq

polyrand
Some time ago, I wrote a queue using SQLite[0]. Instead of SKIP LOCKED, you can use RETURNING to lock-and-read a message and ensure only one worker is going to pick it up:

  UPDATE ... SET status = 'locked' ... RETURNING message_id

Or you can just use an IMMEDIATE transaction, SELECT the next message ID to retrieve, and UPDATE the row.

On top of that, if you want to be extra safe, you can do:

  UPDATE Queue SET status = 'locked' WHERE status = 'ready' AND message_id = '....'

To make sure that the message you are trying to retrieve hasn't already been locked by another worker.
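
Putting it together, a minimal runnable version looks roughly like this (litequeue's actual schema differs; RETURNING needs SQLite >= 3.35):

    import sqlite3

    conn = sqlite3.connect("queue.db", isolation_level=None)  # autocommit
    conn.execute("""
        CREATE TABLE IF NOT EXISTS Queue (
            message_id INTEGER PRIMARY KEY,
            status     TEXT NOT NULL DEFAULT 'ready',
            body       TEXT NOT NULL
        )
    """)

    # Claim exactly one ready message; the single UPDATE is atomic, so
    # two workers can never win the same row.
    row = conn.execute("""
        UPDATE Queue SET status = 'locked'
        WHERE message_id = (
            SELECT message_id FROM Queue
            WHERE status = 'ready'
            ORDER BY message_id LIMIT 1
        )
        RETURNING message_id, body
    """).fetchone()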

[0]: https://github.com/litements/litequeue/

[1]: https://github.com/litements/litequeue/blob/3fece7aa9e9a31e4...

n_time
Could it be people are choosing over-engineered solutions because AWS has bad documentation and that's what the solutions architects tell them to do?

Could it be because microservices and so-called "server-less" have been sold as cost-saving measures that increase the business' flexibility and decrease capital investment?

When... in reality a single deployed Docker container is way more manageable than a distributed system constructed with "lambdas" and requires fewer engineers in the long run?

What I'm trying to say is that FAANG cargo-culting is only part of why developers choose to build solutions that scale larger than they need. Another large part is the cloud development ecosystem writ large and the consulting culture that has built up around it.

fasteo
Over the years I have used beanstalkd for all things queue. Working flawlessly every single time.

This is the stats output from our oldest instance (a legacy system running Ubuntu 12):

  Trying 127.0.0.1...
  Connected to localhost.
  Escape character is '^]'.
  stats
  OK 952
  ---
  current-jobs-urgent: 0
  current-jobs-ready: 765
  current-jobs-reserved: 2
  current-jobs-delayed: 946
  current-jobs-buried: 0
  cmd-put: 1188640739
  cmd-peek: 2718986
  cmd-peek-ready: 5052
  cmd-peek-delayed: 797
  cmd-peek-buried: 797
  cmd-reserve: 0
  cmd-reserve-with-timeout: 3245006799
  cmd-delete: 1188639093
  cmd-release: 43276760
  cmd-use: 1137988211
  cmd-watch: 166122
  cmd-ignore: 166077
  cmd-bury: 0
  cmd-kick: 1
  cmd-touch: 0
  cmd-stats: 1260
  cmd-stats-job: 43134828
  cmd-stats-tube: 53942209
  cmd-list-tubes: 3251625
  cmd-list-tube-used: 0
  cmd-list-tubes-watched: 0
  cmd-pause-tube: 988
  job-timeouts: 14084
  total-jobs: 1188640739
  max-job-size: 1048576
  current-tubes: 44
  current-connections: 63
  current-producers: 20
  current-workers: 47
  current-waiting: 41
  total-connections: 14996583
  pid: 3959
  version: 1.4.6
  rusage-utime: 170303.331293
  rusage-stime: 399435.543161
  uptime: 321658179
  binlog-oldest-index: 90539
  binlog-current-index: 90983
  binlog-max-size: 10485760
rockwotj
All of shortwave.com is built on this concept. The super powerful bit here, as mentioned, is that enqueuing jobs is transactional with your other normal transactions, so you don't have to architect around using a different system for the queue and the rest of your data.
alexbezhan
I'm using SQLite as my main database. Is it a bad idea to build queues on top of SQLite?
andrewstuart
MS SQL Server, Postgres, and MySQL all support SKIP LOCKED semantics, which means they are all suitable for running queues.

I built a complete implementation in Python designed to work the same as SQS but be simpler:

https://github.com/starqueue/starqueue

Alternatively if you just want to quickly hack something into your application, here is a complete solution in one Python function with retries (ask ChatGPT to tell you what the table structure is):

    import psycopg2
    import psycopg2.extras
    import random
    
    db_params = {
        'database': 'jobs',
        'user': 'jobsuser',
        'password': 'superSecret',
        'host': '127.0.0.1',
        'port': '5432',
    }
    
    conn = psycopg2.connect(**db_params)
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    
    
    def do_some_work(job_data):
        if random.choice([True, False]):
            print('do_some_work FAILED')
            raise Exception
        else:
            print('do_some_work SUCCESS')
    
    def process_job():
        sql = """DELETE FROM message_queue
        WHERE id = (
            SELECT id
            FROM message_queue
            WHERE status = 'new'
            ORDER BY created ASC
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        RETURNING *;
        """
        cur.execute(sql)
        queue_item = cur.fetchone()
        if queue_item is None:
            conn.commit()
            print('queue is empty, nothing to do')
            return
        print('message_queue says to process job id: ', queue_item['target_id'])
        sql = """SELECT * FROM jobs WHERE id = %s AND status = 'new_waiting' AND attempts <= 3 FOR UPDATE;"""
        cur.execute(sql, (queue_item['target_id'],))
        job_data = cur.fetchone()
        if job_data:
            try:
                do_some_work(job_data)
                sql = """UPDATE jobs SET status = 'complete' WHERE id = %s;"""
                cur.execute(sql, (queue_item['target_id'],))
            except Exception:
                sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id = %s;"""
                # if we want the job to run again, insert a new item into the message queue with this job id
                cur.execute(sql, (queue_item['target_id'],))
        else:
            print('no job found, did not get job id: ', queue_item['target_id'])
        conn.commit()
    
    process_job()
    cur.close()
    conn.close()
gazpacho
I feel like one of the problems with using Postgres as a queue is that it's hard to get started. There's a lot you need to know. Getting started with something like Pub/Sub on GCP is much easier for many developers.

I’ve experimented with making this easier via libraries that provide high-level APIs for using Postgres as a queue and manage the schemas, listen/notify, etc for you: https://github.com/adriangb/pgjobq

skybrian
It seems like listen/notify doesn't play well with a serverless architecture. Would it make sense for Postgres to make a web request when there's work in the queue? Is that a thing?
whartung
For those using simple SELECTs, what kind of WHERE clause are you using that works well with lots of qualified pending messages and (somewhat) guarantees the most appropriate (oldest?) message?
winternewt
The official JDBC driver for PostgreSQL provides no way for the database to push events to the client. So you need a dedicated connection to continuously poll the database to see if there are any events available. This seems half-baked and does not give me the warm and fuzzy feeling I crave when making architectural choices. Not to mention, it causes undesirable latency in event delivery unless I flood the database with queries.
encoderer
This has its place. I would gladly use Postgres queues in places where I use redis queues today; but I would not consider replacing my SQS queues with Postgres.
menacingly
I've been thinking a lot lately about how much of tech, and life more broadly, is ruined by the pursuit of scale.

Taking on problems you don't (and will never) have because some vanishingly small minority has experienced them is nuts. Over-engineering is as incorrect as under-engineering. The correctly sized, correctly complicated answer is what we're after.

TalktoCrystal
I think this may help: a control plane for Postgres, Kafka, and Pulsar. https://github.com/apecloud/kubeblocks/blob/main/docs/releas...
ComodoHacker
>Postgres queue tech is a thing of beauty, but far from mainstream.

A small nitpick. FOR UPDATE SKIP LOCKED was mainstream (in a sense) way before it was copied from Oracle to Postgres. It's used under the hood in most non-trivial Oracle deployments. Like for replication or refreshing materialized views in parallel.

Rapzid
Not sure this particular article brings anything new to the table, but it's nice to spread the word.

Been meaning to build an Orleans stream provider for Postgres. I believe that's the main missing component that would allow everything to "JustWork" with Postgres until you outgrow it.

jgavris
Before SKIP LOCKED was released, I wrote a hacky version of it for a job queue using advisory locks and a recursive CTE. It worked splendidly, along with the transactional semantics of a proper database. I’m surprised more systems don’t realize they need it.
kissgyorgy
Here is a Python example how to use it: https://gist.github.com/kissgyorgy/beccba1291de962702ea9c237...
pyrophane
I don't have a problem with this approach at all, but I would argue that for many use cases a redis backed job library like Celery or Sidekiq might be even easier, especially if you are on a cloud provider that offers managed redis.
jasongi
I’m not against using Postgres for this. But I am against rolling your own distributed task queue. It always seems like a simple task but snowballs in complexity. Any gains you get from simplifying your stack will be wiped out by the fact that things like Celery (for example) don’t support using Postgres as a broker, so now you have to build your own DIY Celery instead of, say, just using Celery with the SQS broker (which… since we’ve established scale isn’t being considered here, SQS costs shouldn’t be an issue either).

Anyone know if there are Celery or Celery-like tools that support Postgres as a broker?

As a side note, if you want a simple no-frills task scheduler, APScheduler is a dead simple option. It's even more limited than the solution described in the OP (you can only run one worker, so it's not distributed at all), but often it is all you need, especially for toy projects.

mildavw
For Rails apps, you can do this using the ActiveJob interface via

https://github.com/bensheldon/good_job

Had it in production for about a quarter and it’s worked well.

ezekiel68
I hate to point out something uncomfortable, but this guy's combative writing style and penchant for abstract art reminds me of the articles by a certain infamous creator of the ReiserFS file system for linux.
foreigner
I'm in the market for a Postgres-backed queue system with client libraries in NodeJS _and_ Python. Clients in both languages need to be able to read and write from the queue. Can anybody suggest one?
Ericson2314
Even more fun than notify/listen is using the write-ahead log! :D
xpe
> I hope to disimbue anyone of the notion that Postgres is an inferior queue technology.

I offer this correction to the author: it is "disabuse" not "disimbue".

dikei
IIRC, LISTEN/NOTIFY needs to pin a PostgreSQL connection to the client, so you won't be able to use transaction-level pooling with it.
evil-olive
there's an important dimension of scalability that I think gets overlooked in a lot of these discussions about database-as-a-queue vs queue-system-as-a-queue:

are you queuing jobs, or are you queuing messages?

that's a fuzzy distinction, so somewhat equivalently, what's the expected time it takes for a worker to process a given queue item?

at one end, an item on the queue may take several seconds to a minute or longer to process. at the other end, an item might take only a few milliseconds to process. in that latter case, it's often useful to do micro-batching, where a single worker pulls 100 or 1000 items off the queue at once, and processes them as a batch (such as by writing them to a separate datastore)
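
a sketch of that batch claim (illustrative names; delete-on-claim, and the delete only becomes visible at commit, so a crash mid-batch just puts the rows back):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()

    # claim up to 1000 messages in one statement instead of paying a
    # round-trip per message
    cur.execute("""
        DELETE FROM messages
        WHERE id IN (
            SELECT id FROM messages
            ORDER BY id
            LIMIT 1000
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id, payload;
    """)
    batch = cur.fetchall()
    # ... write the whole batch to the downstream datastore ...
    conn.commit()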

the "larger" the items are (in terms of wall-clock processing time, not necessarily in terms of size in bytes of the serialized item payload) the more effective the database-as-a-queue solution is, in my experience.

as queue items get smaller / shorter to process, and start to feel more like "messages" rather than discrete "jobs", that's when I tend to reach for a queue system over a database.

for example, there's a RabbitMQ blog post [0] on cluster sizing where their recommendations start at 1000 messages/second. that same message volume on a database-as-a-queue would require, generally speaking, 3000 write transactions per second (if we assume one transaction to enqueue the message, one for a worker to claim it, and one for a worker to mark it as complete / delete it).

can Postgres and other relational databases be scaled & tuned to handle that write volume? yes, absolutely. however, how much write volume are you expecting from your queue workload, compared to the write volume from its "normal database" workload? [1]

I think that ends up being a useful heuristic when deciding whether or not to use a database-as-a-queue - will you have a relational database with a "side gig" of acting like a queue, or will you have a relational database that in terms of data volume is primarily acting like a queue, with "normal database" work relegated to "side gig" status?

0: https://blog.rabbitmq.com/posts/2020/06/cluster-sizing-and-o...

1: there's also a Postgres-specific consideration here where a lot of very short-lived "queue item" database rows can put excessive pressure on the autovacuum system.

snicker7
I wonder how this is actually simpler than a managed queue, say SQS.
dangoodmanUT
It's interesting that Temporal is not addressed in this article.
debuggerpk
on a sidenote, what is the theme on this blog?
donatj
We have our own queue, because it was easy, fun, and has been exceedingly reliable above all else. Far more so than other things we had tried. Cough Gearman cough SQS cough

One endpoint accepts work for a named queue and writes it to a file in an XFS directory. Another locks a mutex, moves the file to an in-progress directory, and unlocks the mutex before passing the content to the reader. A third and final endpoint deletes the in-progress job file. There is a configurable timeout, after which jobs end up in a dead-letter box. I am simplifying only a little bit. It's a couple hundred lines of Go.

The way this is set up means a message will only ever be handed to one worker. That simplifies things a lot. The workers ask for work when they want it, rather than being constantly listening.

It took a little tuning but we process a couple billion events a day this way and it's been basically zero maintenance for almost 10 years. The wizards in devops even figured out a way to autoscale it.

what-no-tests
Really unimpressed with so many people thinking it's OK to shoehorn their database into service as a message queue.

Use the right tool for the job.

HeavyStorm
Terrible idea. Using a database for queues means using a storage format that's not optimized for this, so unless you never delete rows (effectively keeping an always-growing table), you'll run into performance issues. Also, the index on a processed flag will cause contention.

I'm certain you can work around those issues, but why bother when you can use a proper queue?

liminal
My main issue with pretty much all queue approaches is that they don't work across platforms. They are built for one technology stack, be it Python/NodeJS/etc. This is fine if you've only got one stack, but in a microservices world it doesn't work where jobs can span multiple systems. You might be able to find some abandoned library that supports that queue tech on the other platforms you need, but now you've basically become a queue tech maintainer.
s17n
I don't usually downvote posts, but this article is just garbage - a rant about "the cargo cult of scale" with no actual arguments as to why Postgres is better than Redis or SQS or anything else. The main reason people don't use Postgres for this kind of thing isn't some misguided obsession with scalability; it's that Postgres is way more complicated to deploy/manage and harder for application developers to use than any of the other options.