
I had an interesting adventure last night: The phone rang at 22:19, and then I spent a couple of hours with two colleagues assessing the situation, trying to identify what the actual problem was, and setting up a workaround until the root cause could be fixed.

My on-call colleague received notice that mail was either not arriving at all or arriving considerably delayed. After some troubleshooting from the user side inwards, he realized the queues on our on-prem Exchange server were going bananas. We constantly had about a thousand emails apparently moving from the submission queue to the delivery queues and back again. That’s when I got involved.

Opening the queues, I saw that we had a mix of emails containing alerts from an old system, let’s call it BMC, interspersed among hundreds of non-delivery reports (NDRs). The recipient addresses existed neither on-prem nor in O365, and that was part of the explanation for the wild amount of email: BMC was sending alerts about something unexpected to our on-prem Exchange. Our on-prem Exchange could not find the recipients and asked our O365 tenant whether it recognized the addresses. O365 accepted the messages, then realized it didn’t have the recipients either and sent the email back along with an NDR, causing our on-prem Exchange server to respond with the email again, plus an NDR of its own. The number of messages going back and forth multiplied until our on-prem Exchange system simply couldn’t cope with the load of getting spammed by, seemingly, every O365 transport server in our region.

The nature of mis-addressed email is that it will time out, sooner or later, but when your mail server is barely keeping it together, later feels like a really long time to wait. To make the problem go away sooner, I would need to make sure the messages got accepted somewhere, to end the incessant NDRs. But I had no idea how many messages to expect: this email multiplication had apparently been going on for hours. That meant I would also have to purge the incoming emails so I didn’t risk filling up important mail server volumes.

A quick DuckDuckGo search later, I had a plan (see the PowerShell sketch after the list):

  • A new email account was set up and assigned an unused mailbox database on an unused storage volume.
  • The mailbox database was configured to not store deleted items, and to not wait for the next successful backup before purging deleted data.
  • I wrote a minimal PowerShell script with the following two lines:
# Load the Exchange cmdlets into the plain PowerShell session the scheduled task runs in
Add-PSSnapin Microsoft.Exchange.Management.PowerShell.SnapIn
# Empty the temporary mailbox without prompting for confirmation
Search-Mailbox -Identity tempmailbox -DeleteContent -Force
  • I then created a scheduled task that ran the script every minute.
  • Finally I assigned the two previously missing email addresses to this mailbox and confirmed that email started arriving, and that it was periodically purged.
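
For the curious, the setup looks roughly like this in the Exchange Management Shell. Treat it as a sketch rather than a transcript of what I ran: the database name, addresses, script path, and task account below are placeholders, and the scheduled task cmdlets assume Windows Server 2012 R2 or later.

# Don't keep deleted items, and don't wait for a backup before purging them
# ("TempDB" is a placeholder for the unused mailbox database)
Set-MailboxDatabase -Identity "TempDB" -DeletedItemRetention "0.00:00:00" -RetainDeletedItemsUntilBackup $false

# Create the throwaway mailbox in that database
New-Mailbox -Name "tempmailbox" -UserPrincipalName tempmailbox@example.com `
    -Database "TempDB" -Password (ConvertTo-SecureString 'SomeLongRandomPassword!' -AsPlainText -Force)

# Run the purge script every minute (path, duration and account are placeholders)
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-NoProfile -File C:\Scripts\Purge-TempMailbox.ps1'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 1) -RepetitionDuration (New-TimeSpan -Days 2)
Register-ScheduledTask -TaskName 'Purge temp mailbox' -Action $action -Trigger $trigger -User 'DOMAIN\svc-exchange' -Password 'ServiceAccountPassword'

# Give the mailbox the two previously missing addresses so the alerts finally have somewhere to land
Set-Mailbox -Identity tempmailbox -EmailAddresses @{add='alert-recipient-1@example.com','alert-recipient-2@example.com'}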

After that I kept an eye on the Exchange mail queues. The behavior was a bit scary, but it all worked out well in the end. It turns out that in our environment, Exchange is able to deliver about 1300 emails per minute to a single mailbox; that process seems to be single-threaded per mailbox database, if I’m not mistaken. But Microsoft was sending us email considerably faster than that, so at any one time the Exchange submission queue held around a thousand emails, and the mailbox database delivery queue held 1000-1500. For a while there, I purged the delivery queue for the database - again, this was the only mailbox in that database - but after a while I figured I couldn’t do a lot more, and that I’d be a lot more useful after a few hours of sleep.
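
If you ever find yourself in the same spot, the queue watching and purging can be done with the standard transport queue cmdlets. The server name and the filter below are assumptions - adjust them to whatever your stuck messages actually look like:

# See which queues are growing (EX01 is a placeholder for the affected server)
Get-Queue -Server EX01 | Sort-Object MessageCount -Descending |
    Select-Object Identity, Status, MessageCount, NextHopDomain

# Drop the stuck alert messages from the queues without generating yet another NDR
# (the FromAddress filter is an assumption about how the BMC alerts were addressed)
Remove-Message -Server EX01 -Filter "FromAddress -eq 'bmc-alerts@example.com'" -WithNDR $false -Confirm:$false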

In the morning, the mail storm had abated, and I could turn off the scheduled script and have a developer fix the root cause. I’ll wait another day or two before removing the user and purging the mailbox database: I know what I told Exchange about deleted item retention, but I want to make sure I don’t stumble over some weird edge case.
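
When that day comes, the cleanup should be roughly this, again with placeholder names:

# Confirm nothing is left in the temporary database before tearing it down
Get-MailboxStatistics -Database "TempDB" | Select-Object DisplayName, ItemCount, TotalItemSize

# Remove the throwaway user and its mailbox, then the database itself
# (Remove-MailboxDatabase refuses to run while the database still holds mailboxes,
#  and it leaves the .edb and log files on disk for manual deletion)
Remove-Mailbox -Identity tempmailbox
Remove-MailboxDatabase -Identity "TempDB"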